
Maven and JUnit Project – Extracting Content and Metadata via Apache Tika

In the software industry, content is shared and transported via documents in many formats such as TXT, XLS, and PDF, and sometimes even MP4. Since so many formats are in use, a common way to extract the content and metadata from them is needed. This is possible via Apache Tika, a powerful and versatile library for content analysis. As an introduction, let us see how parsing can be done and how the contents, the nature of the document, and more can be obtained by going through the features of Apache Tika, using a sample Maven project.

Example Maven Project

Project Structure:



 

First and foremost, we need the dependency required for Apache Tika, which must be specified in pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>

For our project, let us see all the dependencies via pom.xml:

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>apache-tika</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>apache-tika</name>
  
    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
  
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>${tika.version}</version>
        </dependency>
    </dependencies>
  
    <properties>
        <tika.version>1.17</tika.version>
    </properties>
  
</project>

The heart of Apache Tika is the Parser API. While parsing documents, mostly Apache POI or PDFBox is used under the hood. The central method is:

void parse(
    InputStream inputStream,       // the input document that needs to be parsed
    ContentHandler contentHandler, // handler that exports the processed result in a particular form
    Metadata metadata,             // metadata properties
    ParseContext parseContext      // for customizing the parsing process
) throws IOException, SAXException, TikaException

Document type detection can be done by using an implementation class of the Detector interface. The following method is available there:

MediaType detect(java.io.InputStream inputStream, Metadata metadata) throws IOException

Language detection can also be done by Tika, and the language is identified without the help of metadata information. Now, via the sample project's Java files, let us cover these topics as well.
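Since the sample project below focuses on type detection and content/metadata extraction, here is a minimal, separate sketch of language detection. It assumes Tika 1.x, where the LanguageIdentifier class ships with tika-core; the class name and sample text here are illustrative:

```java
import org.apache.tika.language.LanguageIdentifier;

public class SampleLanguageDetection {
    public static void main(String[] args) {
        // LanguageIdentifier works on the raw text itself; no metadata is needed
        String text = "This sentence is written in plain English to demonstrate language detection.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() returns an ISO 639 code such as "en"
        System.out.println(identifier.getLanguage());
    }
}
```

LanguageIdentifier matches the text against pre-built n-gram profiles, so longer inputs yield more reliable results; note that in Tika 2.x this class was removed in favor of the tika-langdetect modules.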

SampleTikaAnalysis.java

In this program, document type detection, content extraction, and metadata extraction are each handled in two ways: via the Detector/Parser APIs and via the Tika facade.




import java.io.IOException;
import java.io.InputStream;
  
import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
  
public class SampleTikaAnalysis {
    // Detecting the document type by using Detector
    public static String detectingTheDocTypeByUsingDetector(InputStream inputStream) throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
  
        MediaType mediaType = detector.detect(inputStream, metadata);
        return mediaType.toString();
    }
    
    // Detecting the document type by using Facade
    public static String detectDocTypeUsingFacade(InputStream inputStream) throws IOException {
        Tika tika = new Tika();
        String mediaType = tika.detect(inputStream);
        return mediaType;
    }
  
    public static String extractContentUsingParser(InputStream inputStream) throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
  
        parser.parse(inputStream, contentHandler, metadata, context);
        return contentHandler.toString();
    }
  
    public static String extractContentUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        String content = tika.parseToString(inputStream);
        return content;
    }
  
    public static Metadata extractMetadatatUsingParser(InputStream inputStream) throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
  
        parser.parse(inputStream, contentHandler, metadata, context);
        return metadata;
    }
  
    public static Metadata extractMetadatatUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
  
        tika.parse(inputStream, metadata);
        return metadata;
    }
}

Let us test the above concepts by taking three documents, namely exceldocument.xlsx, pdfdocument.txt, and worddocument.docx. They should be placed under the src/test/resources folder so that they can be read from the classpath as shown in the code. Let us test the contents now via

SampleTikaWayUnitTest.java




import static org.hamcrest.CoreMatchers.containsString;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThat;
  
import java.io.IOException;
import java.io.InputStream;
  
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.junit.Test;
import org.xml.sax.SAXException;
  
public class SampleTikaWayUnitTest {
    
    @Test
    public void withDetectorFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectingTheDocTypeByUsingDetector(inputStream);
  
        assertEquals("application/pdf", resultantMediaType);
  
        inputStream.close();
    }
  
    @Test
    public void withFacadeFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectDocTypeUsingFacade(inputStream);
  
        assertEquals("application/pdf", resultantMediaType);
  
        inputStream.close();
    }
  
    @Test
    public void byUsingParserAndGettingContent() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingParser(inputStream);
  
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingFacadeAndGettingContent() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingFacade(inputStream);
  
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingParserAndGettingMetadata() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingParser(inputStream);
  
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingFacadeAndGettingMetadata() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingFacade(inputStream);
  
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
  
        inputStream.close();
    }
}

Output of JUnit test case:

 

Conclusion

Apache Tika is a wonderful, versatile content-analysis library used across the software industry for a variety of purposes.

