
Maven and JUnit Project – Extracting Content and Metadata via Apache Tika

In the software industry, content is shared and transported via documents in many formats such as TXT, XLS, and PDF, and sometimes even MP4. Since so many formats are in use, a common way to extract the content and metadata from them is needed. This is possible via Apache Tika, a powerful and versatile library for content analysis. As an introduction, let us see how parsing can be done and how the contents, the nature of the document, and more can be obtained by going through the features of Apache Tika, using a sample Maven project.

Example Maven Project

Project Structure:



 

First and foremost, we need the dependency required for Apache Tika, which must be specified in pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>

For our project, let us see all the dependencies via pom.xml:

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <artifactId>apache-tika</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>apache-tika</name>
  
    <parent>
        <groupId>com.gfg</groupId>
        <artifactId>parent-modules</artifactId>
        <version>1.0.0-SNAPSHOT</version>
    </parent>
  
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-parsers</artifactId>
            <version>${tika.version}</version>
        </dependency>
    </dependencies>
  
    <properties>
        <tika.version>1.17</tika.version>
    </properties>
  
</project>

The heart of Apache Tika is the Parser API. While parsing documents, mostly Apache POI or PDFBox is used under the hood. The central method is:

void parse(
    InputStream inputStream,       // the input document that needs to be parsed
    ContentHandler contentHandler, // handler that exports the processed result in a particular form
    Metadata metadata,             // metadata properties
    ParseContext parseContext      // for customizing the parsing process
) throws IOException, SAXException, TikaException

Document type detection can be done by using an implementation class of the Detector interface. The following method is available there:

MediaType detect(java.io.InputStream inputStream, Metadata metadata) throws IOException

Language detection can also be done by Tika, and the language is identified without the help of metadata information. Now, via the sample project's Java files, let us cover these topics as well.
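Since the sample project below focuses on type detection and content/metadata extraction, here is a minimal, separate sketch of language detection. It assumes Tika 1.x, where the LanguageIdentifier class ships with tika-core; the class name and sample text here are illustrative:

```java
import org.apache.tika.language.LanguageIdentifier;

public class SampleLanguageDetection {
    public static void main(String[] args) {
        // LanguageIdentifier works on the raw text itself; no metadata is needed
        String text = "This sentence is written in plain English to demonstrate language detection.";
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() returns an ISO 639 code such as "en"
        System.out.println(identifier.getLanguage());
    }
}
```

LanguageIdentifier matches the text against pre-built n-gram profiles, so longer inputs yield more reliable results; note that in Tika 2.x this class was removed in favor of the tika-langdetect modules.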

SampleTikaAnalysis.java

In this program, document type detection, content extraction, and metadata extraction are each handled in two ways: via the Detector/Parser APIs and via the Tika facade.




import java.io.IOException;
import java.io.InputStream;
  
import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
  
public class SampleTikaAnalysis {
    // Detecting the document type by using Detector
    public static String detectingTheDocTypeByUsingDetector(InputStream inputStream) throws IOException {
        Detector detector = new DefaultDetector();
        Metadata metadata = new Metadata();
  
        MediaType mediaType = detector.detect(inputStream, metadata);
        return mediaType.toString();
    }
    
    // Detecting the document type by using Facade
    public static String detectDocTypeUsingFacade(InputStream inputStream) throws IOException {
        Tika tika = new Tika();
        String mediaType = tika.detect(inputStream);
        return mediaType;
    }
  
    public static String extractContentUsingParser(InputStream inputStream) throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
  
        parser.parse(inputStream, contentHandler, metadata, context);
        return contentHandler.toString();
    }
  
    public static String extractContentUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        String content = tika.parseToString(inputStream);
        return content;
    }
  
    public static Metadata extractMetadatatUsingParser(InputStream inputStream) throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        ContentHandler contentHandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
  
        parser.parse(inputStream, contentHandler, metadata, context);
        return metadata;
    }
  
    public static Metadata extractMetadatatUsingFacade(InputStream inputStream) throws IOException, TikaException {
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
  
        tika.parse(inputStream, metadata);
        return metadata;
    }
}

Let us test the above concepts by taking three documents, namely exceldocument.xlsx, pdfdocument.txt, and worddocument.docx. They should be placed under the src/test/resources folder so that they can be read from the classpath as shown in the code. Let us test the contents now via

SampleTikaWayUnitTest.java




import static org.hamcrest.CoreMatchers.containsString;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertThat;
  
import java.io.IOException;
import java.io.InputStream;
  
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.junit.Test;
import org.xml.sax.SAXException;
  
public class SampleTikaWayUnitTest {
    
    @Test
    public void withDetectorFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectingTheDocTypeByUsingDetector(inputStream);
  
        assertEquals("application/pdf", resultantMediaType);
  
        inputStream.close();
    }
  
    @Test
    public void withFacadeFindingTheResultTypeAsDocumentType() throws IOException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("pdfdocument.txt");
        String resultantMediaType = SampleTikaAnalysis.detectDocTypeUsingFacade(inputStream);
  
        assertEquals("application/pdf", resultantMediaType);
  
        inputStream.close();
    }
  
    @Test
    public void byUsingParserAndGettingContent() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingParser(inputStream);
  
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingFacadeAndGettingContent() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("worddocument.docx");
        String documentContent = SampleTikaAnalysis.extractContentUsingFacade(inputStream);
  
        assertThat(documentContent, containsString("OpenSource REST API URL"));
        assertThat(documentContent, containsString("Spring MVC"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingParserAndGettingMetadata() throws IOException, TikaException, SAXException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingParser(inputStream);
  
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
  
        inputStream.close();
    }
  
    @Test
    public void byUsingFacadeAndGettingMetadata() throws IOException, TikaException {
        InputStream inputStream = this.getClass().getClassLoader().getResourceAsStream("exceldocument.xlsx");
        Metadata retrieveMetadata = SampleTikaAnalysis.extractMetadatatUsingFacade(inputStream);
  
        assertEquals("org.apache.tika.parser.DefaultParser", retrieveMetadata.get("X-Parsed-By"));
        assertEquals("Microsoft Office User", retrieveMetadata.get("Author"));
  
        inputStream.close();
    }
}

Output of JUnit test case:

 

Conclusion

Apache Tika is a wonderful, versatile content-analysis library used across the software industry for a variety of purposes.

