Maven and JUnit Project – Extracting Content and Metadata via Apache Tika
In the software industry, content is exchanged in documents of many formats, such as TXT, XLS, or PDF, and sometimes even MP4. With so many formats in use, a common way to extract content and metadata from all of them is needed. Apache Tika, a powerful and versatile content-analysis library, provides exactly that. As an introduction, let us walk through the features of Apache Tika to see how documents can be parsed and their content, type, and other properties obtained, using a sample Maven project.
Example Maven Project
First and foremost, we need the dependency required for Apache Tika, which must be specified in pom.xml:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.17</version>
</dependency>
For our project, all resolved dependencies can be listed with Maven's standard dependency:tree goal.
The heart of Apache Tika is the Parser API. While parsing documents, libraries such as Apache POI (for Microsoft Office formats) or PDFBox (for PDFs) are mostly used under the hood. The central method is:
void parse(
    InputStream inputStream,       // the input document that needs to be parsed
    ContentHandler contentHandler, // handler that processes and exports the result in a particular form
    Metadata metadata,             // metadata properties
    ParseContext parseContext      // for customizing the parsing process
) throws IOException, SAXException, TikaException
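As a sketch of how these pieces fit together, the snippet below parses one of the sample documents with AutoDetectParser, which delegates to the right concrete parser (POI, PDFBox, and so on) based on the detected type. The file path mirrors the src/test/resources layout used later in this article and is an assumption; adjust it to your project.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ParserExample {
    public static void main(String[] args) throws Exception {
        // AutoDetectParser picks the appropriate concrete parser for the stream
        AutoDetectParser parser = new AutoDetectParser();
        // BodyContentHandler collects the extracted plain text (default limit: 100,000 chars)
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();

        // Path to the sample file from this article; assumed location
        try (InputStream stream =
                 Files.newInputStream(Paths.get("src/test/resources/worddocument.docx"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        System.out.println("Content      : " + handler.toString());
        System.out.println("Content-Type : " + metadata.get(Metadata.CONTENT_TYPE));
    }
}
```

Note that the parser also fills the Metadata object as a side effect, so one parse call yields both content and metadata.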
Document type detection can be done using an implementation class of the Detector interface, which exposes the following method:
MediaType detect(java.io.InputStream inputStream, Metadata metadata) throws IOException
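A minimal detection sketch, assuming Tika 1.17 on the classpath: DefaultDetector combines magic-byte and MIME-type detection, so a stream starting with the %PDF- magic bytes is identified as a PDF even without a file name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class DetectorExample {
    public static void main(String[] args) throws Exception {
        Detector detector = new DefaultDetector();
        // The "%PDF-" magic bytes alone are enough for content-based detection;
        // ByteArrayInputStream supports mark/reset, as the Detector contract requires
        byte[] pdfMagic = "%PDF-1.4".getBytes("US-ASCII");
        try (InputStream stream = new ByteArrayInputStream(pdfMagic)) {
            MediaType type = detector.detect(stream, new Metadata());
            System.out.println(type); // application/pdf
        }
    }
}
```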
Language detection can also be done by Tika; the language is identified from the text itself, without the help of metadata information. Now let us cover these topics through the sample project's Java files.
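One way to try this, as a sketch, is tika-core's LanguageIdentifier (available in 1.17, though later versions favour the tika-langdetect module). It works purely on the supplied text:

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageExample {
    public static void main(String[] args) {
        // LanguageIdentifier needs only the text itself; no metadata is involved
        LanguageIdentifier identifier =
            new LanguageIdentifier("Hello, this paragraph is written in plain English text.");
        // getLanguage() returns an ISO 639 code such as "en"
        System.out.println(identifier.getLanguage());
    }
}
```

The identification is n-gram based, so very short strings may be classified less reliably than full paragraphs.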
In this program, the following tasks are handled, each in two ways (the low-level detector/parser APIs and the Tika facade):
- Detecting document type
- Extracting the content using a parser and facade
- Extracting the metadata using a parser and facade
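The Tika facade condenses the detector and parser calls above into one-liners. A minimal sketch, again assuming the sample file location from this article:

```java
import java.io.File;

import org.apache.tika.Tika;

public class FacadeExample {
    public static void main(String[] args) throws Exception {
        // The Tika facade wraps detection and parsing behind two simple calls
        Tika tika = new Tika();
        File document = new File("src/test/resources/worddocument.docx"); // assumed path

        String mediaType = tika.detect(document);       // detected MIME type
        String content   = tika.parseToString(document); // extracted plain text

        System.out.println("Type    : " + mediaType);
        System.out.println("Content : " + content);
    }
}
```

The facade is convenient for straightforward cases; the Parser API remains the better choice when you need custom ContentHandlers or a ParseContext.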
Let us test the above concepts using three documents, namely exceldocument.xlsx, pdfdocument.txt, and worddocument.docx. They should be placed under the src/test/resources folder so that the code can read them from that path. Let us now verify the contents via JUnit tests.
Output of the JUnit test cases:
- Test withDetectorFindingTheResultTypeAsDocumentType -> finds the document type via the Detector class and asserts that the resulting type is PDF.
- Test withFacadeFindingTheResultTypeAsDocumentType -> finds the document type via the facade class and asserts that the resulting type is PDF.
- Test byUsingParserAndGettingContent -> parses the Word document at the given path, extracts its content, and asserts the resulting text.
- Test byUsingFacadeAndGettingContent -> extracts the content of the Word document at the given path via the facade class and asserts the resulting text.
- Test byUsingParserAndGettingMetadata -> parses the Excel document at the given path, retrieves its metadata, and asserts it.
- Test byUsingFacadeAndGettingMetadata -> extracts the Excel document at the given path via the facade class, retrieves its metadata, and asserts it.
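The first two tests above could be sketched roughly as follows (JUnit 4). The resource name matches the sample files described earlier, and the expected assertion follows the article's description (pdfdocument.txt is asserted to be a PDF, implying it contains PDF bytes despite its extension), so treat the exact values as assumptions:

```java
import static org.junit.Assert.assertEquals;

import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.junit.Test;

public class TikaDocumentTest {

    // pdfdocument.txt lives under src/test/resources; per the article it is
    // detected as a PDF, so content-based detection must see PDF magic bytes
    @Test
    public void withDetectorFindingTheResultTypeAsDocumentType() throws Exception {
        try (InputStream stream =
                 TikaInputStream.get(getClass().getResourceAsStream("/pdfdocument.txt"))) {
            String type = new DefaultDetector().detect(stream, new Metadata()).toString();
            assertEquals("application/pdf", type);
        }
    }

    @Test
    public void withFacadeFindingTheResultTypeAsDocumentType() throws Exception {
        try (InputStream stream =
                 TikaInputStream.get(getClass().getResourceAsStream("/pdfdocument.txt"))) {
            assertEquals("application/pdf", new Tika().detect(stream));
        }
    }
}
```

TikaInputStream is used here because Detector.detect requires a stream that supports mark/reset.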
Apache Tika is a versatile content-analysis library used across the software industry for a wide range of purposes.