
Java Program to Extract Content from an ODF File


ODF stands for Open Document Format. It is an international family of standards that succeeds older, vendor-specific document formats such as .doc, .wpd, and .xls. ODF documents are typically smaller than equivalent files in those formats. The OpenDocumentParser class from the Apache Tika library is used to extract the content from an ODF file.
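
One reason ODF files tend to be small is that an ODF document is a ZIP package: the body text is stored in an entry named content.xml and the document properties in meta.xml. The sketch below only illustrates that layout using the standard java.util.zip classes. The class name OdfPackageSketch and its helper methods are hypothetical, and the package it builds in memory is a simplified stand-in, not a valid ODF document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdfPackageSketch {

    // Builds a tiny in-memory ZIP that mimics the entry names of an .odt
    // package. Assumption: the entry contents are placeholders, so this is
    // not a document that an office suite could open.
    public static byte[] buildSamplePackage() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            addEntry(zip, "content.xml", "<doc>body text lives here</doc>");
            addEntry(zip, "meta.xml", "<doc>author, dates and counts live here</doc>");
        }
        return bytes.toByteArray();
    }

    // Writes one text entry into the ZIP stream.
    private static void addEntry(ZipOutputStream zip, String name, String text)
            throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(text.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Returns the entry names of a ZIP package in the order they are stored.
    public static List<String> listEntries(byte[] zipBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip
                 = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEntries(buildSamplePackage())) {
            System.out.println(name);
        }
    }
}
```

Running the same listEntries logic against the bytes of a real .odt file should show similar entries, along with others such as styles.xml. Tika's parser reads these entries for you, so the program in this article never has to handle the ZIP layer directly.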

Methods used:

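  1. BodyContentHandler(): Creates a content handler that writes XHTML body character events to an internal string buffer. With this no-argument constructor the buffer is limited to 100,000 characters; BodyContentHandler(-1) removes the limit.
  2. Metadata(): Constructs a new, empty metadata object, which the parser fills in.
  3. ParseContext(): Creates a parse context object that is used to pass context information to Tika parsers.
  4. parse(): Called on the parser object with the input stream, content handler, metadata, and parse context. It reads the document, sends its text to the handler, and stores its properties in the metadata.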
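
To make the idea of a handler that writes body character events to an internal string buffer more concrete, here is a minimal sketch built only on the JDK's SAX classes. It is not Tika's implementation: BodyTextSketch, BodyTextHandler, and extractBodyText are hypothetical names, and the input is a hand-written XHTML string rather than a parsed ODF file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    // Hypothetical stand-in for a body content handler: it buffers only the
    // character events that arrive while the parser is inside <body>.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                buffer.append(ch, start, length);
            }
        }

        // Returns everything collected so far, like handler.toString()
        // in the program below.
        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses an XHTML string and returns the text found inside <body>.
    public static String extractBodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Ignored</title></head>"
                       + "<body><p>Hello ODF</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello ODF"
    }
}
```

A Tika parser plays the role of the SAX parser here: it converts the document into XHTML events and pushes them into the handler, which is why calling toString() on the handler after parse() returns the document text.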

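The following dependencies are required to run the Java code below. tika-core is included because the Metadata, ParseContext, and BodyContentHandler classes belong to it:

tika-core-1.24.1.jar
tika-parsers-1.24.1.jar
commons-io-2.8.0.jar
slf4j-api-2.0.0-alpha0.jar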

Implementation:

Java




// Java Program to Extract Content from an ODF file
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class OdfContentExtractor {
    public static void main(String[] args)
    {
        // Collects the text of the document body.
        BodyContentHandler handler = new BodyContentHandler();

        // Filled in by the parser with the document's metadata.
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();

        // Parser for the open document.
        OpenDocumentParser openDocumentParser
            = new OpenDocumentParser();

        // Here .odt is the OpenDocument text format.
        // try-with-resources closes the stream once parsing ends.
        try (FileInputStream inputStream
             = new FileInputStream(
                 new File("F:\\geeks.odt"))) {

            // Passing the InputStream, ContentHandler,
            // Metadata and ParseContext to the parse method.
            openDocumentParser.parse(inputStream, handler,
                                     metadata,
                                     parseContext);
            System.out.println("Content in the document :"
                               + handler.toString());

            // Displaying the metadata of the ODF file.
            System.out.println("Metadata of the document:");
            for (String name : metadata.names()) {
                System.out.println(
                    name + " : =  " + metadata.get(name));
            }
        }
        catch (IOException | SAXException | TikaException e) {
            System.out.println(
                "failed to extract content due to " + e);
        }
    }
}


Output:

Content in the document :Geekforgeeks has a great content on DSA.

Metadata of the document:
date : =  2020-11-21T05:38:00Z
meta:paragraph-count : =  1
meta:word-count : =  6
meta:initial-author : =  Mohan Sai
initial-creator : =  Mohan Sai
dc:creator : =  Mohan Sai
generator : =  MicrosoftOffice/15.0 MicrosoftWord
Word-Count : =  6
dcterms:created : =  2020-11-21T05:36:00Z
dcterms:modified : =  2020-11-21T05:38:00Z
Last-Modified : =  2020-11-21T05:38:00Z
nbPara : =  1
Last-Save-Date : =  2020-11-21T05:38:00Z
meta:character-count : =  40
Paragraph-Count : =  1
meta:save-date : =  2020-11-21T05:38:00Z
modified : =  2020-11-21T05:38:00Z
Edit-Time : =  PT0S
nbCharacter : =  40
nbPage : =  1
nbWord : =  6
Content-Type : =  application/vnd.oasis.opendocument.text
creator : =  Mohan Sai
meta:author : =  Mohan Sai
meta:creation-date : =  2020-11-21T05:36:00Z
Creation-Date : =  2020-11-21T05:36:00Z
xmpTPg:NPages : =  1
Character Count : =  40
editing-cycles : =  3
Page-Count : =  1
Author : =  Mohan Sai
meta:page-count : =  1


Last Updated : 16 Sep, 2021