ODF stands for Open Document Format. It is an international family of standards that succeeds commonly used, deprecated vendor-specific document formats such as .doc, .wpd, and .xls. ODF files are ZIP-compressed XML packages, so they are often smaller than equivalent files in those older formats. The OpenDocumentParser class from the Apache Tika library is used to extract content from an ODF file.
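Since an ODF file is a ZIP package holding XML parts such as content.xml and meta.xml, its layout can be inspected with the JDK alone. The sketch below is only an illustration: it builds a tiny in-memory archive that mimics that layout (it is not a valid .odt file) and lists its entries. The class and method names (OdfPackageSketch, buildSamplePackage, listEntries) and the sample entry contents are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdfPackageSketch {

    // Builds a tiny ZIP archive in memory that mimics the layout of an ODF
    // package. Illustrative stand-in only, not a valid .odt file.
    static byte[] buildSamplePackage() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            addEntry(zip, "content.xml", "<text>Hello ODF</text>");
            addEntry(zip, "meta.xml", "<meta/>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The archive is complete once the ZipOutputStream has been closed.
        return bytes.toByteArray();
    }

    private static void addEntry(ZipOutputStream zip, String name, String body)
            throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(body.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Returns the entry names in the order they are stored in the archive.
    static List<String> listEntries(byte[] packageBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [mimetype, content.xml, meta.xml]
        System.out.println(listEntries(buildSamplePackage()));
    }
}
```

To look inside a real document, the same listEntries approach could be pointed at the bytes of an actual .odt file instead of the sample package.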
Methods used:
- BodyContentHandler(): It creates a content handler that writes XHTML body character events to an internal string buffer.
- Metadata(): It constructs a new, empty metadata object.
- ParseContext(): It creates a parse context object that is used to pass context information to Tika parsers.
- parse(): It is invoked on the parser object and parses the input stream, sending the extracted content to the content handler and filling in the metadata.
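To make the idea of a handler that buffers body character events more concrete, here is a minimal sketch that uses only the SAX classes shipped with the JDK. It is a simplified stand-in written for illustration, not Tika's BodyContentHandler: the names BodyTextSketch, BodyTextHandler and extractBodyText are hypothetical, and the sample XHTML string is made up.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextSketch {

    // Hypothetical, simplified handler: buffers character events that occur
    // inside <body> and ignores everything else (for example the <head>).
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private int bodyDepth = 0; // > 0 while the parser is inside <body>

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses the given XHTML string and returns the text found inside <body>.
    static String extractBodyText(String xhtml) {
        try {
            BodyTextHandler handler = new BodyTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                     + "<body><p>Hello ODF</p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello ODF"
    }
}
```

In the Tika program, the parser plays the role of the SAX event source, and calling toString() on the handler returns the buffered text in the same way.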
The following dependencies are required to run the Java code below:
- tika-parsers-1.24.1.jar
- commons-io-2.8.0.jar
- slf4j-api-2.0.0-alpha0.jar
Implementation:
Java
// Java program to extract content from an ODF file
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class OdfContentExtractor {
    public static void main(String[] args)
    {
        // Handler that collects the body text in an internal buffer.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        OpenDocumentParser openDocumentParser = new OpenDocumentParser();

        // Here .odt is the open document text format.
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream inputStream
             = new FileInputStream(new File("F:\\geeks.odt"))) {

            // Passing the InputStream, ContentHandler,
            // Metadata and ParseContext to the parse method.
            openDocumentParser.parse(inputStream, handler,
                                     metadata, parseContext);

            System.out.println("Content in the document :"
                               + handler.toString());

            // Displaying the metadata of the ODF file.
            System.out.println("Metadata of the document:");
            for (String name : metadata.names()) {
                System.out.println(name + " : = " + metadata.get(name));
            }
        }
        catch (IOException | SAXException | TikaException e) {
            System.out.println(
                "failed to extract content due to " + e);
        }
    }
}
Output:
Content in the document :Geekforgeeks has a great content on DSA.
Metadata of the document:
date : = 2020-11-21T05:38:00Z
meta:paragraph-count : = 1
meta:word-count : = 6
meta:initial-author : = Mohan Sai
initial-creator : = Mohan Sai
dc:creator : = Mohan Sai
generator : = MicrosoftOffice/15.0 MicrosoftWord
Word-Count : = 6
dcterms:created : = 2020-11-21T05:36:00Z
dcterms:modified : = 2020-11-21T05:38:00Z
Last-Modified : = 2020-11-21T05:38:00Z
nbPara : = 1
Last-Save-Date : = 2020-11-21T05:38:00Z
meta:character-count : = 40
Paragraph-Count : = 1
meta:save-date : = 2020-11-21T05:38:00Z
modified : = 2020-11-21T05:38:00Z
Edit-Time : = PT0S
nbCharacter : = 40
nbPage : = 1
nbWord : = 6
Content-Type : = application/vnd.oasis.opendocument.text
creator : = Mohan Sai
meta:author : = Mohan Sai
meta:creation-date : = 2020-11-21T05:36:00Z
Creation-Date : = 2020-11-21T05:36:00Z
xmpTPg:NPages : = 1
Character Count : = 40
editing-cycles : = 3
Page-Count : = 1
Author : = Mohan Sai
meta:page-count : = 1
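The metadata values in the output are plain strings, but several of them look like ISO-8601 timestamps and durations, so they can be turned into typed values with java.time. The sketch below assumes that format and hard-codes two timestamps and the Edit-Time value copied from the output above. The class name MetadataValueSketch and the helper between are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

public class MetadataValueSketch {

    // Time elapsed between two ISO-8601 timestamps, such as the
    // dcterms:created and dcterms:modified values printed above.
    static Duration between(String created, String modified) {
        return Duration.between(Instant.parse(created), Instant.parse(modified));
    }

    public static void main(String[] args) {
        Duration gap = between("2020-11-21T05:36:00Z", "2020-11-21T05:38:00Z");
        // prints "Minutes between creation and last save: 2"
        System.out.println("Minutes between creation and last save: "
                           + gap.toMinutes());

        // Edit-Time is assumed to be an ISO-8601 duration string.
        Duration editTime = Duration.parse("PT0S");
        // prints "Edit time is zero: true"
        System.out.println("Edit time is zero: " + editTime.isZero());
    }
}
```

In the extractor program, the same strings would come from metadata.get(...) rather than from literals.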