Java can extract and read the content of a PDF document with the help of the Apache Tika library. The following classes are used in the extraction of the content:

BodyContentHandler: A class in the package org.apache.tika.sax that creates a handler for the text. It receives the XHTML body character events and stores them in an internal string buffer. It inherits from the parent class ContentHandlerDecorator, and the collected text can be retrieved by calling toString() on the handler.

PDFParser: A class in the package org.apache.tika.parser.pdf that parses the contents of PDF documents. It extracts the text of a PDF document stored within paragraphs, strings, and tables (without preserving tabular boundaries). It can also parse encrypted documents if the password is supplied.

ParseContext: A class in the package org.apache.tika.parser that is used to pass context information on to the Tika parsers.

Procedure:
1. Create a BodyContentHandler to collect the extracted text.
2. Open the PDF file as a FileInputStream.
3. Create a Metadata object and a ParseContext object.
4. Create a PDFParser and call its parse() method with the stream, handler, metadata, and context.
5. Read the extracted text from the handler with toString().

Implementation: The following Java program illustrates the extraction of content from a PDF document. It prints the text of the file at the given local path, preceded by the label "Extracting contents :".
// Java Program to Extract Content from a PDF

// Importing java input/output classes
import java.io.File;
import java.io.FileInputStream;

// Importing Apache Tika classes
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

// Class
public class GFG {

    // Main driver method
    public static void main(String[] args) throws Exception
    {
        // Create a content handler that collects the body text
        BodyContentHandler contenthandler = new BodyContentHandler();

        // Refer to the PDF file in the local directory
        File f = new File("C:/extractcontent.pdf");

        // Create an object of type Metadata to hold document metadata
        Metadata data = new Metadata();

        // Create a parse context for the pdf document
        ParseContext context = new ParseContext();

        // PDF document can be parsed using the PDFParser class
        PDFParser pdfparser = new PDFParser();

        // Open a file input stream on the file;
        // try-with-resources closes it after parsing
        try (FileInputStream fstream = new FileInputStream(f)) {
            // Method parse invoked on the PDFParser object
            pdfparser.parse(fstream, contenthandler, data, context);
        }

        // Printing the contents of the pdf document
        // using the toString() method of the handler
        System.out.println("Extracting contents :"
                           + contenthandler.toString());
    }
}
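
To make the role of the content handler more concrete, the sketch below imitates the idea behind BodyContentHandler using only the SAX classes that ship with the JDK. It is not Tika's implementation. The class name BodyTextCollector and the method extractBodyText are hypothetical names chosen for this illustration, and the input is a small hand-written XHTML string rather than a real PDF. The handler receives character events, keeps only those inside the body element, and stores them in an internal string buffer that toString() returns.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for the idea behind BodyContentHandler:
// collect the character events that occur inside <body> into a buffer.
public class BodyTextCollector extends DefaultHandler {

    // Internal string buffer that accumulates the body text
    private final StringBuilder buffer = new StringBuilder();

    // True while the parser is between <body> and </body>
    private boolean insideBody = false;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs)
    {
        if ("body".equals(qName)) {
            insideBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        if ("body".equals(qName)) {
            insideBody = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
    {
        // Ignore text outside the body, such as the <title>
        if (insideBody) {
            buffer.append(ch, start, length);
        }
    }

    // Return the text collected so far
    @Override
    public String toString()
    {
        return buffer.toString();
    }

    // Parse an XHTML string and return the text found in its body
    public static String extractBodyText(String xhtml) throws Exception
    {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BodyTextCollector handler = new BodyTextCollector();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception
    {
        String xhtml = "<html><head><title>Ignored</title></head>"
                       + "<body><p>Hello, PDF text.</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

In the Tika program above, PDFParser plays the part of the SAX parser here: it reports the document as XHTML events, and the handler passed to parse() decides which of them end up in the returned text.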