BodyContentHandler Class in Java
Last Updated: 28 Oct, 2021
Apache Tika is a library that allows you to extract data from different document formats (.PDF, .DOCX, etc.). In this tutorial, we will extract data by using BodyContentHandler. The dependency that will be used is shown below:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
BodyContentHandler is a content handler decorator that lets you get everything inside the XHTML <body> tag. The <body> and </body> tags themselves are not included in the result.
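To make the "everything inside <body>" idea concrete, the following is a minimal sketch that uses only the JDK's SAX parser. It is not Tika's implementation: the class BodyTextSketch and its extractBody method are hypothetical names introduced for illustration, and the sketch assumes a small, well-formed XHTML string without namespaces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika.
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while the parser is inside <body>

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Start counting at <body>, then track every nested element
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text that appears inside <body> is collected
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses the given XHTML string and returns the text found inside <body>
    static String extractBody(String xhtml) {
        try {
            BodyTextSketch handler = new BodyTextSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored</title></head>"
                     + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractBody(xhtml)); // prints: HelloTika
    }
}
```

The title text is dropped because it sits outside <body>, and the tags themselves never reach the buffer.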
Let us first discuss the various constructors of this class, which are as follows:
Constructor | Description |
BodyContentHandler() | Writes all content into an internal string buffer; to get the content, call toString(). By default, the maximum content length is 100,000 characters. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(int writeLimit) | Writes all content into an internal string buffer; to get the content, call toString(). The write limit is the maximum number of characters that can be written; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown. |
BodyContentHandler(OutputStream outputStream) | Writes all content into the given output stream, without any content limit. |
BodyContentHandler(Writer writer) | Writes all content into the given writer, without any content limit. |
BodyContentHandler(ContentHandler handler) | Passes all content to the given handler. |
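The write limit behaviour described above can be illustrated with a small JDK-only sketch. This is an assumption-laden illustration of the idea (a buffer that stops accepting characters and throws a SAXException), not Tika's actual code; WriteLimitSketch is a hypothetical class name, and keeping the characters that still fit before throwing is a design choice of this sketch.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika.
class WriteLimitSketch extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // -1 disables the limit

    WriteLimitSketch(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length)
        throws SAXException {
        if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
            buffer.append(ch, start, length);
        } else {
            // Keep the part that still fits, then report that the limit was hit
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException(
                "Write limit of " + writeLimit + " characters reached");
        }
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        WriteLimitSketch limited = new WriteLimitSketch(5);
        try {
            limited.characters("abcdefgh".toCharArray(), 0, 8);
        } catch (SAXException e) {
            // The content collected before the limit is still available
            System.out.println("Partial content: " + limited);
        }
    }
}
```

With a real BodyContentHandler, the same pattern applies: callers that expect large documents either pass -1 as the write limit or catch the SAXException and decide whether the partial text is good enough.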
The class that this handler relies on is as follows:
Class | Action Performed |
MatchingContentHandler | Allows you to get data by XPath |
Note: The BodyContentHandler class doesn't implement any method of the ContentHandler interface itself; it just describes the XPath that MatchingContentHandler uses to get the XHTML body content.
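As a rough illustration of selecting body content by XPath, the sketch below uses the JDK's own javax.xml.xpath API on a DOM tree. It is not how Tika evaluates its expression (Tika matches on a SAX stream): BodyXPathSketch and bodyText are hypothetical names, and the expression /html/body/descendant::text() is an assumption chosen for a namespace-free input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Apache Tika.
class BodyXPathSketch {

    // Returns the concatenated text nodes found under /html/body
    static String bodyText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            // Select every text node below <body>, in document order
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate("/html/body/descendant::text()",
                          doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(nodes.item(i).getNodeValue());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>T</title></head>"
                     + "<body><p>Hello </p><p>world</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints: Hello world
    }
}
```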
Implementation:
Example 1: Reading everything into the internal string buffer
Java
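import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class GFG {
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Load the document from the classpath; the stream is closed automatically
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(fileName)) {
            Parser parser = new AutoDetectParser();
            // Default constructor: internal buffer with the default write limit
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
            // toString() returns the collected body text
            return handler.toString();
        }
    }

    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        GFG example = new GFG();
        System.out.println("Result");
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}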
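Output: On success, the console shows the line "Result" followed by the text extracted from test-reading.pdf.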
Example 2: Writing content into a file (the OutputStream constructor applies no content length limit)
Java
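import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class GFG {
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        GFG example = new GFG();
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }

    public void writeParsedDataToFile(String readFromFileName,
                                      String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Create the target file if it does not exist yet
        File yourFile = new File(writeToFileName);
        yourFile.createNewFile();
        // Both streams are closed automatically, even if parsing fails
        try (InputStream stream = this.getClass()
                                      .getClassLoader()
                                      .getResourceAsStream(readFromFileName);
             OutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {
            Parser parser = new AutoDetectParser();
            // Body content is written straight to the file, with no write limit
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();
            parser.parse(stream, handler, metadata, context);
        }
    }
}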
Output:
Nothing is printed to the console, because in this case all extracted content is written into a file.
The program produces a '.txt' file that contains the text content of the '.pdf' file.