Parsing PDFs in Python with Tika

Apache Tika is a library for document type detection and content extraction from a wide range of file formats. With it, one can build a universal type detector and content extractor that pulls both structured text and metadata from documents such as spreadsheets, text documents, images, PDFs, and, to a certain extent, multimedia formats. Tika-Python is a Python binding to the Apache Tika™ REST services that allows Tika to be called natively from Python.

Installation:

To install Tika, run the following command in a terminal:

pip install tika

Note: Tika is written in Java, so you need a Java runtime (version 7 or later) installed.
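
Because the Python package relies on a Java runtime, it can help to confirm that one is reachable before the first call. A minimal sketch, assuming the runtime is found through the `java` command on the PATH; the helper name `java_available` is hypothetical and not part of Tika-Python:

```python
import shutil


def java_available(command="java"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(command) is not None


if __name__ == "__main__":
    if java_available():
        print("Java runtime found")
    else:
        print("No 'java' executable on PATH; install a Java runtime first")
```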

To extract the contents of PDF files, we will use the from_file() function of the parser module. Let's look at its description first.



Syntax: parser.from_file(filename, additional)

Parameters:

  • filename: The location of the file. It is opened in rb mode, i.e. read binary mode.
  • additional: Optional parameters:
    • service: The service requested from the Tika server. The default value is 'all', which returns recursive text content plus metadata. 'meta' returns only metadata; 'text' returns only content.
    • xmlContent: Set to True to get the content as XML. The default value is False.

Return type: dictionary.
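
The parameters above can be combined as in the following sketch. The wrapper name `extract` and its up-front validation are illustrative assumptions, not part of Tika-Python; only parser.from_file() and its service and xmlContent parameters come from the description above. The import is deferred so that the argument check runs even where Tika is not installed.

```python
# Service names taken from the parameter description above.
VALID_SERVICES = ("all", "meta", "text")


def extract(filename, service="all", xml_content=False):
    """Hypothetical wrapper around parser.from_file with an argument check."""
    if service not in VALID_SERVICES:
        raise ValueError(f"service must be one of {VALID_SERVICES}, got {service!r}")
    # Imported lazily: needs the tika package and a Java runtime at call time.
    from tika import parser
    return parser.from_file(filename, service=service, xmlContent=xml_content)
```

For example, extract("sample.pdf", service="meta") would request only metadata, and extract("sample.pdf", xml_content=True) would ask for the content as XML.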

Now, let's see Python programs for extracting data from a PDF:

Example 1: Extracting the contents of a PDF file.

Python3

# import the parser module from tika
from tika import parser

# parse the PDF file
parsed_pdf = parser.from_file("sample.pdf")

# save the content of the PDF
# to request text only, pass service='text' to from_file()
# parsed_pdf['content'] is a string
data = parsed_pdf['content']

# print the content
print(data)

# <class 'str'>
print(type(data))

Output:

pdf content
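
The extracted text often needs tidying before use. A minimal sketch, assuming that 'content' may be None (for instance when a PDF holds only scanned images) and that the text may be padded with blank lines; the helper `clean_content` is hypothetical and works on any dictionary shaped like the one returned by from_file():

```python
def clean_content(parsed):
    """Return the non-empty, stripped lines of parsed['content'] as one string."""
    content = parsed.get("content")
    if content is None:  # assumption: no extractable text yields None
        return ""
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


# Stand-in for a real from_file() result so the sketch runs without a server.
sample = {"content": "\n\n  Hello PDF  \n\nSecond line\n", "status": 200}
print(clean_content(sample))
```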



Example 2: Extracting the metadata of a PDF file.

Python3

# import the parser module from tika
from tika import parser

parsed_pdf = parser.from_file("sample.pdf")

# the 'metadata' key holds the
# key-value pairs of metadata
print(parsed_pdf['metadata'])

# <class 'dict'>
print(type(parsed_pdf['metadata']))

Output:

Meta data
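
Individual fields can then be read from the metadata dictionary. A sketch, assuming that the available keys depend on the document and that a value may be either a single string or a list of strings; the helper `meta_value` and the sample values below are made up for illustration:

```python
def meta_value(metadata, key, default=None):
    """Look up a metadata key, joining list values into one string."""
    value = metadata.get(key, default)
    if isinstance(value, (list, tuple)):
        return ", ".join(str(item) for item in value)
    return value


# Made-up stand-in for parsed_pdf['metadata'].
metadata = {"Content-Type": "application/pdf", "dc:creator": ["A. Author", "B. Author"]}
print(meta_value(metadata, "Content-Type"))
print(meta_value(metadata, "dc:creator"))
print(meta_value(metadata, "missing-key", default="unknown"))
```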

Example 3: Extracting the keys of the parsed dictionary.

Python3

from tika import parser

parsed_pdf = parser.from_file("sample.pdf")

# Returns the keys available for the given PDF.
print(parsed_pdf.keys())

Output:

keys of the parsed dictionary

Example 4: Checking the Tika server status.

Python3

from tika import parser

# You can also check the
# status returned by the Tika
# server, 200 for success
parsed_pdf = parser.from_file("sample.pdf")

print(parsed_pdf['status'], type(parsed_pdf['status']))

Output:

200 <class 'int'>
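
The status code can guard later processing. A minimal sketch, assuming 200 is the only status treated as success, as in the example above; `require_ok` is a hypothetical helper:

```python
def require_ok(parsed):
    """Return parsed unchanged if status is 200, otherwise raise RuntimeError."""
    status = parsed.get("status")
    if status != 200:
        raise RuntimeError(f"Tika server returned status {status}")
    return parsed


# Stand-in dictionary shaped like a from_file() result.
print(require_ok({"status": 200, "content": "text"})["content"])
```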
