Apache Tika is a library for document type detection and content extraction from various file formats. With it, one can build a universal type detector and content extractor that pulls both structured text and metadata from many kinds of documents, such as spreadsheets, text documents, images, PDFs, and, to a certain extent, multimedia formats. Tika-Python is a Python binding to the Apache Tika™ REST services that allows Tika to be called natively from Python.
To install Tika, type the command below in the terminal.
pip install tika
Note: Tika is written in Java, so you need a Java runtime (version 7 or later) installed.
To extract content from PDF files, we will use the from_file() function of the parser module. Let’s look at its description first.
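Before running any of the examples, it can help to confirm that a Java runtime is reachable. The snippet below is a minimal sketch using only the Python standard library. The helper name java_available is made up for illustration, and the check only confirms that a java executable is on the PATH, not which version it is.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


print("Java found on PATH:", java_available())
```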
Syntax: parser.from_file(filename, additional)
- filename: The location of the file. It is opened in rb (read binary) mode.
- additional: Optional parameters:
  - service: The service requested from the Tika server. The default value is ‘all’, which returns recursive text content plus metadata. ‘meta’ returns only metadata; ‘text’ returns only content.
  - xmlContent: Set to True to get the content as XML. The default value is False.
Return type: dictionary.
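The optional parameters described above can be passed as keyword arguments. The sketch below assumes that from_file() accepts service and xmlContent as keywords, as the description suggests. The helper from_file_options is a hypothetical name used only to validate the choice, and the PDF path is taken from the command line rather than hard-coded.

```python
import sys

# The three service values listed in the description above.
VALID_SERVICES = ("all", "meta", "text")


def from_file_options(service="all", xml_content=False):
    """Build the keyword arguments for parser.from_file().

    Hypothetical helper: it only checks the service name and maps the
    options onto the parameter names `service` and `xmlContent`.
    """
    if service not in VALID_SERVICES:
        raise ValueError("service must be one of %s" % (VALID_SERVICES,))
    return {"service": service, "xmlContent": bool(xml_content)}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here because tika needs a Java runtime to do any work.
    from tika import parser

    # Request only the metadata of the PDF given on the command line.
    parsed_pdf = parser.from_file(sys.argv[1], **from_file_options("meta"))
    print(parsed_pdf)
```

Passing the options by keyword avoids depending on the positional order of the parameters.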
Now, let’s see Python programs for extracting data from a PDF:
Example 1: Extracting the contents of the PDF file.
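The original listing for this example is not available, so what follows is a minimal sketch rather than the article's own code. It assumes the returned dictionary stores the extracted text under the 'content' key, and that this value may be None when no text is found. The helper name text_of is hypothetical, and the PDF path is read from the command line.

```python
import sys


def text_of(parsed):
    """Return the extracted text from a from_file() result, or "" if none."""
    # 'content' may be None, for example when the PDF has no text layer.
    return (parsed.get("content") or "").strip()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here because tika needs a Java runtime to do any work.
    from tika import parser

    parsed_pdf = parser.from_file(sys.argv[1])
    print(text_of(parsed_pdf))
```

Saved under a name of your choice, the script is run with the path to a PDF as its only argument.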
Example 2: Extracting the metadata of the PDF file.
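As a sketch of this example (not the article's original listing), the code below reads the 'metadata' entry of the returned dictionary and prints it one field per line. The helper name metadata_lines is hypothetical. Which metadata fields appear depends on the PDF, so no particular keys are assumed.

```python
import sys


def metadata_lines(parsed):
    """Format the metadata of a from_file() result as sorted 'key: value' lines."""
    # 'metadata' may be missing or None if nothing could be extracted.
    metadata = parsed.get("metadata") or {}
    return ["%s: %s" % (key, metadata[key]) for key in sorted(metadata)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here because tika needs a Java runtime to do any work.
    from tika import parser

    parsed_pdf = parser.from_file(sys.argv[1])
    for line in metadata_lines(parsed_pdf):
        print(line)
```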
Example 3: Extracting the keys of the returned dictionary.
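Since from_file() returns a dictionary, its keys can be listed like those of any other dict. The sketch below uses a hypothetical helper, result_keys, and reads the PDF path from the command line. The exact set of keys may vary with the installed tika version.

```python
import sys


def result_keys(parsed):
    """Return the top-level keys of a from_file() result in sorted order."""
    return sorted(parsed.keys())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here because tika needs a Java runtime to do any work.
    from tika import parser

    parsed_pdf = parser.from_file(sys.argv[1])
    print(result_keys(parsed_pdf))
```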
Example 4: Checking the Tika server status.
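The original listing is missing here as well. The sketch below assumes the returned dictionary carries the HTTP status code of the Tika server's response under a 'status' key, and prints the value together with its type. The helper name describe_status is hypothetical.

```python
import sys


def describe_status(parsed):
    """Return the status code of a from_file() result followed by its type."""
    status = parsed.get("status")
    return "%s %s" % (status, type(status))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Imported here because tika needs a Java runtime to do any work.
    from tika import parser

    parsed_pdf = parser.from_file(sys.argv[1])
    print(describe_status(parsed_pdf))
```

For a successful request, the output reported by the article is: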
200 <class 'int'>