This article covers how to extract tables from a PDF in Python. First, what is a PDF file?
PDF (Portable Document Format) is a file format that preserves the layout of a printed document (text, fonts, and graphics) so that you can view, navigate, print, or forward it to someone else. PDF files can be created using Adobe Acrobat or similar tools.
Example:
Suppose a PDF file contains the following table:
User_ID  Name   Occupation
1        David  Product Manager
2        Leo    IT Administrator
3        John   Lawyer
We want to read this table into our Python program. There are several ways to do this; let’s go through them one by one.
Method 1: Using tabula-py
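tabula-py is a simple Python wrapper around tabula-java, which can read tables in a PDF. Because it wraps a Java library, it needs a Java runtime installed on the machine. You can install tabula-py, along with tabulate for printing the result, using the commands:
pip install tabula-py
pip install tabulate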
The functions used in the example are:
read_pdf(): reads the tables from the PDF file at the given path and returns them as a list of pandas DataFrames
tabulate(): formats tabular data as a plain-text table
The PDF file used here is named abc.pdf.
Python3
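from tabula import read_pdf
from tabulate import tabulate
# read every table in the PDF; read_pdf returns a list of DataFrames
dfs = read_pdf("abc.pdf", pages="all")  # path to the PDF file
# print each extracted table with its column names as the header
for df in dfs:
    print(tabulate(df, headers="keys"))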
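In recent versions of tabula-py, read_pdf() returns a list with one DataFrame per detected table, and some of those can turn out empty when a page region is misdetected. The sketch below is one possible way to wrap the call and keep only the usable tables. The helper name extract_tables is hypothetical, and the import sits inside the function only so that the helper can be defined on a machine where tabula-py or Java is missing.

```python
def extract_tables(pdf_path, pages="all"):
    """Return the non-empty tables tabula-py finds in a PDF.

    Hypothetical helper for illustration; not part of tabula-py itself.
    """
    # Imported here so that merely defining the helper does not
    # require tabula-py (and its Java dependency) to be installed.
    from tabula import read_pdf

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = read_pdf(pdf_path, pages=pages)

    # Drop tables with no rows or columns (DataFrame.empty is True for those).
    return [table for table in tables if not table.empty]
```

Calling extract_tables("abc.pdf") would then give a list that can be looped over and passed to tabulate() one table at a time.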
Method 2: Using Camelot
Camelot is a Python library for extracting tables from PDF files. You can install the camelot-py library using the command:
pip install camelot-py
The functions and attributes used in the example are:
read_pdf(): reads the tables from the PDF file at the given path and returns them as a list-like collection
tables[index].df: returns the table at the given index as a pandas DataFrame
The PDF file used here is named test.pdf.
Python3
import camelot
# extract the tables from the PDF file (only page 1 by default; pass pages="all" for every page)
abc = camelot.read_pdf("test.pdf")  # path to the PDF file
# print the first table as a pandas DataFrame
print(abc[0].df)
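Camelot offers two parsing modes through the flavor argument of read_pdf(): "lattice" (the default), which relies on ruling lines drawn between cells, and "stream", which relies on the whitespace between columns. A minimal sketch of trying lattice first and falling back to stream, assuming a text-based (not scanned) PDF; the helper name read_tables is hypothetical, and the import is kept inside the function so the helper can be defined without Camelot installed:

```python
def read_tables(pdf_path, pages="1"):
    """Return every table Camelot finds as a list of pandas DataFrames.

    Hypothetical helper for illustration; not part of Camelot itself.
    """
    # Imported here so that defining the helper does not require Camelot.
    import camelot

    # "lattice" works when the table cells are separated by drawn lines.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")

    # If nothing was found, retry with "stream", which uses the whitespace
    # between columns instead of ruling lines.
    if len(tables) == 0:
        tables = camelot.read_pdf(pdf_path, pages=pages, flavor="stream")

    # Each Camelot table exposes its contents through the .df attribute.
    return [table.df for table in tables]
```

For example, read_tables("test.pdf", pages="all") would scan every page instead of only the first one.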