How to Extract PDF Tables in Python?

Last Updated : 21 Oct, 2021

This article explains how to extract tables from a PDF file in Python. First, let's discuss what a PDF file is.

PDF (Portable Document Format) is a file format that captures all the elements of a printed document as an electronic image that you can view, navigate, print, or forward to someone else. PDF files can be created using tools such as Adobe Acrobat.

Example:

Suppose a PDF file contains the following table:

User_ID	Name	Occupation
1	David	Product Manager
2	Leo	IT Administrator
3	John	Lawyer

We want to read this table into our Python program. This can be done in several ways; let's discuss them one by one.
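
To make the goal concrete, here is a minimal sketch of the in-memory result we are after. It builds the rows from the example table typed by hand as tab-separated text, not from a real PDF, and uses only the standard library. The variable names raw_table and rows are illustrative.

```python
import csv
import io

# The example table as tab-separated text (typed by hand, not read from a PDF)
raw_table = (
    "User_ID\tName\tOccupation\n"
    "1\tDavid\tProduct Manager\n"
    "2\tLeo\tIT Administrator\n"
    "3\tJohn\tLawyer\n"
)

# Parse each row into a dictionary keyed by the column headers
rows = list(csv.DictReader(io.StringIO(raw_table), delimiter="\t"))

for row in rows:
    print(row["User_ID"], row["Name"], row["Occupation"])
```

The libraries discussed in this article return pandas DataFrames rather than plain dictionaries, but the idea is the same: one record per table row, addressed by column name.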

Method 1: Using tabula-py

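tabula-py is a simple Python wrapper around tabula-java that can read tables from a PDF. Because it wraps a Java library, it needs a Java runtime installed on your machine. You can install tabula-py, along with tabulate for printing the result, using the following commands: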

pip install tabula-py
pip install tabulate

The methods used in the example are:

read_pdf(): reads the tables from the PDF file at the given path and returns them as a list of pandas DataFrames

tabulate(): formats tabular data as a plain-text table

The example reads a PDF file named abc.pdf.

Python3

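from tabula import read_pdf
from tabulate import tabulate

# read_pdf returns a list of DataFrames, one per detected table
tables = read_pdf("abc.pdf", pages="all")  # path to the PDF file

# print each table in a plain-text table format
for table in tables:
    print(tabulate(table, headers="keys", showindex=False))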
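
Printing is often only the first step. As a rough sketch, assuming a recent tabula-py where read_pdf() returns a list of pandas DataFrames, the extracted tables could be saved as CSV files like this. The helper name pdf_tables_to_csv, the output folder "tables" and the file naming scheme are hypothetical choices made for illustration.

```python
from pathlib import Path


def pdf_tables_to_csv(pdf_path, out_dir="tables"):
    """Save every table tabula-py finds in pdf_path as a separate CSV file.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    # Imported here so the path check above runs before tabula-py is needed
    from tabula import read_pdf

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # One DataFrame per detected table, numbered from 1
    for i, table in enumerate(read_pdf(str(pdf_path), pages="all"), start=1):
        csv_path = out_dir / f"{pdf_path.stem}_table_{i}.csv"
        table.to_csv(csv_path, index=False)
        written.append(csv_path)
    return written
```

Calling pdf_tables_to_csv("abc.pdf") would then write one CSV file per detected table into the tables folder. Checking the path first gives a clear Python error for a missing file before tabula-py is even loaded.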

Method 2: Using Camelot

Camelot is a Python library for extracting tables from PDF files. You can install it using the following command:

pip install camelot-py

The methods used in the example are:

read_pdf(): reads the tables from the PDF file at the given path and returns them as a list-like TableList

tables[index].df: returns the table at the given index as a pandas DataFrame

The example reads a PDF file named test.pdf.

Python3

import camelot

# extract the tables from the PDF file (only page 1 is parsed by default;
# pass pages="all" to parse every page)
tables = camelot.read_pdf("test.pdf")  # path to the PDF file

# print the first table as a pandas DataFrame
print(tables[0].df)
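
Camelot offers two parsing modes, selected with the flavor argument: "lattice" for tables drawn with ruling lines and "stream" for tables separated only by whitespace. The sketch below tries lattice first and falls back to stream when nothing is found. The helper name read_tables and the fallback strategy are assumptions made for illustration, not something Camelot does on its own.

```python
from pathlib import Path


def read_tables(pdf_path, pages="1"):
    """Try Camelot's lattice parser first, then fall back to stream.

    Hypothetical helper for illustration; returns a Camelot TableList.
    """
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"PDF not found: {pdf_path}")

    import camelot  # imported lazily so the path check runs first

    # "lattice" expects tables with ruling lines between cells
    tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor="lattice")
    if tables.n == 0:
        # "stream" guesses the columns from whitespace instead of drawn lines
        tables = camelot.read_pdf(str(pdf_path), pages=pages, flavor="stream")
    return tables
```

For example, read_tables("test.pdf", pages="all") would scan every page. Each returned table also carries a parsing_report with accuracy and whitespace figures, which can help judge whether the extraction is trustworthy.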