Convert PDF to CSV using Python

Python is a popular high-level, general-purpose programming language. Python 3, the current major version, is used in web development, machine learning applications, and many other areas of the software industry. It suits beginners as well as experienced programmers coming from languages such as C++ and Java.

In this article, we will learn how to convert a PDF file to a CSV file using Python. We will discuss two methods, each of which takes a PDF file as input.

Method 1:

Here we will use the pdftables_api module, the Python client for the PDFTables service. It reads the tables in a PDF and converts the file into another format such as CSV.

Installation:

Open Command Prompt and run:

pip install git+https://github.com/pdftables/python-pdftables-api.git

This installs the pdftables_api module.
After installation, you need an API key.
Go to PDFTables.com and sign up, then visit the API page to see your API key.
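
Rather than pasting the key into the script, one option is to keep it in an environment variable and read it at run time. The sketch below is only an illustration: the variable name PDFTABLES_API_KEY and the helper load_api_key() are assumptions made for this example and are not part of pdftables_api.

```python
import os

# Hypothetical variable name; use any name, as long as you set it yourself.
API_KEY_ENV_VAR = "PDFTABLES_API_KEY"


def load_api_key(env_var=API_KEY_ENV_VAR):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your PDFTables API key."
        )
    return key
```

The returned string can then be passed to pdftables_api.Client() in place of a hard-coded key.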

Approach:

Create a client object with your API key.
To convert the PDF file into a CSV file, call the client's csv() method.

Syntax:

pdftables_api.Client('API KEY').csv(pdf_path, csv_path)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3

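# Import the module
import pdftables_api

# Create a client with your API key (replace the placeholder)
conversion = pdftables_api.Client('API KEY')

# Convert the PDF to CSV: csv(input PDF path, output path)
pdf_file_path = 'Hello.pdf'
output_file_path = 'Hello'
conversion.csv(pdf_file_path, output_file_path)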

Output:

CSV FILE
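
The same call can be wrapped in a small function that derives the output name from the input name and fails early when the PDF is missing. This is a minimal sketch, not part of the library: csv_path_for() and convert_pdf_to_csv() are hypothetical helpers, and only Client(...).csv(pdf_path, csv_path) comes from the syntax shown above.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return the path of a .csv file next to the given PDF."""
    return str(Path(pdf_path).with_suffix(".csv"))


def convert_pdf_to_csv(pdf_path, api_key):
    """Convert one PDF with pdftables_api and return the CSV path."""
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(f"No such PDF: {pdf_path}")
    # Imported here so the path helper works without the package installed.
    import pdftables_api

    csv_path = csv_path_for(pdf_path)
    pdftables_api.Client(api_key).csv(pdf_path, csv_path)
    return csv_path
```

For example, convert_pdf_to_csv('Hello.pdf', 'API KEY') would write Hello.csv beside the input file, provided the key is valid.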

Method 2:

Here we will use the tabula-py module to convert the PDF file. tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF into pandas DataFrames, and tabula-py can also convert a PDF file directly into a CSV, TSV, or JSON file.

Installation:

pip install tabula-py

Before we start, we need to install Java and add the Java installation folder to the PATH environment variable.

Install Java.
Add the Java installation folder (for example, C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
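
To confirm that the PATH change took effect, a quick check from Python can help before running tabula-py. The following sketch uses only the standard library; find_java() and java_version_text() are names made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable on PATH, or None."""
    return shutil.which("java")


def java_version_text():
    """Return the text printed by `java -version`, or None if Java is missing."""
    java = find_java()
    if java is None:
        return None
    # `java -version` writes its report to stderr rather than stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    print(java_version_text() or "Java was not found on PATH.")
```

If the script reports that Java was not found, open a new Command Prompt after editing PATH, since already open windows keep the old value.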

Approach:

Read the PDF file using the read_pdf() method, which returns a list of DataFrames, one per table found.
Then write the chosen DataFrame to a CSV file using the to_csv() method.

Syntax:

read_pdf(PDF file path, pages = page numbers to read, **kwargs)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3

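# Import the module
import tabula

# Read the tables on page 1 of the PDF.
# read_pdf() returns a list of DataFrames; take the first table.
pdf_file_path = 'Hello.pdf'  # placeholder: path to your PDF
df = tabula.read_pdf(pdf_file_path, pages=1)[0]

# Write the table to a CSV file (placeholder output path)
df.to_csv('Hello.csv')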

Output:

CSV FILE
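
The example above keeps only the first table on page 1. When a PDF holds several tables, one possible approach is to write each one to its own CSV file, or to let tabula-py write a single CSV directly with convert_into(). The sketch below assumes a tabula-py version in which read_pdf() returns a list of DataFrames; table_csv_path(), export_all_tables(), and convert_directly() are hypothetical helper names chosen for this example.

```python
from pathlib import Path


def table_csv_path(pdf_path, index):
    """Return a numbered CSV path such as report_table1.csv for report.pdf."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table{index}.csv")


def export_all_tables(pdf_path):
    """Write every table tabula finds in the PDF to its own CSV file."""
    # Imported here so the helper above works without tabula-py or Java.
    import tabula

    tables = tabula.read_pdf(str(pdf_path), pages="all")  # list of DataFrames
    written = []
    for index, table in enumerate(tables, start=1):
        out_path = table_csv_path(pdf_path, index)
        table.to_csv(out_path, index=False)  # omit the DataFrame row index
        written.append(out_path)
    return written


def convert_directly(pdf_path, csv_path):
    """Let tabula write a single CSV without going through pandas."""
    import tabula

    tabula.convert_into(str(pdf_path), str(csv_path),
                        output_format="csv", pages="all")
```

Passing index=False to to_csv() leaves out the DataFrame's row numbers, which are usually not wanted in the exported file.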
