
Convert PDF to CSV using Python

Python is a popular high-level, general-purpose programming language. Python 3, the current major version, is used in web development, machine learning applications, and many other areas of the software industry. It suits beginners as well as programmers experienced in other languages such as C++ and Java.

In this article, we will learn how to convert a PDF file to a CSV file using Python. We will discuss two methods for the conversion. Both methods use an input PDF file.



Method 1:

Here we will use the pdftables_api module to convert the PDF file. The pdftables_api module is a client for the PDFTables web service, which reads the tables in a PDF, so it needs an API key. It also allows us to convert PDF files into other formats.



Installation:

Open Command Prompt and type "pip install git+https://github.com/pdftables/python-pdftables-api.git"

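Approach:

Create a client with the API key, then call its csv() method with the input PDF path and the output CSV path.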

Syntax:

pdftables_api.Client('API KEY').csv(pdf_path, csv_path)

Below is the Implementation:

PDF File Used:

PDF FILE




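# Import the module
import pdftables_api

# Create a client; replace 'API KEY' with your own PDFTables API key
conversion = pdftables_api.Client('API KEY')

# Input PDF and output CSV paths (example names)
pdf_file_path = 'Hello.pdf'
output_file_path = 'Hello.csv'

# Convert the PDF to CSV
conversion.csv(pdf_file_path, output_file_path)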

Output:

CSV FILE
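Hard-coding the API key in the script is easy to leak, so a variation on Method 1 is sketched below. It assumes the key is stored in an environment variable; the variable name PDFTABLES_API_KEY and the helper names load_api_key, default_csv_path and convert_pdf_to_csv are made up for this example and are not part of pdftables_api. Only Client() and csv() come from the module, as in the code above.

```python
import os
from pathlib import Path


def load_api_key(env_var="PDFTABLES_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch, not something
    pdftables_api itself looks for.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def default_csv_path(pdf_path):
    """Return the PDF path with its extension swapped for .csv."""
    return str(Path(pdf_path).with_suffix(".csv"))


def convert_pdf_to_csv(pdf_path, csv_path=None):
    """Convert pdf_path to CSV through the PDFTables API; return the CSV path."""
    # Imported here so the helpers above work without the package installed.
    import pdftables_api

    csv_path = csv_path or default_csv_path(pdf_path)
    client = pdftables_api.Client(load_api_key())
    client.csv(pdf_path, csv_path)
    return csv_path
```

With the variable set, calling convert_pdf_to_csv('Hello.pdf') would write the result next to the input file as Hello.csv.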

Method 2:

Here we will use the tabula-py module to convert the PDF file. tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV, or a JSON file.

Installation:

pip install tabula-py

Before we start, we need to install Java and add the Java installation folder to the PATH variable, because tabula-py runs tabula-java behind the scenes.
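To confirm that Java is reachable before running tabula-py, a quick check like the one below can help. This is only a sketch using the Python standard library; the function names find_java and java_version_text are made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")


def java_version_text(java_path):
    """Return what `java -version` prints (it normally writes to stderr)."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("java was not found on PATH; tabula-py cannot start tabula-java")
    else:
        print(java_version_text(java))
```

If the script reports that java was not found, fix the PATH variable first and open a new terminal before trying again.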

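Approach:

Read the PDF with read_pdf(), which returns a list of DataFrames, then write the required DataFrame to a file with to_csv().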

Syntax:

read_pdf(input_path, pages = pages, **kwargs)

Here pages selects which pages to read, for example 1, [1, 2], '1-3' or 'all'.

Below is the Implementation:

PDF File Used:

PDF FILE




# Import the module
import tabula

# Read the PDF file; read_pdf returns a list of DataFrames,
# so take the first table found on page 1
df = tabula.read_pdf('PDF File Path', pages = 1)[0]

# Write the DataFrame to a CSV file
df.to_csv('CSV File Path')

Output:

CSV FILE
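The code in Method 2 keeps only the first table of page 1. When a PDF holds several tables, one possible extension is sketched below: it reads all pages and writes each table to its own CSV file. The helper names and the "_tableN.csv" naming scheme are assumptions made for this example. The last function uses tabula.convert_into(), which lets tabula-py write the CSV directly without going through a DataFrame.

```python
from pathlib import Path


def table_csv_path(pdf_path, index):
    """Build an output name such as report_table1.csv for the index-th table."""
    pdf = Path(pdf_path)
    return str(pdf.with_name(f"{pdf.stem}_table{index}.csv"))


def pdf_tables_to_csv(pdf_path, pages="all"):
    """Write every table tabula-py finds to its own CSV; return the paths."""
    import tabula  # imported lazily; needs tabula-py and Java installed

    tables = tabula.read_pdf(pdf_path, pages=pages)  # list of DataFrames
    written = []
    for index, table in enumerate(tables, start=1):
        out_path = table_csv_path(pdf_path, index)
        table.to_csv(out_path, index=False)  # skip the pandas row index
        written.append(out_path)
    return written


def pdf_to_csv_direct(pdf_path, csv_path, pages="all"):
    """Let tabula-py write a single CSV itself, without using pandas."""
    import tabula

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages=pages)
    return csv_path
```

Separate files keep tables with different columns apart, while pdf_to_csv_direct is the shorter route when a single combined CSV is enough.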

