Convert PDF to CSV using Python

  • Difficulty Level : Basic
  • Last Updated : 02 Feb, 2021

Python is a popular high-level, general-purpose programming language. Python 3, the current version, is widely used in web development and machine learning, among other areas of the software industry. It suits beginners as well as programmers who already know languages such as C++ and Java.

In this article, we will learn how to convert a PDF file to a CSV file using Python. We will discuss two methods for the conversion. Both methods take a PDF file as input.


Method 1:



Here we will use the pdftables_api module to convert the PDF file. The pdftables_api module reads the tables in a PDF and can also convert PDF files into other formats.

Installation:

Open Command Prompt and type "pip install git+https://github.com/pdftables/python-pdftables-api.git"
  • This installs the pdftables_api module.
  • After installation, you need an API key.
  • Go to PDFTables.com and sign up, then visit the API page to see your API key.
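
Hard-coding the API key in a script makes it easy to leak by accident. A minimal sketch of one alternative, assuming the key is stored in an environment variable called PDFTABLES_API_KEY (a hypothetical name chosen for this example; the pdftables_api module does not require it):

```python
import os


def load_api_key(var_name="PDFTABLES_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch;
    pdftables_api itself does not look for it.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key")
    return key
```

The returned string can then be passed to pdftables_api.Client() in place of a literal key.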

Approach:

  • Create a client with your API key.
  • To convert the PDF file into a CSV file, we will use the csv() method.

Syntax:

pdftables_api.Client('API KEY').csv(pdf_path, csv_path)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3




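# Import the module
import pdftables_api

# Create a client with your API key
conversion = pdftables_api.Client('API KEY')

# Paths of the input PDF and the output CSV,
# for example Hello.pdf and Hello
pdf_file_path = 'Hello.pdf'
output_file_path = 'Hello'

# Convert the PDF to CSV
conversion.csv(pdf_file_path, output_file_path)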

Output:



CSV FILE
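
To check the result without opening a spreadsheet program, the first few rows of the CSV file can be read back with Python's built-in csv module. This is only a sketch: the function name preview_csv is made up for this example, and it assumes the output file is UTF-8 encoded.

```python
import csv
from itertools import islice


def preview_csv(csv_path, max_rows=5):
    """Return up to `max_rows` rows of the CSV file as lists of strings."""
    # encoding="utf-8" is an assumption about the converted file
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(islice(csv.reader(handle), max_rows))
```

For instance, preview_csv('Hello.csv') would return the first five rows of a file named Hello.csv, each row as a list of cell values.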

Method 2:

Here we will use the tabula-py module to convert the PDF file. tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into a pandas DataFrame. tabula-py also enables you to convert a PDF file into a CSV, a TSV, or a JSON file.

Installation:

pip install tabula-py

Before we start, we need to install Java and add the Java installation folder to the PATH variable.

  • Install Java.
  • Add the Java installation folder (for example, C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
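
Whether the PATH change took effect can be checked from Python before running tabula-py. The following is a small sketch using only the standard library; the function name find_java is made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("java was not found on PATH")
    else:
        # "java -version" writes its report to stderr rather than stdout
        result = subprocess.run([java_path, "-version"],
                                capture_output=True, text=True)
        print(result.stderr.strip() or result.stdout.strip())
```

If the script reports that java was not found, open a new Command Prompt after editing PATH, since windows that were already open keep the old value.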

Approach:

  • Read the PDF file using the read_pdf() method.
  • Then write the resulting DataFrame to a CSV file using the to_csv() method.

Syntax:

read_pdf(PDF File Path, pages = Page numbers, **kwargs)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3




# Import the module
import tabula

# Read the PDF file; read_pdf() returns a list of
# DataFrames, so take the first table found on page 1
df = tabula.read_pdf('PDF File Path', pages=1)[0]

# Write the table to a CSV file
df.to_csv('CSV File Path')
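
The code above keeps only the first table on page 1. When a PDF holds several tables, each one can be written to its own CSV file. The sketch below assumes that passing pages="all" to read_pdf() returns a list with one DataFrame per detected table; the helper names and the output naming scheme (Hello_table_1.csv and so on) are made up for this example.

```python
from pathlib import Path


def table_csv_path(pdf_path, index):
    """Build an output name such as Hello_table_1.csv for the table at `index` (0-based)."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_table_{index + 1}.csv")


def pdf_tables_to_csv(pdf_path):
    """Write every table tabula-py finds in the PDF to its own CSV file."""
    import tabula  # imported here so the helper above works without tabula-py

    # pages="all" scans the whole document instead of a single page
    tables = tabula.read_pdf(pdf_path, pages="all")
    written = []
    for index, table in enumerate(tables):
        out_path = table_csv_path(pdf_path, index)
        # index=False leaves the DataFrame row numbers out of the file
        table.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Calling pdf_tables_to_csv('Hello.pdf') would return the list of CSV paths it wrote, which is empty if no table was detected.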

Output:

CSV FILE



