Convert PDF to TXT File Using Python

As more documents go digital, extracting text from PDF files has become a common step in data analysis and content processing. Python has a rich ecosystem of libraries for working with different file formats, including PDFs. In this article, we will build a simple PDF-to-text converter in Python using the PyPDF2 library.

What is PyPDF2?

PyPDF2 is a Python library for working with PDF files. It can extract text, merge files, split them into smaller parts, crop pages, and manipulate PDFs programmatically. This makes it straightforward to pull the text out of a PDF and process it further.

Convert a PDF to TXT Using Python

Below is a step-by-step implementation of a PDF to TXT converter in Python:

Installation of PyPDF2

Open the command prompt on your system and run the following pip command to install the library:

pip install PyPDF2
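To confirm that the installation worked before running the converter, a small check like the one below may help. This is a minimal sketch that uses only the standard library; the helper name is_installed is hypothetical and is not part of PyPDF2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    # find_spec returns None when the module is not available.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("PyPDF2"):
        print("PyPDF2 is installed.")
    else:
        print("PyPDF2 is missing. Run: pip install PyPDF2")
```

If the second message appears, run the pip command again in the same environment that runs your script.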


Writing Python Code to Convert PDF to TXT File

gfg.pdf

GeeksforGeeks is coding Platform

The Python code below uses the PyPDF2 library to convert a PDF file to text. It defines a function, pdf_to_text, which opens the PDF file, extracts the text from each page, and writes the combined text to the specified text file. When executed, it converts 'gfg.pdf' into 'gfg.txt' and prints a success message.

Python3

import PyPDF2

def pdf_to_text(pdf_path, output_txt):
    # Open the PDF file in read-binary mode
    with open(pdf_path, 'rb') as pdf_file:
        # PdfReader replaces the older PdfFileReader class
        pdf_reader = PyPDF2.PdfReader(pdf_file)

        # Collect the text of every page into one string
        text = ''
        for page in pdf_reader.pages:
            # Fall back to '' if a page yields no text
            text += page.extract_text() or ''

    # Write the extracted text to a text file
    with open(output_txt, 'w', encoding='utf-8') as txt_file:
        txt_file.write(text)

if __name__ == "__main__":
    pdf_path = 'gfg.pdf'
    output_txt = 'gfg.txt'

    pdf_to_text(pdf_path, output_txt)
    print("PDF converted to text successfully!")

Output:

PDF converted to text successfully!

gfg.txt

GeeksforGeeks is coding Platform
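The converter above joins the pages back to back, so the last line of one page runs straight into the first line of the next. Scanned or image-only pages contain no text layer and contribute nothing. A possible variation is sketched below, assuming you want a newline between pages and want empty pages skipped. The names pages_to_text and pdf_to_text_by_page are hypothetical helpers, not part of PyPDF2; only PdfReader, pages, and extract_text come from the library.

```python
def pages_to_text(pages, separator="\n"):
    """Join the text of each page, skipping pages that yield no text."""
    chunks = []
    for page in pages:
        # Treat a missing or empty result as "no text" (a defensive assumption).
        page_text = page.extract_text() or ""
        if page_text.strip():
            chunks.append(page_text)
    return separator.join(chunks)


def pdf_to_text_by_page(pdf_path, output_txt):
    """Convert a PDF to a text file, one separator between non-empty pages."""
    # Imported here so pages_to_text can be tried out without PyPDF2 installed.
    import PyPDF2

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        text = pages_to_text(reader.pages)

    with open(output_txt, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

    # Return the number of characters written so the caller can spot empty output.
    return len(text)
```

Calling pdf_to_text_by_page('gfg.pdf', 'gfg.txt') would write the same kind of file as before. A return value of 0 suggests the PDF has no extractable text, which is common for scanned documents.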
