Sometimes we need to extract data from a PDF, and copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract data from PDFs.

Extracting Text With PDFMiner

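PDFMiner is a text extraction tool for PDF documents. You can install PDFMiner on your system using pip:
pip install pdfminer
The actively maintained fork, pdfminer.six, is installed with pip install pdfminer.six and provides the same pdfminer import namespace.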
Let’s get started by extracting all the text of a PDF page by page. Extracting the text of each page requires the following steps:
  • create a resource manager instance.
  • create a file-like object via Python’s io module.
  • create a converter.
  • create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
  • open the PDF and loop through each page.
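The file-like object in the second step is an in-memory text buffer: the converter writes into it as if it were a file, and the accumulated text is read back afterwards. Here is a minimal sketch of that idea using only the standard library. The strings written here are made-up stand-ins for converter output, not PDFMiner calls.

```python
import io

# In-memory text buffer that behaves like a file opened for writing.
buffer = io.StringIO()

# Anything that expects a writable text file can write to it.
# These two writes are illustrative stand-ins for a converter's output.
buffer.write("Page 1 text\n")
print("more text", file=buffer)

# getvalue() returns everything written so far as a single string.
text = buffer.getvalue()

# Close the buffer once its contents have been read.
buffer.close()

print(text)
```

Note that getvalue() must be called before close(), which is why the buffer is read first and closed last.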
Below is the implementation of these steps. PDF file used: python-pdfminer-1
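import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # fresh resource manager, buffer and converter for each page
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            # process the page; the converter writes its text into the buffer
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()
            yield text

            # close open handles
            converter.close()
            fake_file_handle.close()


def extract_text(pdf_path):
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # assumed file name; replace with the path to your own PDF
    extract_text('python-pdfminer-1.pdf')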
In this example, extract_text_by_page is a generator function that yields the text of each page, one page at a time. The extract_text function loops over that generator and prints out the text of each page.
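Because each page arrives as a plain string, exporting the result to another format takes little extra code. As a rough sketch, the hypothetical helper below writes page strings to a JSON file. The name pages_to_json, the record layout, and the output file name are illustrative assumptions, and the sample strings stand in for what extract_text_by_page would yield.

```python
import json
import os
import tempfile


def pages_to_json(pages, json_path):
    """Write an iterable of page strings to json_path as a list of records.

    Hypothetical helper for illustration; not part of PDFMiner.
    """
    # Number pages from 1 so the records match how pages are usually cited.
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]
    with open(json_path, "w", encoding="utf-8") as out:
        json.dump(records, out, ensure_ascii=False, indent=2)
    return records


# Stand-in page strings; in practice, pass extract_text_by_page(pdf_path).
sample_pages = ["First page text", "Second page text"]

# Assumed output location: a file in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "pages.json")
records = pages_to_json(sample_pages, out_path)
```

Any iterable of strings works here, so the generator from the example above can be passed in directly without first collecting the pages in a list.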

Last Updated : 10 May, 2020
