Exporting PDF Data using Python
  • Last Updated : 10 May, 2020

Sometimes we need to extract data from a PDF, and copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract text from PDFs.

Extracting Text With PDFMiner

PDFMiner is a text extraction tool for PDF documents. On Python 3, the maintained fork is published on PyPI as pdfminer.six and provides the same pdfminer modules. You can install PDFMiner with pip:

pip install pdfminer

Let’s get started by extracting all the text of a PDF, page by page. Extracting each page's text requires the following steps:

  • Create a resource manager instance.
  • Create a file-like object via Python’s io module.
  • Create a converter.
  • Create a PDF interpreter object that takes the resource manager and converter objects and extracts the text.
  • Open the PDF and loop through each page.

Below is the implementation.

PDF File Used: [image: python-pdfminer-1]

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page in the PDF at pdf_path."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # Fresh resource manager, text buffer and converter per page
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            # Close open handles before yielding, so they are released
            # even if the caller stops iterating early
            converter.close()
            fake_file_handle.close()

            yield text


def extract_text(pdf_path):
    """Print the text of every page, each followed by a blank line."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # extract_text prints the pages itself and returns None,
    # so it is called directly instead of being wrapped in print()
    extract_text('GFG.pdf')

Output: [image: python-pdfminer-extract-data-from-pdf]

In this example, extract_text_by_page is a generator function that yields the text of each page, and extract_text prints each page's text followed by a blank line.
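Once the page text is available, it can be exported in a different format. The following is a minimal sketch of one way to do that with Python's standard json module, not part of PDFMiner itself. The helper name export_pages_to_json and the output layout (a "pages" list of page number and text records) are assumptions made for illustration. It accepts any iterable of strings, so the generator returned by extract_text_by_page could be passed in directly.

```python
import json


def export_pages_to_json(pages, json_path):
    """Write an iterable of page texts to a JSON file.

    `pages` is any iterable of strings, for example the generator
    returned by extract_text_by_page(). The function name and the
    output layout are illustrative choices, not a PDFMiner API.
    Returns the number of pages written.
    """
    # Number pages from 1 so the records follow the PDF's page order
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]
    with open(json_path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII characters readable
        json.dump({"pages": records}, fh, indent=2, ensure_ascii=False)
    return len(records)
```

Under these assumptions, calling export_pages_to_json(extract_text_by_page('GFG.pdf'), 'GFG.json') would write one record per page to a file named GFG.json (a hypothetical output name) and return the page count.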
