
Exporting PDF Data using Python

Last Updated : 10 May, 2020

Sometimes we need to extract data from a PDF. Copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract text from PDFs.

Extracting Text With PDFMiner

PDFMiner is a text extraction tool for PDF documents. The community-maintained fork is published on PyPI as pdfminer.six and uses the same pdfminer import name. You can install PDFMiner with pip:

pip install pdfminer

Let’s start by extracting all the text of a PDF, page by page. Extracting the data of each page takes the following steps:

  • create a resource manager instance.
  • create a file-like object via Python’s io module.
  • create a converter.
  • create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
  • open the PDF and loop through each page.
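The file-like object in the second step is an in-memory text buffer: the converter writes extracted text into it as if it were a file, and the text is read back afterwards. As a minimal sketch that uses only the standard library and does not involve PDFMiner (the variable names here are illustrative), io.StringIO behaves like this:

```python
import io

# An in-memory text buffer that behaves like a file opened for writing
buffer = io.StringIO()

# Any code that expects a writable text file can write into it
buffer.write("first line\n")
print("second line", file=buffer)

# getvalue() returns everything written so far as a single string
text = buffer.getvalue()

# Closing the buffer discards its contents, so read the value before closing
buffer.close()
```

This is why the text has to be fetched with getvalue() before the buffer is closed.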

Below is the implementation.

PDF File Used: GFG.pdf




import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page in the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # Fresh resource manager and in-memory text buffer for each page
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()

            # The converter writes the text it extracts into the buffer
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            # Close open handles before handing the text to the caller
            converter.close()
            fake_file_handle.close()

            yield text


def extract_text(pdf_path):
    """Print the text of each page, followed by a blank line."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # extract_text prints as it goes and returns None, so call it directly
    extract_text('GFG.pdf')


Output:

The text of each page of GFG.pdf is printed in page order, with a blank line after each page.

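In this example, extract_text_by_page is a generator function that yields the text of one page at a time, so the whole document never has to be held in memory. The extract_text function loops over that generator and prints the text of each page.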
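Once the page text is available, it can be exported in a different format. The following is a minimal sketch, not part of PDFMiner, that writes page texts to a JSON file. The function name export_pages_as_json, the "page" and "text" keys, and the output file name are assumptions chosen for illustration. It accepts any iterable of strings, so the generator from the example above can be passed in directly.

```python
import json


def export_pages_as_json(pages, json_path):
    """Write an iterable of page texts to json_path and return the page count."""
    # Number the pages from 1 so the records match the PDF's page order
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]

    # ensure_ascii=False keeps non-ASCII characters readable in the file
    with open(json_path, "w", encoding="utf-8") as out:
        json.dump(records, out, ensure_ascii=False, indent=2)

    return len(records)
```

With the earlier function in scope, a call such as export_pages_as_json(extract_text_by_page('GFG.pdf'), 'GFG.json') would write one record per page.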


