Exporting PDF Data using Python

Sometimes we need to extract data from a PDF. Copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract data from PDFs.

Extracting Text With PDFMiner

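PDFMiner is a text extraction tool for PDF documents. You can install it with pip as shown below. On Python 3, the community-maintained fork pdfminer.six (installed with pip install pdfminer.six) provides the same import paths used in this article.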

pip install pdfminer

Let’s get started by extracting all the text of a PDF, page by page. Extracting the data of each page requires the following steps:

  • create a resource manager instance.
  • create a file-like object via Python’s io module.
  • create a converter.
  • create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
  • open the PDF and loop through each page.

Below is the implementation.

PDF File Used: [image: python-pdfminer-1]

import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page in the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # Fresh resource manager, buffer and converter for every page,
            # so each yielded string holds the text of one page only
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            yield text

            # close open handles
            converter.close()
            fake_file_handle.close()


def extract_text(pdf_path):
    """Print the text of every page, with a blank line after each page."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # extract_text prints the pages itself and returns None,
    # so its result is not passed to print()
    extract_text('GFG.pdf')

Output: [image: python-pdfminer-extract-data-from-pdf]

In this example, extract_text_by_page is a generator function that yields the text of each page. The extract_text function loops over that generator and prints the text of each page, followed by a blank line.
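
Since the goal is to export the extracted data in a different format, the per-page text can also be written to a file instead of printed. The following is a minimal sketch, assuming JSON as the target format. The helper name export_pages_to_json and the "pages", "page" and "text" keys are illustrative choices, not part of PDFMiner. The helper accepts any iterable of page strings, so the generator returned by extract_text_by_page can be passed to it directly.

```python
import json


def export_pages_to_json(pages, json_path):
    """Write an iterable of page texts to a JSON file and return the page count.

    Hypothetical helper for illustration; `pages` can be any iterable of
    strings, for example the generator returned by extract_text_by_page.
    """
    # Number the pages from 1 so each record follows the PDF's page order
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]

    # ensure_ascii=False keeps non-ASCII characters readable in the file
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump({"pages": records}, fh, ensure_ascii=False, indent=2)

    return len(records)
```

With the code above in the same module, a call such as export_pages_to_json(extract_text_by_page('GFG.pdf'), 'GFG.json') would write one record per page. The output file name GFG.json is only an example.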



