
Exporting PDF Data using Python

Sometimes we need to extract data from a PDF, and copying and pasting it by hand is time-consuming. Python has packages that can extract data from a PDF and export it in a different format. In this article, we will learn how to extract text from PDFs.

Extracting Text With PDFMiner

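PDFMiner is a text extraction tool for PDF documents. You can install it with pip:
pip install pdfminer
The original pdfminer package is no longer actively maintained. If the install or the imports used below fail, the community fork pdfminer.six (pip install pdfminer.six) provides the same pdfminer modules.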
Let’s start by extracting all the text of a PDF, page by page. This requires the following steps:
  • create a resource manager instance.
  • create a file-like object via Python’s io module.
  • create a converter.
  • create a PDF interpreter object that will take our resource manager and converter objects and extract the text.
  • open the PDF and loop through each page.
Below is the implementation.
PDF file used: [image: python-pdfminer-1]
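import io

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_by_page(pdf_path):
    """Yield the extracted text of each page in the PDF, one page at a time."""
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            # Fresh resource manager, buffer, and converter for every page,
            # so each yielded string holds only that page's text.
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager,
                                      fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager,
                                                  converter)

            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()

            # Close open handles before yielding, so they are released
            # even if the caller stops iterating early.
            converter.close()
            fake_file_handle.close()

            yield text


def extract_text(pdf_path):
    """Print the text of each page, followed by a blank line."""
    for page in extract_text_by_page(pdf_path):
        print(page)
        print()


# Driver code
if __name__ == '__main__':
    # extract_text prints the pages itself and returns None,
    # so its result is not passed to print().
    extract_text('GFG.pdf')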

                    
Output: [image: python-pdfminer-extract-data-from-pdf]

In this example, we create a generator function that yields the text of each page. The extract_text function then prints out the text of each page.
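Since the goal is to export the extracted data in a different format, the per-page text could also be written to a file instead of printed. The following is only a minimal sketch: the helper name export_pages_to_json, the JSON layout, and the output filename are assumptions made for illustration and are not part of PDFMiner. It accepts any iterable of page strings, such as the generator returned by extract_text_by_page.

```python
import json


def export_pages_to_json(pages, json_path):
    """Write an iterable of page strings to a JSON file.

    Hypothetical helper (not part of PDFMiner). Returns the number
    of pages written.
    """
    # Number the pages from 1 so the output matches how readers count pages.
    records = [
        {"page": number, "text": text}
        for number, text in enumerate(pages, start=1)
    ]

    with open(json_path, "w", encoding="utf-8") as out:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump({"pages": records}, out, indent=2, ensure_ascii=False)

    return len(records)
```

With the generator from the example above, a call might look like export_pages_to_json(extract_text_by_page('GFG.pdf'), 'GFG.json'), where GFG.json is an arbitrary output name.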

Last Updated : 10 May, 2020