Most readers will already be familiar with PDFs, one of the most important and widely used digital document formats. PDF stands for Portable Document Format and uses the .pdf extension. It is used to present and exchange documents reliably, independent of software, hardware, or operating system.
Extracting Text from PDF File
The Python package PyPDF2 can be used for text extraction, although it can do more than we need here. It can also be used to generate, decrypt, and merge PDF files.
Note: For more information, refer to Working with PDF files in Python
To install this package, type the command below in the terminal.
pip install PyPDF2
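The chunks discussed below fit together into one short program. A minimal sketch of it, assembled only from the steps this article describes, might look like the following. It assumes a PyPDF2 release older than 3.0, where the PdfFileReader, numPages, getPage() and extractText() names are available; the wrapper function read_first_page is a hypothetical name introduced for illustration.

```python
def read_first_page(path):
    """Return (page_count, first_page_text) for the PDF at `path`.

    Hypothetical helper; assumes PyPDF2 < 3.0 (legacy camelCase API).
    """
    # Imported inside the function so the legacy dependency is only
    # needed when the helper is actually called.
    import PyPDF2

    # Open the PDF in binary mode
    pdfFileObj = open(path, 'rb')

    # Create a PDF reader object from the file object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # Number of pages in the PDF file
    numPages = pdfReader.numPages

    # Create a page object for the first page (index 0)
    pageObj = pdfReader.getPage(0)

    # Extract the text from that page
    text = pageObj.extractText()

    # Close the PDF file object
    pdfFileObj.close()

    return numPages, text
```

Calling read_first_page('example.pdf') and printing the two returned values would give the page count first and the text of the first page second.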
Let us walk through the program in chunks:
pdfFileObj = open('example.pdf', 'rb')
We open example.pdf in binary mode and save the file object as pdfFileObj.
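Binary mode matters because a PDF is not a plain text file, so its bytes must be read unchanged. As a small illustration, the sketch below reads the first few bytes of a file in 'rb' mode and checks for the %PDF- marker that PDF files normally begin with. The helper name looks_like_pdf is hypothetical and is not part of PyPDF2.

```python
def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF header marker."""
    # 'rb' yields raw bytes: no text decoding and no newline translation
    with open(path, 'rb') as file_obj:
        header = file_obj.read(5)
    return header == b'%PDF-'
```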
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
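Here, we create an object of the PdfFileReader class of the PyPDF2 module, passing it the PDF file object, and get a PDF reader object. Note that PdfFileReader is the name used by PyPDF2 releases before 3.0; from 3.0 onward the class is called PdfReader.
The numPages property gives the number of pages in the PDF file. For example, in our case, it is 20 (see first line of output).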
pageObj = pdfReader.getPage(0)
Now, we create an object of the PageObject class of the PyPDF2 module. The PDF reader object has a function getPage(), which takes a page number (starting from index 0) as its argument and returns the page object.
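Because getPage() counts from 0 while people usually count pages from 1, an off-by-one error is easy to make. A minimal sketch of a conversion helper follows, assuming the page count comes from numPages; the name to_page_index is hypothetical.

```python
def to_page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(
            f"page {page_number} is outside the range 1..{num_pages}"
        )
    return page_number - 1
```

For the 20-page file used in this article, to_page_index(1, 20) gives 0 and to_page_index(20, 20) gives 19.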
The page object has a function extractText() to extract text from the PDF page.
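The steps so far read a single page. To collect the text of every page, the same idea can be put in a loop. The sketch below assumes the newer snake_case API (reader.pages and extract_text()), which PyPDF2 3.0 and later, and its successor pypdf, provide in place of getPage() and extractText(). The function name extract_all_text is hypothetical.

```python
def extract_all_text(reader):
    """Return the text of every page in `reader`, one string per page.

    `reader` is assumed to be a PdfReader-like object exposing `.pages`,
    where each page has an `extract_text()` method.
    """
    texts = []
    for page in reader.pages:
        # Defensive: treat a missing result as empty text
        texts.append(page.extract_text() or "")
    return texts
```

With PyPDF2 3.x, one way to call it would be extract_all_text(PyPDF2.PdfReader('example.pdf')), since PdfReader accepts a file path as well as an open file object.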
Finally, we close the PDF file object.
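Closing the file by hand is easy to forget, especially if an error occurs midway. A with block closes the file automatically when the block ends. The sketch below shows the pattern; with_pdf_file and its process argument are hypothetical names, standing in for whatever PDF-reading steps are needed.

```python
def with_pdf_file(path, process):
    """Open `path` in binary mode, run `process(file_obj)`, and always close the file."""
    with open(path, 'rb') as pdfFileObj:
        # The file is closed when this block exits, even if process() raises
        return process(pdfFileObj)
```

Any text extraction should happen inside process, while the file is still open.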