PDF (Portable Document Format) is one of the most commonly used platform-independent file formats, developed by Adobe for presenting documents. There are many PDF-related packages for Python; one of them is the pdfx module, which extracts URLs, metadata, and plain text from a given PDF file or PDF URL.
Features:
- Extract references and metadata from a given PDF.
- Detect PDF, URL, arXiv, and DOI references.
- Fast, parallel download of all referenced PDFs.
- Check for broken links (using the -c flag).
- Output as text or JSON (using the -j flag).
- Extract the PDF text (using the --text flag).
- Use a command-line tool or Python package.
- Compatible with Python 2 and 3.
- Works with local and online PDFs.
Getting Started:
First, install the pdfx module by running the command below in the terminal.
pip install pdfx
Approach:
- Import the pdfx module.
- Read the PDF file with the pdfx.PDFx() constructor.
- Get the metadata with the get_metadata() method.
- Get the URLs with the get_references_as_dict() method.
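The steps above can be sketched as one small helper. This is only a minimal sketch, not part of pdfx: the function name `summarize_pdf` is hypothetical, and it accepts any object that exposes `get_metadata()` and `get_references_as_dict()`, so it can be tried with a stand-in object even before pdfx is installed.

```python
import json


def summarize_pdf(pdf):
    """Collect metadata and references from a PDFx-like object as JSON text.

    `pdf` is assumed to expose get_metadata() and get_references_as_dict(),
    the two pdfx.PDFx methods used in this article.
    """
    summary = {
        "metadata": pdf.get_metadata(),
        "references": pdf.get_references_as_dict(),
    }
    # default=str keeps the dump from failing on values json cannot encode.
    return json.dumps(summary, indent=2, default=str)


# Typical use, assuming pdfx is installed and the file exists:
#   import pdfx
#   print(summarize_pdf(pdfx.PDFx("geeksforgeeks.pdf")))
```

Returning JSON text keeps the result easy to save or pass to another tool, similar in spirit to the -j flag of the command-line tool.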
Implementation:
Step 1: Importing modules and reading PDF files.
Python3
import pdfx
pdf = pdfx.PDFx("geeksforgeeks.pdf")
print(pdf)
Output:
<pdfx.PDFx at 0x1c189244a88>
This means a pdfx.PDFx object has been created. The value 0x1c189244a88 is the object's address in memory, so it will differ on your machine.
Step 2: Getting metadata from PDF.
Python3
pdf.get_metadata()
Output:
{'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'Producer': 'Skia/PDF m85',
'CreationDate': "D:20200911041438+00'00'",
'ModDate': "D:20200911041438+00'00'",
'Pages': 2}
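The CreationDate and ModDate values above use the PDF date string format (D:YYYYMMDDHHmmSS followed by a UTC offset). The sketch below shows one way to turn such a string into a Python datetime. The helper name `parse_pdf_date` is hypothetical and not part of pdfx; it assumes the full form with seconds, as in the output above, and it treats a missing offset as UTC, which is a simplification.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for full PDF date strings such as "D:20200911041438+00'00'".
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value):
    """Convert a full PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a full PDF date string: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_h, off_m = groups[6:]
    # Assumption: a missing or "Z" offset is treated as UTC.
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))


created = parse_pdf_date("D:20200911041438+00'00'")
print(created.isoformat())  # 2020-09-11T04:14:38+00:00
```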
Step 3: Getting the URLs from the PDF.
Python3
pdf.get_references_as_dict()
Output:
{'url': ['https://www.geeksforgeeks.org/cookie-policy/',
'https://www.geeksforgeeks.org/privacy-policy/',
'https://www.geeksforgeeks.org/',
'https://www.geeksforgeeks.org/optparse-module-in-python/']}
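Since the result is a plain dictionary, it can be processed with the standard library. As a small illustration, the sketch below groups the extracted URLs by host name. The function `group_urls_by_host` is a hypothetical helper, not a pdfx API, and the sample data is copied from the output above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_urls_by_host(references):
    """Map each host name to the URLs found under the 'url' key."""
    grouped = defaultdict(list)
    for link in references.get("url", []):
        grouped[urlparse(link).netloc].append(link)
    return dict(grouped)


# Sample data copied from the output above.
references = {
    "url": [
        "https://www.geeksforgeeks.org/cookie-policy/",
        "https://www.geeksforgeeks.org/privacy-policy/",
        "https://www.geeksforgeeks.org/",
        "https://www.geeksforgeeks.org/optparse-module-in-python/",
    ]
}

for host, links in group_urls_by_host(references).items():
    print(host, len(links))  # www.geeksforgeeks.org 4
```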
Application to extract URLs and metadata from a PDF with tkinter: the script below wraps the above approach in a graphical user interface.
Python3
from tkinter import *
import pdfx


def get_info():
    # Read the PDF path or URL from the entry box and display its data.
    pdf = pdfx.PDFx(str(e1.get()))
    meta.set(pdf.get_metadata())
    url.set(pdf.get_references_as_dict())


master = Tk()
master.configure(bg='light grey')

# Variables that hold the text shown in the result labels.
meta = StringVar()
url = StringVar()

# Static captions.
Label(master, text="PDF or PDF-URL : ", bg="light grey").grid(row=0, sticky=W)
Label(master, text="Meta information :", bg="light grey").grid(row=3, sticky=W)
Label(master, text="URL information :", bg="light grey").grid(row=4, sticky=W)

# Result labels, updated through the StringVar objects.
Label(master, text="", textvariable=meta,
      bg="light grey").grid(row=3, column=1, sticky=W)
Label(master, text="", textvariable=url,
      bg="light grey").grid(row=4, column=1, sticky=W)

# Entry box for the PDF path or URL.
e1 = Entry(master, width=100)
e1.grid(row=0, column=1)

# Button that triggers the extraction.
b = Button(master, text="Show", command=get_info, bg="Blue")
b.grid(row=0, column=2, columnspan=2, rowspan=2, padx=5, pady=5)

mainloop()
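The script above places each dictionary into its label as a single long string. One possible refinement, shown here only as a sketch, is to format the dictionary as one entry per line before passing it to the StringVar. The helper name `format_mapping` is hypothetical and not part of pdfx or tkinter.

```python
def format_mapping(data):
    """Render a dict as one 'key: value' line per entry; lists are indented."""
    lines = []
    for key, value in data.items():
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            lines.extend(f"  {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    return "\n".join(lines)


print(format_mapping({"Producer": "Skia/PDF m85", "Pages": 2}))
```

Inside get_info(), the calls would then become meta.set(format_mapping(pdf.get_metadata())) and url.set(format_mapping(pdf.get_references_as_dict())); a tkinter Label renders the newline characters as separate lines.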
Last Updated :
29 Dec, 2022