Build an Application to extract URL and Metadata from a PDF using Python

Last Updated : 29 Dec, 2022

PDF (Portable Document Format) is a widely used, platform-independent file format developed by Adobe for presenting documents. Many Python packages work with PDFs; one of them is the pdfx module, which extracts URLs, metadata, and plain text from a local PDF file or a PDF URL.

Features:

Extract references and metadata from a given PDF.
Detect PDF, URL, arXiv, and DOI references.
Download all referenced PDFs quickly and in parallel.
Check for broken links (using the -c flag).
Output as text or JSON (using the -j flag).
Extract the PDF text (using the --text flag).
Use a command-line tool or Python package.
Compatible with Python 2 and 3.
Works with local and online PDFs.

Getting Started:

First, install the pdfx module by running the following command in the terminal.

pip install pdfx

Approach:

Import the pdfx module.
Read the PDF file by creating a pdfx.PDFx() object.
Get the metadata with the get_metadata() method.
Get the URLs with the get_references_as_dict() method.
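
The steps above can be sketched as one small helper. This is only an illustration: summarize_pdf() and summarize_path() are hypothetical names that are not part of pdfx, and the helper assumes the references dictionary uses the 'url' key shown in the sample output of this article.

```python
def summarize_pdf(pdf):
    """Collect metadata and URL references from a PDFx-like object.

    `pdf` only needs get_metadata() and get_references_as_dict(),
    the two methods used in this article.
    """
    metadata = pdf.get_metadata()
    references = pdf.get_references_as_dict()
    # Assumption: URLs are stored under the 'url' key; fall back to an empty list.
    return {"metadata": metadata, "urls": list(references.get("url", []))}


def summarize_path(path_or_url):
    """Open a local PDF or PDF URL with pdfx and summarize it."""
    import pdfx  # third-party; imported lazily so the helper above has no dependency

    return summarize_pdf(pdfx.PDFx(path_or_url))
```

Calling summarize_path("geeksforgeeks.pdf") would then return a dictionary with the metadata and the list of URLs in one place.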

Implementation:

Step 1: Importing modules and reading PDF files.

Python3

# import module
import pdfx
 
# reading pdf file
pdf = pdfx.PDFx("geeksforgeeks.pdf")
 
# display
print(pdf)

Output:

<pdfx.PDFx at 0x1c189244a88>

This means a pdfx.PDFx object was created at memory address 0x1c189244a88. The address will differ on your machine.

Step 2: Getting metadata from PDF.

Python3

# print the metadata dictionary
print(pdf.get_metadata())

Output:

{'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
 'Producer': 'Skia/PDF m85',
 'CreationDate': "D:20200911041438+00'00'",
 'ModDate': "D:20200911041438+00'00'",
 'Pages': 2}
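
The CreationDate and ModDate values above use the PDF date string layout (D:YYYYMMDDHHmmSS followed by a time zone offset). As a minimal sketch, they could be converted to Python datetime objects as shown below. parse_pdf_date() is a hypothetical helper, not part of pdfx. It handles only the full form with seconds, and it treats a missing offset as UTC, which is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as +00'00', -05'30' or Z
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Convert a full PDF date string into a timezone-aware datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date string: {value!r}")
    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_hours, off_minutes = groups[6:]
    if sign in (None, "Z"):
        # Assumption: no offset (or Z) is treated as UTC.
        offset = timedelta(0)
    else:
        offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
        if sign == "-":
            offset = -offset
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))


created = parse_pdf_date("D:20200911041438+00'00'")
print(created.isoformat())  # 2020-09-11T04:14:38+00:00
```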

Step 3: Getting the URLs from the PDF.

Python3

# print the references dictionary
print(pdf.get_references_as_dict())

Output:

{'url': ['https://www.geeksforgeeks.org/cookie-policy/',
  'https://www.geeksforgeeks.org/privacy-policy/',
  'https://www.geeksforgeeks.org/',
  'https://www.geeksforgeeks.org/optparse-module-in-python/']}
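
Once the references are available as a dictionary, ordinary Python can post-process them. The sketch below groups the extracted links by host name using only the standard library. group_urls_by_host() is a hypothetical helper, and it assumes the links sit under the 'url' key as in the output above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_urls_by_host(references):
    """Group the links of a references dictionary by their host name."""
    grouped = defaultdict(list)
    for link in references.get("url", []):
        # netloc is the host part, e.g. 'www.geeksforgeeks.org'
        host = urlsplit(link).netloc or "(no host)"
        grouped[host].append(link)
    return dict(grouped)


# sample data copied from the output shown above
references = {
    "url": [
        "https://www.geeksforgeeks.org/cookie-policy/",
        "https://www.geeksforgeeks.org/privacy-policy/",
        "https://www.geeksforgeeks.org/",
        "https://www.geeksforgeeks.org/optparse-module-in-python/",
    ]
}

for host, links in group_urls_by_host(references).items():
    print(host, len(links))
```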

Application to extract URL and Metadata from a PDF with tkinter: The script below implements the above approach in a graphical user interface.

Python3

# import modules
from tkinter import *
import pdfx
 
 
# callback: read the PDF named in the entry box
# and display its metadata and URLs
def get_info():
 
    pdf = pdfx.PDFx(e1.get())
    meta.set(str(pdf.get_metadata()))
    url.set(str(pdf.get_references_as_dict()))
 
 
# create the main tkinter window
# and set its background to light grey
master = Tk()
master.configure(bg='light grey')
 
 
# Variable Classes in tkinter
meta = StringVar()
url = StringVar()
 
 
# create a caption label for each field
# using the Label widget
Label(master, text="PDF or PDF-URL : ", bg="light grey").grid(row=0, sticky=W)
Label(master, text="Meta information :", bg="light grey").grid(row=3, sticky=W)
Label(master, text="URL information :", bg="light grey").grid(row=4, sticky=W)
 
 
# create labels that display the values
# stored in the StringVar variables
Label(master, text="", textvariable=meta,
      bg="light grey").grid(row=3, column=1, sticky=W)
Label(master, text="", textvariable=url, bg="light grey").grid(
    row=4, column=1, sticky=W)
 
 
e1 = Entry(master, width=100)
e1.grid(row=0, column=1)
 
 
# create a button that calls
# get_info() when clicked
b = Button(master, text="Show", command=get_info, bg="Blue")
b.grid(row=0, column=2, columnspan=2, rowspan=2, padx=5, pady=5)
 
 
mainloop()
 
# this code belongs to Satyam kumar (ksatyam858)
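
The callback above assumes that the entered path or URL always points to a readable PDF, and it shows the raw dictionaries in the labels. A possible refinement, sketched under assumptions, is to move the work into a helper that catches errors and returns display-ready text. format_for_label() and safe_get_info() are hypothetical names, the broad except is deliberate because the exact exception types raised by pdfx are not covered here, and loader stands for any callable such as pdfx.PDFx.

```python
def format_for_label(metadata, references):
    """Turn the two dictionaries into multi-line strings for Label widgets."""
    meta_text = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    # Assumption: links are stored under the 'url' key.
    url_text = "\n".join(references.get("url", []))
    return meta_text, url_text


def safe_get_info(source, loader):
    """Load a PDF with `loader` and return (meta_text, url_text).

    Any failure is reported as text instead of raising, so a GUI
    callback can display the message rather than fail silently.
    """
    try:
        pdf = loader(source)
        return format_for_label(pdf.get_metadata(), pdf.get_references_as_dict())
    except Exception as exc:  # broad on purpose: exception types are not assumed
        return f"Could not read PDF: {exc}", ""
```

Inside get_info(), the result of safe_get_info(e1.get(), pdfx.PDFx) could then be unpacked and passed to meta.set() and url.set().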