
Build an Application to extract URL and Metadata from a PDF using Python


PDF (Portable Document Format) is a widely used, platform-independent file format developed by Adobe for presenting documents. Python has many PDF-related packages; one of them is the pdfx module, which extracts URLs, metadata, and plain text from a local PDF file or a PDF URL.


Features:

  • Extract references and metadata from a given PDF.
  • Detect PDF, URL, arXiv, and DOI references.
  • Fast, parallel download of all referenced PDFs.
  • Check for broken links (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use as a command-line tool or as a Python package.
  • Compatible with Python 2 and 3.
  • Works with local and online PDFs.
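
The flags listed above can be combined in a single call to the command-line tool. The following is a minimal sketch, assuming the pdfx executable is on the PATH after installation. The helper names build_pdfx_command and run_pdfx are hypothetical and are not part of pdfx.

```python
import subprocess


def build_pdfx_command(source, check_links=False, as_json=False, text=False):
    """Build the argument list for the pdfx command-line tool (hypothetical helper)."""
    command = ["pdfx"]
    if check_links:
        command.append("-c")      # check for broken links
    if as_json:
        command.append("-j")      # output as JSON
    if text:
        command.append("--text")  # extract the PDF text
    command.append(source)        # local path or URL of the PDF
    return command


def run_pdfx(source, **flags):
    """Run pdfx and return what it printed. Requires pdfx to be installed."""
    result = subprocess.run(
        build_pdfx_command(source, **flags),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Building the argument list separately from running it makes it easy to inspect the exact command before anything is executed.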

Getting Started:

First, install the pdfx module by running the command below in a terminal.

pip install pdfx


Approach:

  • Import the pdfx module.
  • Read the PDF file with the pdfx.PDFx() method.
  • Get the metadata with the get_metadata() method.
  • Get the URLs with the get_references_as_dict() method.
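
The steps above can be sketched as one small function. This is an illustration rather than part of pdfx: extract_pdf_info and its reader_factory parameter are hypothetical names, added so the function can be tried with a stand-in object when no PDF is at hand. By default it uses pdfx.PDFx as described above.

```python
def extract_pdf_info(source, reader_factory=None):
    """Return (metadata, references) for a local PDF path or a PDF URL.

    reader_factory is a hypothetical hook: any callable that takes the
    source and returns an object with get_metadata() and
    get_references_as_dict(). It defaults to pdfx.PDFx.
    """
    if reader_factory is None:
        import pdfx  # imported here so the module is only needed for real PDFs
        reader_factory = pdfx.PDFx
    pdf = reader_factory(source)
    return pdf.get_metadata(), pdf.get_references_as_dict()


class FakeReader:
    """Stand-in reader used only to demonstrate the call sequence."""

    def __init__(self, source):
        self.source = source

    def get_metadata(self):
        return {"Pages": 2}

    def get_references_as_dict(self):
        return {"url": ["https://example.com"]}


metadata, references = extract_pdf_info("sample.pdf", reader_factory=FakeReader)
print(metadata)
print(references)
```

With pdfx installed, calling extract_pdf_info("geeksforgeeks.pdf") without the second argument reads the real file.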


Step 1: Importing modules and reading PDF files.


# import module
import pdfx
# reading pdf file
pdf = pdfx.PDFx("geeksforgeeks.pdf")
# display
pdf


<pdfx.PDFx at 0x1c189244a88>

This means a pdfx.PDFx object was created at memory address 0x1c189244a88.

Step 2: Getting metadata from PDF.

# get metadata
metadata = pdf.get_metadata()
# display
metadata




{'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
'Producer': 'Skia/PDF m85',
'CreationDate': "D:20200911041438+00'00'",
'ModDate': "D:20200911041438+00'00'",
'Pages': 2}
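
The CreationDate and ModDate values in the metadata are PDF date strings of the form D:YYYYMMDDHHmmSS followed by an optional UTC offset. A minimal sketch of converting such a string into a Python datetime is shown below. The helper parse_pdf_date is a hypothetical name and not part of pdfx, and it assumes the value follows the standard PDF date layout.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z, or +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string into a datetime (hypothetical helper)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError("not a PDF date string: %r" % value)
    # Missing fields fall back to the earliest valid value.
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    sign, off_hours, off_minutes = match.group(7, 8, 9)
    if sign is None:
        tz = None  # no offset given, so the result is a naive datetime
    elif sign in "Zz":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(off_hours or 0), minutes=int(off_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    return datetime(*parts, tzinfo=tz)


created = parse_pdf_date("D:20200911041438+00'00'")
print(created.isoformat())  # 2020-09-11T04:14:38+00:00
```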

Step 3: Getting the URLs from the PDF.

# get URLs
url = pdf.get_references_as_dict()
# display
url




{'url': ['',
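
The output suggests that the returned dictionary maps a reference type such as 'url' to a list of strings, and that the list can contain empty entries. Under that assumption, a small helper can tidy the result before it is shown. The name list_references and the sample data are hypothetical and used only for illustration.

```python
def list_references(references, kinds=("url",)):
    """Return non-empty, de-duplicated references of the given kinds, in order."""
    seen = set()
    result = []
    for kind in kinds:
        for ref in references.get(kind, []):
            ref = ref.strip()
            if ref and ref not in seen:  # skip blanks and repeats
                seen.add(ref)
                result.append(ref)
    return result


# Made-up sample in the assumed shape, not real output from a PDF.
sample = {
    "url": ["", "https://example.com/a", "https://example.com/a"],
    "pdf": ["https://example.com/b.pdf"],
}
print(list_references(sample))
print(list_references(sample, kinds=("url", "pdf")))
```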

Application to extract URLs and metadata from a PDF with tkinter: the script below implements the above approach in a graphical user interface.


# import modules
from tkinter import *
import pdfx
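# user defined function: read the PDF named in the
# entry box and show its metadata and URLs in the labels
def get_info():
    pdf = pdfx.PDFx(str(e1.get()))
    meta.set(str(pdf.get_metadata()))
    url.set(str(pdf.get_references_as_dict()))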
# object of tkinter
# and background set for light grey
master = Tk()
master.configure(bg='light grey')
# Variable Classes in tkinter
meta = StringVar()
url = StringVar()
# Creating label for each information
# name using widget Label
Label(master, text="PDF or PDF-URL : ", bg="light grey").grid(row=0, sticky=W)
Label(master, text="Meta information :", bg="light grey").grid(row=3, sticky=W)
Label(master, text="URL information :", bg="light grey").grid(row=4, sticky=W)
# Creating labels that display the class variables,
# followed by the Entry widget for the PDF path or URL
Label(master, text="", textvariable=meta,
      bg="light grey").grid(row=3, column=1, sticky=W)
Label(master, text="", textvariable=url, bg="light grey").grid(
    row=4, column=1, sticky=W)
e1 = Entry(master, width=100)
e1.grid(row=0, column=1)
# creating a button using the widget
# Button that will call the get_info function
b = Button(master, text="Show", command=get_info, bg="Blue")
b.grid(row=0, column=2, columnspan=2, rowspan=2, padx=5, pady=5,)
# start the tkinter event loop
master.mainloop()
# this code belongs to Satyam kumar (ksatyam858)
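
A dictionary shown directly in a label appears as one long line. As an optional refinement, the values could be formatted as one "key: value" line each before being passed to the StringVar. This is a sketch only: format_for_label and its max_width parameter are hypothetical names and not part of tkinter or pdfx.

```python
def format_for_label(data, max_width=80):
    """Turn a dictionary into one 'key: value' line per entry for display."""
    lines = []
    for key, value in data.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(item) for item in value)
        line = f"{key}: {value}"
        if len(line) > max_width:
            line = line[: max_width - 3] + "..."  # shorten very long values
        lines.append(line)
    return "\n".join(lines)


print(format_for_label({"Producer": "Skia/PDF m85", "Pages": 2}))
```

Inside get_info, the result of format_for_label(pdf.get_metadata()) could be passed to meta.set() in place of the raw dictionary.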



Last Updated : 29 Dec, 2022