Build an Application to extract URL and Metadata from a PDF using Python
  • Last Updated : 29 Dec, 2020

PDF (Portable Document Format) is a widely used, platform-independent file format developed by Adobe for presenting documents. Python has many PDF-related packages; one of them is the pdfx module, which extracts URLs, metadata, and plain text from a given PDF file or PDF URL.

Features:

  • Extract references and metadata from a given PDF.
  • Detects PDF, URL, arXiv, and DOI references.
  • Fast, parallel download of all referenced PDFs.
  • Check for broken links (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use as a command-line tool or as a Python package.
  • Compatible with Python 2 and 3.
  • Works with local and online PDFs.

Getting Started:

First, install the pdfx module by running the following command in the terminal.

pip install pdfx

Approach:

  • Import the pdfx module.
  • Read the PDF file with the pdfx.PDFx() constructor.
  • Get the metadata with the get_metadata() method.
  • Get the URLs with the get_references_as_dict() method.
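
The steps above can be combined into one small helper. This is only a minimal sketch: the function names `summarize_pdf` and `load_and_summarize` are hypothetical and not part of pdfx; the only pdfx calls used are the ones listed in the approach.

```python
def summarize_pdf(pdf):
    """Collect metadata and references from an object that provides
    get_metadata() and get_references_as_dict(), such as pdfx.PDFx."""
    return {
        "metadata": pdf.get_metadata(),
        "references": pdf.get_references_as_dict(),
    }


def load_and_summarize(path_or_url):
    """Open a local PDF or a PDF URL with pdfx and summarize it."""
    # Imported lazily so that summarize_pdf() can be used without pdfx.
    import pdfx

    return summarize_pdf(pdfx.PDFx(path_or_url))
```

Calling `load_and_summarize("geeksforgeeks.pdf")` would return a single dictionary holding both results, assuming the file exists and pdfx is installed.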

Implementation:



Step 1: Importing modules and reading PDF files.

Python3




# import module
import pdfx
  
# reading pdf file
pdf = pdfx.PDFx("geeksforgeeks.pdf")
  
# display
print(pdf)

Output:

<pdfx.PDFx at 0x1c189244a88>

This means a pdfx.PDFx object was created; 0x1c189244a88 is its address in memory, so the value will differ from run to run.

Step 2: Getting metadata from PDF.

Python3




pdf.get_metadata()

Output:

{'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
 'Producer': 'Skia/PDF m85',
 'CreationDate': "D:20200911041438+00'00'",
 'ModDate': "D:20200911041438+00'00'",
 'Pages': 2}
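
The CreationDate and ModDate values are PDF date strings rather than Python datetime objects. The sketch below shows one way to convert them with the standard library. The helper name `parse_pdf_date` is hypothetical, and the code assumes the full `D:YYYYMMDDHHmmSS` form with an optional UTC offset, as in the output above; shorter date forms would need extra handling.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-length PDF date, optionally followed by "Z" or an offset like +05'30'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:([+-])(\d{2})'?(\d{2})'?|Z)?$"
)


def parse_pdf_date(value):
    """Convert a string such as D:20200911041438+00'00' to a datetime."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")

    groups = match.groups()
    year, month, day, hour, minute, second = (int(g) for g in groups[:6])
    sign, off_hours, off_minutes = groups[6:]

    if sign is None:
        # Assumption: "Z" or a missing offset is treated as UTC.
        offset = timedelta(0)
    else:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        if sign == "-":
            offset = -offset

    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))


created = parse_pdf_date("D:20200911041438+00'00'")
print(created.isoformat())  # 2020-09-11T04:14:38+00:00
```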



Step 3: Getting the URLs from the PDF.

Python3




pdf.get_references_as_dict()

Output:

{'url': ['https://www.geeksforgeeks.org/cookie-policy/',
  'https://www.geeksforgeeks.org/privacy-policy/',
  'https://www.geeksforgeeks.org/',
  'https://www.geeksforgeeks.org/optparse-module-in-python/']}
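
Because the result is an ordinary dictionary, it can be post-processed with the standard library. The following sketch removes duplicate URLs and groups them by host. The helper names `unique_urls` and `group_by_host` are hypothetical, and the input is assumed to have the same shape as the output above.

```python
from urllib.parse import urlparse


def unique_urls(references):
    """Return the 'url' entries without duplicates, keeping their order."""
    return list(dict.fromkeys(references.get("url", [])))


def group_by_host(urls):
    """Map each host name to the list of URLs found for it."""
    grouped = {}
    for url in urls:
        grouped.setdefault(urlparse(url).netloc, []).append(url)
    return grouped


# Same shape as the dictionary returned in Step 3
references = {
    "url": [
        "https://www.geeksforgeeks.org/cookie-policy/",
        "https://www.geeksforgeeks.org/privacy-policy/",
        "https://www.geeksforgeeks.org/",
        "https://www.geeksforgeeks.org/optparse-module-in-python/",
    ]
}

hosts = group_by_host(unique_urls(references))
print(list(hosts))  # ['www.geeksforgeeks.org']
```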

Application to extract URLs and metadata from a PDF with tkinter: the script below wraps the above approach in a graphical user interface.

Python3




# import modules
from tkinter import *
import pdfx
  
  
# user-defined function that reads the PDF and fills in the labels
def get_info():
  
    pdf = pdfx.PDFx(str(e1.get()))
    meta.set(pdf.get_metadata())
    url.set(pdf.get_references_as_dict())
  
  
# main tkinter window
# with a light grey background
master = Tk()
master.configure(bg='light grey')
  
  
# Variable Classes in tkinter
meta = StringVar()
url = StringVar()
  
  
# Creating a label for each field name
# using the Label widget
Label(master, text="PDF or PDF-URL : ", bg="light grey").grid(row=0, sticky=W)
Label(master, text="Meta information :", bg="light grey").grid(row=3, sticky=W)
Label(master, text="URL information :", bg="light grey").grid(row=4, sticky=W)
  
  
# Creating labels bound to the StringVar variables
# using the Label widget
Label(master, text="", textvariable=meta,
      bg="light grey").grid(row=3, column=1, sticky=W)
Label(master, text="", textvariable=url, bg="light grey").grid(
    row=4, column=1, sticky=W)
  
  
e1 = Entry(master, width=100)
e1.grid(row=0, column=1)
  
  
# creating a button using the Button widget
# that will call the get_info function
b = Button(master, text="Show", command=get_info, bg="Blue")
b.grid(row=0, column=2, columnspan=2, rowspan=2, padx=5, pady=5,)
  
  
mainloop()
  
# this code belongs to Satyam kumar (ksatyam858)
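
In the script above, the dictionaries are passed directly to StringVar.set(), so each label shows the raw dictionary text on a single line. As a possible refinement, the values could first be turned into multi-line strings. The helper names `format_metadata` and `format_references` below are hypothetical and only illustrate the idea.

```python
def format_metadata(metadata):
    """Render a metadata dictionary as one 'key: value' pair per line."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


def format_references(references):
    """Render a references dictionary as a heading per type, one item per line."""
    lines = []
    for kind, items in references.items():
        lines.append(f"{kind}:")
        lines.extend(f"  {item}" for item in items)
    return "\n".join(lines)


print(format_metadata({"Producer": "Skia/PDF m85", "Pages": 2}))
# Producer: Skia/PDF m85
# Pages: 2
```

Inside get_info(), the calls would then become meta.set(format_metadata(pdf.get_metadata())) and url.set(format_references(pdf.get_references_as_dict())). Passing justify=LEFT to the two result labels would likely help the multi-line text align.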
