Build an Application to extract URL and Metadata from a PDF using Python

The PDF (Portable Document Format) is a widely used, platform-independent file format developed by Adobe for presenting documents. Many Python packages work with PDFs; one of them is the pdfx module, which extracts URLs, metadata, and plain text from a local PDF file or a PDF URL.

Features:

  • Extract references and metadata from a given PDF.
  • Detect PDF, URL, arXiv, and DOI references.
  • Fast, parallel download of all referenced PDFs.
  • Check for broken links (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use a command-line tool or Python package.
  • Compatible with Python 2 and 3.
  • Works with local and online PDFs.

Getting Started:

First, install the pdfx module by running the following command in the terminal.

pip install pdfx

Approach:

  • Import the pdfx module.
  • Read the PDF file by creating a pdfx.PDFx object.
  • Get the metadata with the get_metadata() method.
  • Get the URLs with the get_references_as_dict() method.
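
The steps above can be combined into one small helper. This is a minimal sketch, not part of pdfx itself: the function name extract_pdf_info and its reader parameter are assumptions made for illustration, while pdfx.PDFx, get_metadata() and get_references_as_dict() are the calls used in this article.

```python
def extract_pdf_info(source, reader=None):
    """Return (metadata, references) for a local PDF path or a PDF URL.

    `reader` is a hypothetical hook that lets a stand-in class be used
    instead of pdfx.PDFx (handy for testing without a real PDF).
    """
    if reader is None:
        # Imported lazily so this helper can be defined without pdfx installed.
        import pdfx
        reader = pdfx.PDFx

    pdf = reader(source)
    return pdf.get_metadata(), pdf.get_references_as_dict()


# Example call (assumes pdfx is installed and the file exists):
# metadata, references = extract_pdf_info("geeksforgeeks.pdf")
```

Keeping the import inside the function is only a convenience for this sketch; in a regular script a top-level import pdfx is just as good.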

Implementation:



Step 1: Importing modules and reading PDF files.

Python3

# import module
import pdfx
  
# reading pdf file
pdf = pdfx.PDFx("geeksforgeeks.pdf")
  
# display
print(pdf)

Output:

<pdfx.PDFx at 0x1c189244a88>

This shows that a pdfx.PDFx object was created. The hexadecimal value (0x1c189244a88 here) is the object's memory address, so it will differ from run to run.

Step 2: Getting metadata from PDF.

Python3

pdf.get_metadata()

Output:

{'Creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36',
 'Producer': 'Skia/PDF m85',
 'CreationDate': "D:20200911041438+00'00'",
 'ModDate': "D:20200911041438+00'00'",
 'Pages': 2}
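
The CreationDate and ModDate values above are date strings in the PDF convention (D:YYYYMMDDHHmmSS followed by a UTC offset). As a minimal sketch, they can be turned into Python datetime objects with the standard library. The helper name parse_pdf_date is an assumption for illustration and is not part of pdfx; it only handles full-length dates like the ones shown here and raises ValueError for anything else.

```python
import re
from datetime import datetime, timedelta, timezone

# Full-length PDF date: D:YYYYMMDDHHmmSS, optionally followed by Z or +HH'mm'
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})'?)?$"
)


def parse_pdf_date(value):
    """Convert a string such as "D:20200911041438+00'00'" to a datetime."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported PDF date: {value!r}")

    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    zulu, sign, off_hours, off_minutes = match.groups()[6:]

    if sign:
        offset = timedelta(hours=int(off_hours), minutes=int(off_minutes))
        tz = timezone(-offset if sign == "-" else offset)
    elif zulu:
        tz = timezone.utc
    else:
        tz = None  # no offset given: return a naive datetime

    return datetime(year, month, day, hour, minute, second, tzinfo=tz)


created = parse_pdf_date("D:20200911041438+00'00'")
print(created.isoformat())  # 2020-09-11T04:14:38+00:00
```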



Step 3: Getting the URLs from the PDF.

Python3

pdf.get_references_as_dict()

Output:

{'url': ['https://www.geeksforgeeks.org/cookie-policy/',
  'https://www.geeksforgeeks.org/privacy-policy/',
  'https://www.geeksforgeeks.org/',
  'https://www.geeksforgeeks.org/optparse-module-in-python/']}
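
The returned dictionary maps a reference type (here 'url') to a list of references, so ordinary dictionary and list operations are enough to post-process it. The sketch below assumes the shape shown in the output above (string keys, lists of strings); the helper names list_references and count_url_hosts are illustrative and not part of pdfx.

```python
from urllib.parse import urlparse


def list_references(references):
    """Flatten {'type': [ref, ...]} into sorted, de-duplicated (type, ref) pairs."""
    pairs = set()
    for ref_type, items in references.items():
        for item in items:
            pairs.add((ref_type, item))
    return sorted(pairs)


def count_url_hosts(references):
    """Count how many URL references point at each host name."""
    counts = {}
    for url in references.get("url", []):
        host = urlparse(url).netloc
        counts[host] = counts.get(host, 0) + 1
    return counts


# Sample data copied from the output above
references = {
    "url": [
        "https://www.geeksforgeeks.org/cookie-policy/",
        "https://www.geeksforgeeks.org/privacy-policy/",
        "https://www.geeksforgeeks.org/",
        "https://www.geeksforgeeks.org/optparse-module-in-python/",
    ]
}

for ref_type, ref in list_references(references):
    print(ref_type, ref)

print(count_url_hosts(references))  # {'www.geeksforgeeks.org': 4}
```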

Application to extract URL and Metadata from a PDF with tkinter: the script below wraps the above approach in a graphical user interface.

Python3

# import modules
from tkinter import *
import pdfx
  
  
# user-defined function: read the entry box and show the PDF's info
def get_info():
  
    pdf = pdfx.PDFx(str(e1.get()))
    meta.set(pdf.get_metadata())
    url.set(pdf.get_references_as_dict())
  
  
# create the main window
# and set its background to light grey
master = Tk()
master.configure(bg='light grey')
  
  
# Variable Classes in tkinter
meta = StringVar()
url = StringVar()
  
  
# Creating label for each information
# name using widget Label
Label(master, text="PDF or PDF-URL : ", bg="light grey").grid(row=0, sticky=W)
Label(master, text="Meta information :", bg="light grey").grid(row=3, sticky=W)
Label(master, text="URL information :", bg="light grey").grid(row=4, sticky=W)
  
  
# Creating labels bound to the StringVar
# variables using widget Label
Label(master, text="", textvariable=meta,
      bg="light grey").grid(row=3, column=1, sticky=W)
Label(master, text="", textvariable=url, bg="light grey").grid(
    row=4, column=1, sticky=W)
  
  
e1 = Entry(master, width=100)
e1.grid(row=0, column=1)
  
  
# creating a button using the widget
# Button that will call the get_info function
b = Button(master, text="Show", command=get_info, bg="Blue")
b.grid(row=0, column=2, columnspan=2, rowspan=2, padx=5, pady=5)
  
  
mainloop()
  
# this code belongs to Satyam kumar (ksatyam858)
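
In the script above, the two dictionaries are passed to StringVar.set() directly, so each label shows the raw dictionary text on a single long line. One possible refinement, sketched here under the assumption that the dictionaries have the shapes shown in Steps 2 and 3, is to format them as multi-line strings first. The helper names format_metadata and format_references are hypothetical and not part of pdfx or tkinter.

```python
def format_metadata(metadata):
    """Render a metadata dict as one 'key: value' pair per line."""
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


def format_references(references):
    """Render a references dict as one '[type] reference' entry per line."""
    lines = []
    for ref_type, items in references.items():
        for item in items:
            lines.append(f"[{ref_type}] {item}")
    return "\n".join(lines) if lines else "No references found"


# Inside get_info() the formatted strings could replace the raw dicts:
#     meta.set(format_metadata(pdf.get_metadata()))
#     url.set(format_references(pdf.get_references_as_dict()))
print(format_metadata({"Producer": "Skia/PDF m85", "Pages": 2}))
print(format_references({"url": ["https://www.geeksforgeeks.org/"]}))
```

With multi-line text, passing justify=LEFT to the two result labels keeps the lines left-aligned.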
