Extract hyperlinks from PDF in Python

Last Updated : 16 Oct, 2021

Prerequisite: PyPDF2, Regex

In this article, we are going to extract hyperlinks from a PDF in Python. This can be done in two ways:

Using PyPDF2
Using pdfx

Method 1: Using PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more.

Approach:

Read the PDF file and convert it into text
Get the URLs from the text using a regular expression
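
The second step can be tried on its own before any PDF is involved. The sketch below applies the same pattern used in Step 2 to a made-up sample string. The function name find_urls, the sample text, and the trimming of trailing punctuation are illustrative assumptions, not part of the original approach.

```python
import re

# Same pattern as in Step 2: "http://" or "https://" followed by non-whitespace
URL_PATTERN = re.compile(r"https?://\S+")

def find_urls(text):
    """Return URLs found in text, trimming punctuation that often trails a URL in prose."""
    # \S+ also swallows sentence punctuation, so strip it from the right-hand end
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]

# Hypothetical sample standing in for text extracted from a PDF page
sample = "Docs are at https://docs.python.org/, see also (https://www.geeksforgeeks.org/)."
print(find_urls(sample))
```

Trimming is a judgment call: a URL that legitimately ends in one of those characters would be shortened, so the set of stripped characters may need adjusting for a given document.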

Let’s implement this step by step:

Step 1: Open and read the PDF file.

Python3

import PyPDF2

# Path to the PDF file
file = "Enter PDF File Name"

# Open the file in binary mode; "with" closes it automatically
with open(file, 'rb') as pdfFileObject:

    # PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    for pageObject in pdfReader.pages:

        # Extract and print the text of each page
        pdf_text = pageObject.extract_text()
        print(pdf_text)

Output: the text of each page is printed to the console.

Step 2: Use a regular expression to find the URLs in the extracted text.

Python3

# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Regular expression (get URLs from a string)
def Find(string):

    # findall() returns every non-overlapping match:
    # "http://" or "https://" followed by non-whitespace characters
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Open the file in binary mode; "with" closes it automatically
with open(file, 'rb') as pdfFileObject:

    # PdfReader replaces PdfFileReader, which PyPDF2 3.0 removed
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    # Iterate through all pages
    for pageObject in pdfReader.pages:

        # Extract text from the page
        pdf_text = pageObject.extract_text()

        # Print all URLs found on this page
        print(Find(pdf_text))

Output:

['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
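
Text extraction only finds URLs that are written out on the page. A link attached to other text (for example the words "click here") is stored as a link annotation instead, so the regular expression never sees it. A minimal sketch of reading those annotations follows. It assumes each page behaves like a dictionary with an optional /Annots entry whose items expose get_object(), as PyPDF2 page objects do. The helper name uris_from_page and the StandInObject class are hypothetical and exist only so the sketch runs without a PDF file.

```python
class StandInObject(dict):
    """Stand-in for a PDF object so the sketch runs without a file (illustrative only)."""

    def get_object(self):
        return self


def uris_from_page(page):
    """Return the /URI targets of a page's annotations (hypothetical helper)."""
    uris = []
    if "/Annots" not in page:
        return uris
    for ref in page["/Annots"]:
        # Entries may be indirect references; get_object() resolves them
        annot = ref.get_object()
        # Link annotations carry an action dictionary (/A) with a /URI entry
        if "/A" in annot and "/URI" in annot["/A"]:
            uris.append(str(annot["/A"]["/URI"]))
    return uris


# Made-up page: one link annotation and one annotation without a URI
demo_page = {
    "/Annots": [
        StandInObject({"/Subtype": "/Link", "/A": {"/URI": "https://docs.python.org/"}}),
        StandInObject({"/Subtype": "/Text"}),
    ]
}
print(uris_from_page(demo_page))
```

With a real file, the same helper could presumably be called on each page in PyPDF2.PdfReader(file).pages. This has not been checked against every PDF producer, so treat it as a starting point.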

Method 2: Using pdfx.

In this method, we will use the pdfx module, which extracts references (such as URLs), metadata, and plain text from a local PDF file or a PDF URL. Install it with pip:

pip install pdfx

Below is the implementation:

Python3

# Import Module
import pdfx 
 
# Read PDF File
pdf = pdfx.PDFx("File Name") 
 
# Get the references as a dictionary (URLs are under the 'url' key)
print(pdf.get_references_as_dict())

Output:

{'url': ['https://www.geeksforgeeks.org/',
  'https://docs.python.org/',
  'https://pythonhosted.org/PyPDF2/',
  'GeeksforGeeks.org']}
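
The dictionary above mixes full URLs with bare names such as GeeksforGeeks.org. If only absolute web links are wanted, one option is to filter the 'url' list by scheme. The sketch below works on a plain dictionary shaped like the output shown, so it runs without pdfx. The helper name absolute_urls is an assumption for illustration.

```python
from urllib.parse import urlparse

def absolute_urls(references):
    """Keep only http(s) URLs from a references dict shaped like {'url': [...]}."""
    kept = []
    for candidate in references.get("url", []):
        parsed = urlparse(candidate)
        # Require an explicit web scheme and a host part
        if parsed.scheme in ("http", "https") and parsed.netloc:
            kept.append(candidate)
    return kept

# Dictionary copied from the output above
references = {'url': ['https://www.geeksforgeeks.org/',
                      'https://docs.python.org/',
                      'https://pythonhosted.org/PyPDF2/',
                      'GeeksforGeeks.org']}
print(absolute_urls(references))
```

Whether bare domain names should be dropped or normalised (for example by prefixing a scheme) depends on what the links are needed for.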
