Extract hyperlinks from PDF in Python

  • Last Updated : 13 Jan, 2021

Prerequisite: PyPDF2, Regex

In this article, we are going to extract hyperlinks from a PDF file in Python. This can be done in two ways:

  • Using PyPDF2
  • Using pdfx

Method 1: Using PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more.

Approach:

  • Read the PDF file and convert it into text
  • Get the URLs from the text using a regular expression

Let’s implement this approach step by step:

Step 1: Open and Read the PDF file.

Python3




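import PyPDF2

# Path to the PDF file to read
file = "Enter PDF File Name"

# Open the file in binary mode; the with block closes it automatically
with open(file, 'rb') as pdfFileObject:

    # PdfFileReader, numPages, getPage() and extractText() are the
    # legacy PyPDF2 names (PyPDF2 3.0 replaced them with PdfReader,
    # len(reader.pages), reader.pages[i] and extract_text())
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

    # Print the text of every page
    for page_number in range(pdfReader.numPages):
        pageObject = pdfReader.getPage(page_number)
        pdf_text = pageObject.extractText()
        print(pdf_text)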

Output: the extracted text of each page is printed to the console.

Step 2: Use a regular expression to find the URLs in the extracted text.

Python3




# Import modules
import PyPDF2
import re

# Enter file name
file = "Enter PDF File Name"

# Regular expression (get URLs from a string)
def Find(string):

    # findall() returns every non-overlapping match:
    # "http://" or "https://" followed by non-whitespace characters
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Open the file in binary mode; the with block closes it when done
with open(file, 'rb') as pdfFileObject:

    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

    # Iterate through all pages
    for page_number in range(pdfReader.numPages):

        pageObject = pdfReader.getPage(page_number)

        # Extract text from page
        pdf_text = pageObject.extractText()

        # Print all URLs found on this page
        print(Find(pdf_text))

Output:



['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
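Because the pattern `https?://\S+` matches everything up to the next whitespace, punctuation that follows a URL in running text (a comma, a closing bracket, a full stop) ends up inside the match, and the same link may be reported once per page. The following is a minimal sketch of one way to tidy the results, not part of the method above: the helper name `find_clean_urls`, the set of trailing characters, and the sample page texts are all assumptions made for illustration.

```python
import re

# Same pattern as in Step 2: scheme followed by non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

# Characters that often trail a URL in running text (an assumption;
# adjust the set for your documents)
TRAILING = ".,;:!?)]}>\"'"


def find_clean_urls(pages_text):
    """Return unique URLs from a list of page texts, in first-seen order."""
    seen = set()
    result = []
    for text in pages_text:
        for match in URL_PATTERN.findall(text):
            # Drop punctuation stuck to the end of the match
            url = match.rstrip(TRAILING)
            if url and url not in seen:
                seen.add(url)
                result.append(url)
    return result


# Hypothetical page texts standing in for extractText() output
sample_pages = [
    "See https://docs.python.org/, and (https://www.geeksforgeeks.org/).",
    "Again: https://docs.python.org/",
]
print(find_clean_urls(sample_pages))
```

Note that this simple stripping would also shorten a URL that legitimately ends in one of those characters, such as a closing bracket, so the character set is a trade-off rather than a rule.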

Method 2: Using pdfx.

In this method, we will use the pdfx module, which extracts references (such as URLs), metadata, and plain text from a given PDF file or PDF URL. Install it with pip:

pip install pdfx

Below is the implementation:

Python3




# Import Module
import pdfx

# Read PDF File
pdf = pdfx.PDFx("File Name")

# Get the references (URLs and similar) as a dictionary
print(pdf.get_references_as_dict())

Output:

{'url': ['https://www.geeksforgeeks.org/',
  'https://docs.python.org/',
  'https://pythonhosted.org/PyPDF2/',
  'GeeksforGeeks.org']}
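In the sample output, the links sit under the 'url' key, and one entry ('GeeksforGeeks.org') has no scheme. If only absolute links are wanted, the list can be filtered afterwards. The sketch below works on a plain dictionary copied from that sample output, so it runs without pdfx installed; the helper name `absolute_urls` is hypothetical, and the choice to keep only http and https entries is an assumption.

```python
from urllib.parse import urlparse

# Copied from the sample output; in practice this would be the
# value returned by pdf.get_references_as_dict()
references = {
    'url': ['https://www.geeksforgeeks.org/',
            'https://docs.python.org/',
            'https://pythonhosted.org/PyPDF2/',
            'GeeksforGeeks.org'],
}


def absolute_urls(refs):
    """Keep only 'url' entries that have an http or https scheme."""
    # .get() with a default avoids a KeyError when no URLs were found
    return [u for u in refs.get('url', [])
            if urlparse(u).scheme in ('http', 'https')]


print(absolute_urls(references))
```

With the sample dictionary this keeps the three entries that start with https:// and drops 'GeeksforGeeks.org'.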
