Extract hyperlinks from PDF in Python
Last Updated: 16 Oct, 2021
Prerequisite: PyPDF2, Regex
This article shows two ways to extract hyperlinks from a PDF file in Python:
Method 1: Using PyPDF2.
PyPDF2 is a Python library built as a PDF toolkit. It can extract document information, page text, and more.
Approach:
- Read the PDF file and convert each page to text
- Find the URLs in that text using a regular expression
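Before touching a PDF, it can help to see what the regular expression in the second step matches on an ordinary string. The sketch below is illustrative only: the sample text and the helper name find_urls are assumptions made for this example, not part of the original code.

```python
import re

# Same pattern the article uses later: "http://" or "https://"
# followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


# Hypothetical sample text standing in for text extracted from a PDF page.
sample = "Docs: https://docs.python.org/ and http://example.com/page also www.example.org"
print(find_urls(sample))
```

Note that the last item in the sample has no http:// or https:// prefix, so this pattern does not report it.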
Let’s implement this step by step:
Step 1: Open and read the PDF file.
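Python3
import PyPDF2

# Path to the PDF file to read
file = "Enter PDF File Name"

# Open the file in binary mode and create a reader object.
# Note: PyPDF2 3.0+ replaces these names with PdfReader,
# reader.pages and page.extract_text().
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

# Extract and print the text of every page
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)
    pdf_text = pageObject.extractText()
    print(pdf_text)
pdfFileObject.close()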
Output: the text extracted from each page of the PDF, printed one page at a time.
Step 2: Use a regular expression to find the URLs in the extracted text.
Python3
import PyPDF2
import re

file = "Enter PDF File Name"
pdfFileObject = open(file, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

def Find(string):
    # Match "http://" or "https://" followed by non-whitespace characters
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Print the list of URLs found on each page
for page_number in range(pdfReader.numPages):
    pageObject = pdfReader.getPage(page_number)
    pdf_text = pageObject.extractText()
    print(Find(pdf_text))
pdfFileObject.close()
Output:
['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
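Because `\S+` matches everything up to the next whitespace, a URL that ends a sentence or sits inside brackets can pick up trailing punctuation, and the same link may be reported on more than one page. A minimal sketch of one possible cleanup step follows. The helper name clean_urls and the set of stripped characters are assumptions chosen for illustration, and stripping can occasionally remove a character that really belongs to the URL (for example a closing parenthesis).

```python
def clean_urls(urls):
    """Strip common trailing punctuation and drop duplicates, keeping order."""
    seen = set()
    cleaned = []
    for url in urls:
        # Assumed set of characters that rarely end a real URL.
        url = url.rstrip(".,;:!?)]}>'\"")
        if url and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


# Hypothetical matches collected from several pages.
raw = [
    "https://docs.python.org/.",
    "https://www.geeksforgeeks.org/),",
    "https://docs.python.org/",
]
print(clean_urls(raw))
```

Collecting the per-page lists into one list and passing it through a helper like this gives a single de-duplicated result for the whole document.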
Method 2: Using pdfx.
In this method, we will use the pdfx module, which extracts references (including URLs), metadata, and plain text from a local PDF file or a PDF URL. Install it with pip:
pip install pdfx
Below is the implementation:
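Python3
import pdfx

# Open the PDF (a local path or a URL) and collect its references
pdf = pdfx.PDFx("File Name")

# Print the references grouped by type, e.g. {'url': [...]}
print(pdf.get_references_as_dict())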
Output:
{'url': ['https://www.geeksforgeeks.org/',
'https://docs.python.org/',
'https://pythonhosted.org/PyPDF2/',
'GeeksforGeeks.org']}
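The dictionary above groups references by type, and its 'url' list can include entries without a scheme, such as 'GeeksforGeeks.org'. If only full web links are wanted, a small filter can be applied to the returned dictionary. This is a sketch under stated assumptions: the helper name http_links is hypothetical, and the sample dictionary simply copies the output shown above instead of calling pdfx.

```python
def http_links(references):
    """Return the entries under 'url' that start with http:// or https://."""
    urls = references.get("url", [])
    return [u for u in urls if u.lower().startswith(("http://", "https://"))]


# Copy of the output shown above, used here in place of a real PDF.
refs = {
    "url": [
        "https://www.geeksforgeeks.org/",
        "https://docs.python.org/",
        "https://pythonhosted.org/PyPDF2/",
        "GeeksforGeeks.org",
    ]
}
print(http_links(refs))
```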