Scrapy – Link Extractors

Last Updated : 09 Oct, 2022

In this article, we are going to learn about Link Extractors in Scrapy. “LinkExtractor” is a class provided by Scrapy to extract links from the response we get while fetching a website. It is very easy to use, as we will see below.

Scrapy – Link Extractors

Using the “LinkExtractor” class of Scrapy, we can find all the links present on a webpage and fetch them very easily. First, we need to install the scrapy module (if not installed yet) by running the following command in the terminal:

pip install scrapy
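If you want to confirm that the installation worked, you can print the installed Scrapy version with:

scrapy version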

Link Extractor class of Scrapy 

Scrapy provides the class “scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor” for extracting links from a response object. For convenience, Scrapy also exposes it as “scrapy.linkextractors.LinkExtractor“.

First, we need to import the LinkExtractor. There are several ways to import and use the LinkExtractor class; one of them is the following:

from scrapy.linkextractors import LinkExtractor

To use the “LinkExtractor” class, you need to create an object as given below:

link_ext = LinkExtractor(arguments) 
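The constructor accepts optional filtering arguments such as “allow”, “deny” and “unique”. A minimal sketch (the regular expressions below are purely illustrative, not required values):

Python3

from scrapy.linkextractors import LinkExtractor

# keep only links whose URL matches a pattern in "allow",
# drop links whose URL matches a pattern in "deny",
# and filter out duplicate links with unique=True
link_ext = LinkExtractor(
    allow=(r"/tag/",),
    deny=(r"/login",),
    unique=True,
)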

Now that we have created an object, we can fetch links with the “extract_links” method of the LinkExtractor class. For that, run the code below:

links = link_ext.extract_links(response)

The links fetched are returned as a list of “scrapy.link.Link” objects. The attributes of a link object are (see the sketch after this list):

  1. url : URL of the fetched link.
  2. text : the text used in the anchor tag of the link.
  3. fragment : the part of the URL after the hash (#) symbol.
  4. nofollow : whether the “rel” attribute of the anchor tag is “nofollow” or not.
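A minimal sketch of reading these attributes, assuming “links” is the list returned by “extract_links” above:

Python3

# iterating over the extracted Link objects
for link in links:
    print(link.url)       # absolute URL of the link
    print(link.text)      # anchor text of the link
    print(link.fragment)  # part of the URL after the '#' symbol
    print(link.nofollow)  # True if the anchor's rel attribute is "nofollow"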

Stepwise Implementation

Step 1: Creating a spider

A spider is basically a class in Scrapy that sends requests to a particular website and receives the responses. The code for creating a spider is as follows:

Python3




# importing the LinkExtractor
import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class MySpider(scrapy.Spider):
      
    # you can name anything you want
    name = "MySpider"
      
    # urls to be fetched
    start_urls = []


Here we first imported the scrapy module and, from it, the “LinkExtractor” class. Then we created a class named “MySpider” that inherits from the “scrapy.Spider” class.

Then we created two class variables, namely “name” and “start_urls“.

  • name : the name you want to give to the spider
  • start_urls: all the URLs which need to be fetched are given here. 

Then those “start_urls” are fetched and the “parse” method is run on the response obtained from each of them, one by one. This is done automatically by Scrapy.

Step 2: Creating the LinkExtractor object and Yielding results

You can create an instance of the “LinkExtractor” class anywhere you want.

Here, let us create an instance of the class in the “parse” method itself.

Python3




# Parse Method
def parse(self, response):
    
    # creating the instance of LinkExtractor
    # class
    link_extractor = LinkExtractor()
      
    # extracting links (returns List of links)
    links = link_extractor.extract_links(response)
  
    # Yielding results
    for link in links:
        
        # parameters of link : url, text,
        # fragment, nofollow
        # example yield output
        yield {"url": link.url, "text": link.text}


Finally, the full code is:

Python3




import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class MySpider(scrapy.Spider):
    name = "MySpider"
      
    # urls to be fetched
    start_urls = []
  
    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)
          
        for link in links:
              
            # parameters of link : url, text, 
            # fragment, nofollow
            yield {"url": link.url, "text": link.text}


Step 3: Running the code

Now we can run the spider and store the desired result in a JSON file (or any other format supported by Scrapy):

scrapy runspider <python-file> -o <output-file-name> 
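For example, for a hypothetical spider saved in “my_spider.py”, the links could be exported to CSV instead of JSON simply by changing the output file extension:

scrapy runspider my_spider.py -o links.csv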

Link Extractors using Scrapy

Example 1:

Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named “quotes.json”:

Python3




# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class QuoteSpider(scrapy.Spider):
    name = "OuoteSpider"
    start_urls = ["https://quotes.toscrape.com/"]
  
    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)
  
        for link in links:
            yield {"url": link.url, "text": link.text}


To run the above code, we run the following command:

scrapy runspider scrapy_link_extractor.py -o quotes.json

Output:

scrapy link extractor example 1

Example 2:

This time, let us fetch all the links from the webpage https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/.

Let us create the instance of the “LinkExtractor” class in the constructor of our spider this time, and also yield the “nofollow” attribute of the link object. Let us also set the “unique” parameter of “LinkExtractor” to “True” so that we fetch unique links only.

Python3




import scrapy
from scrapy.linkextractors import LinkExtractor
  
  
class GeeksForGeeksSpider(scrapy.Spider):
    name = "GeeksForGeeksSpider"
    start_urls = [
        "https://www.geeksforgeeks.org/email-id-extractor-\
        project-from-sites-in-scrapy-python/"]
  
    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
  
        self.link_extractor = LinkExtractor(unique=True)
  
    def parse(self, response):
        links = self.link_extractor.extract_links(response)
  
        for link in links:
            yield {"nofollow": link.nofollow, "url": link.url, "text": link.text}


To run the above code, we run the following command in the terminal:

scrapy runspider scrapy_link_extractor.py -o geeksforgeeks.json

Output:

output of scrapy link extractor example 2


