Extract All the URLs from a Webpage Using R Language
In this article, we will learn how to scrape all the URLs from a webpage using the R programming language.
To scrape URLs, we will use the httr and XML libraries: httr to make HTTP requests, and XML to identify URLs through their HTML tags.
- The httr library is used to make HTTP requests in R; it provides a wrapper for the curl package.
- The XML library is used for working with XML files and XML tags.
Installation:
install.packages("httr")
install.packages("XML")
After installing the required packages, we import the httr and XML libraries and store the URL of the site in a variable. We then use GET() from the httr package to make the HTTP request. The response is raw data, so we convert it to HTML format with htmlParse().
We have now retrieved the page's HTML, but we only need the URLs. To extract them, we pass the parsed HTML to xpathSApply() together with the tag we are interested in, so that we get everything related to that tag. URLs are declared in the href attribute of the <a> tag, so that is the attribute we read.
Note: You need not use install.packages() if you have already installed the package once.
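As a sketch of the note above, installation can be skipped automatically when a package is already present. The helper name missing_packages and the CRAN mirror passed as repos are assumptions made for illustration, not part of the original steps.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install only what is missing; the mirror URL is an assumed choice.
to_install <- missing_packages(c("httr", "XML"))
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}
```

If both packages are already installed, to_install is empty and install.packages() is never called.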
Stepwise Implementation
Step 1: Installing libraries:
R
# installing packages
install.packages("httr")
install.packages("XML")
Step 2: Import libraries:
R
# importing packages
library(httr)
library(XML)
Step 3: Making HTTP requests:
In this step, we will pass our URL in GET() to request site data and store the returned data in the resource variable.
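R
# storing request url in url variable
url <- "https://www.geeksforgeeks.org"
# making http request
resource <- GET(url)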
Step 4: Parse site data in HTML format:
In this step, we parse the data to HTML using htmlParse().
R
# parsing data to html format
parse <- htmlParse(resource)
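Steps 3 and 4 can also be combined into a slightly more defensive helper. The sketch below is one possible variant, not the method used in this article: it stops early if the request failed, and it passes the response body to htmlParse() as text. The function name fetch_html and the UTF-8 encoding are assumptions.

```r
library(httr)
library(XML)

# Hypothetical helper: fetch a page and return a parsed HTML document.
fetch_html <- function(url) {
  resource <- GET(url)
  # stop_for_status() raises an R error for 4xx/5xx responses
  stop_for_status(resource)
  # extract the response body as text (UTF-8 is assumed here)
  body <- content(resource, as = "text", encoding = "UTF-8")
  # parse the text as HTML
  htmlParse(body, asText = TRUE)
}

# Usage (requires network access):
# parse <- fetch_html("https://www.geeksforgeeks.org")
```

Checking the status first means a failed request surfaces as a clear error instead of an error page being parsed for links.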
Step 5: Identify URLs and print them:
In this step, we use xpathSApply() to locate the URLs.
R
# scraping all the href attributes
links <- xpathSApply(parse, path = "//a", xmlGetAttr, "href")
# printing links
print(links)
The <a> tag defines a hyperlink, and its URL is stored in the href attribute.
<a href="url"></a>
So xpathSApply() finds all the <a> tags and extracts the link stored in each href attribute. We then store all the URLs in a variable and print it.
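To see the extraction step in isolation, here is a minimal sketch that runs on a small inline HTML string, so no network access is needed. The HTML and the example.com URLs are made up for illustration. It uses the XPath expression //a[@href] instead of //a, which is a choice made for this sketch: when a page contains an <a> tag without an href, xmlGetAttr() returns NULL for it and the result may come back as a list rather than a character vector.

```r
library(XML)

# A small inline page (made up); the second <a> has no href attribute
html <- '<html><body>
  <a href="https://example.com/a">A</a>
  <a name="top">anchor without href</a>
  <a href="/about">About</a>
  <a href="https://example.com/a">A again</a>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# "//a[@href]" selects only <a> elements that carry an href attribute
links <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")

# drop repeated links
links <- unique(links)
print(links)
```

Note that relative links such as /about are returned exactly as they appear in the page; they are not resolved against the site's address.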
Example:
R
# installing packages
install.packages("httr")
install.packages("XML")
# importing packages
library(httr)
library(XML)
# storing request url in url variable
url <- "https://www.geeksforgeeks.org"
# making http request
resource <- GET(url)
# converting all the data to HTML format
parse <- htmlParse(resource)
# scraping all the href attributes
links <- xpathSApply(parse, path = "//a", xmlGetAttr, "href")
# printing links
print(links)
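Output:
Running the script prints the href value of each <a> tag found on the page. The exact list depends on the page's content at the time of the request.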