The Complete Guide to Proxies For Web Scraping

  • Last Updated : 13 Jul, 2021

In computer networking, a proxy server is a server application or appliance that acts as an intermediary for requests from clients seeking resources from servers that provide those resources. 

Web scraping typically sends many requests to a server from a single IP address, so the server may notice the unusual volume and block that address to stop further scraping. Routing requests through proxies avoids this: the requests appear to come from different IP addresses, so a block on one address does not stop the scraper. Proxies also hide the scraping machine's own IP address, which provides a degree of anonymity.

Proxy Types

There are three types of proxies.

  • Data Center Proxy: These proxies come from cloud service providers. Because many people use them, their IP addresses are sometimes flagged, but they are cheap, so a pool of them can be bought for web scraping activities.
  • Residential IP Proxy: These proxies use IP addresses assigned by local ISPs, so it is hard for the webmaster to tell whether a visitor is a scraper or a real person browsing the website. They are very expensive compared to Data Center Proxies and may raise legal concerns, because the owner of the IP address may not be fully aware that it is being used for web scraping.
  • Mobile IP Proxy: These proxies use the IP addresses of private mobile devices, which are assigned by mobile network operators, and work similarly to Residential IP Proxies. They are very expensive and may raise legal concerns, because the device owner may not be fully aware that their mobile connection is being used for web scraping.

Managing Proxy Pool

  • Identify Bans – The proxy solution should detect the various ways a website blocks scrapers (captchas, redirects, blocks, ghosting, etc.) so that the underlying problem can be fixed.
  • Retry Errors – If the current proxy runs into connection problems, blocks, captchas, etc., retry the request through a different proxy server.
  • Control Proxies – Some websites that require authentication expect the session to stay on the same IP address; if the proxy server changes, you may have to authenticate again.
  • Adding Delays – Randomize delays and throttle requests so that the traffic is harder to identify as scraping.
  • Geographical Location – Some websites only serve IPs from specific countries, so the proxy pool should include proxies from the required geolocation.
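
The ban detection, retry, and delay points above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production pool manager: the names `looks_blocked` and `fetch_with_retries`, the status codes treated as blocks, and the "captcha" marker are all chosen for the example, and `fetch` is a hypothetical callable (for instance a thin wrapper around `requests.get`) that takes a URL and a proxy and returns a status code and a body.

```python
import random
import time

# Assumed signals of a block; real sites vary, so tune these per target.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha",)


def looks_blocked(status_code, body):
    """Heuristic ban check: a blocking status code or a captcha marker in the body."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def fetch_with_retries(url, proxy_pool, fetch, max_attempts=3,
                       delay_range=(1.0, 3.0), sleep=time.sleep):
    """Try up to max_attempts different proxies, pausing randomly between tries.

    fetch(url, proxy) is a hypothetical callable that returns
    (status_code, body) and raises OSError on connection problems.
    Returns (proxy, body) on success, or (None, None) if every attempt failed.
    """
    pool = list(proxy_pool)
    candidates = random.sample(pool, k=min(max_attempts, len(pool)))
    for attempt, proxy in enumerate(candidates):
        if attempt > 0:
            # Randomized delay so retries do not arrive at a fixed rhythm.
            sleep(random.uniform(*delay_range))
        try:
            status_code, body = fetch(url, proxy)
        except OSError:
            continue  # connection problem: move on to another proxy
        if not looks_blocked(status_code, body):
            return proxy, body
    return None, None
```

Passing `fetch` and `sleep` in as arguments keeps the retry logic independent of any particular HTTP library and makes it easy to exercise without touching the network.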

Public proxies are not recommended: they are usually of low quality, and they can be dangerous, since a malicious proxy can infect the machine or expose the scraping traffic if SSL certificates are not configured properly.

Shared proxies are generally used when the budget is low and a shared pool of IPs is sufficient. If the budget is higher and performance is the top priority, a dedicated pool of proxies is the way to go.



Proxy Rotation

Sending too many requests from a single IP address is a clear sign of automated HTTP/HTTPS requests, and the webmaster is likely to block that IP address to stop further scraping. A better approach is to create a proxy pool and rotate through it, switching to the next proxy after a certain number of requests.

This reduces the chances of IP blocking and the scraper remains unaffected.

proxies = {'http://78.47.16.54:80', 'http://203.75.190.21:80', 'http://77.72.3.163:80'}
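
One way to rotate after a fixed number of requests is sketched below. It is only an illustration: the helper name `rotating_proxies` and the `requests_per_proxy` parameter are assumptions made for this example, and the pool is sorted purely so that the rotation order is predictable (a Python set has no guaranteed order).

```python
import itertools


def rotating_proxies(proxy_pool, requests_per_proxy=2):
    """Return an endless iterator over the pool in round-robin order.

    Each proxy is yielded requests_per_proxy times in a row before
    the iterator moves on to the next proxy.
    """
    if requests_per_proxy < 1:
        raise ValueError("requests_per_proxy must be at least 1")
    ordered = sorted(proxy_pool)
    if not ordered:
        raise ValueError("proxy_pool must not be empty")
    return (proxy
            for proxy in itertools.cycle(ordered)
            for _ in range(requests_per_proxy))


proxies = {"http://78.47.16.54:80", "http://203.75.190.21:80",
           "http://77.72.3.163:80"}
rotation = rotating_proxies(proxies, requests_per_proxy=2)

# Take the proxy to use for each of the next seven requests.
first_seven = list(itertools.islice(rotation, 7))
for proxy in first_seven:
    print(proxy)
```

Before each request the scraper calls `next(rotation)` to pick the proxy to use, so the switch to the next proxy happens automatically once the current one has served its share of requests.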

How to use a proxy in requests module?

  • Import the requests module.
  • Create a pool of proxies and then rotate/iterate them.
  • Send a GET request using requests.get(), passing the proxy through the proxies parameter.
  • If there is no connection error, the response contains the IP address the server saw for the current session, which is the proxy server's address.

Program:

Python3




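import requests

# Initialise proxy and url.
# The proxy is an example address from the pool above;
# substitute a proxy that is actually working.
proxy = "http://78.47.16.54:80"
# https://ipecho.net/plain returns the IP address the request came from.
url = "https://ipecho.net/plain"

# Send a GET request to the url and
# pass the proxy as parameter.
page = requests.get(url,
                    proxies={"http": proxy, "https": proxy},
                    timeout=10)

# Prints the content of the requested url,
# i.e. the IP address seen by the server.
print(page.text)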

Output:

114.121.248.251

The same approach works with multiple proxies. The implementation is given below.

Program:

Python3




# Import the required Modules
import requests

# Create a pool of proxies
proxies = {
    "http://78.47.16.54:80",
    "http://203.75.190.21:80",
    "http://77.72.3.163:80",
}

# https://ipecho.net/plain returns the ip address
# of the current session if a GET request is sent.
url = "https://ipecho.net/plain"

# Iterate the proxies and check if it is working.
for proxy in proxies:
    try:
        page = requests.get(
            url, proxies={"http": proxy, "https": proxy}, timeout=10)

        # Prints Proxy server IP address if proxy is alive.
        print("Status OK, Output:", page.text)

    except OSError as e:

        # Proxy returns Connection error. The exceptions raised by
        # requests derive from OSError, so timeouts are caught too.
        print(e)

Output:

[Image: proxy in request module]
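
Checking proxies one at a time can be slow when the pool is large, since every dead proxy has to time out. As a rough sketch (not part of the program above), the checks can run concurrently and the live proxies can be collected for later rotation. The names `find_working_proxies` and `check` are assumptions for this example; `check` is a hypothetical callable that takes a proxy and returns True if it is usable.

```python
from concurrent.futures import ThreadPoolExecutor


def find_working_proxies(proxy_pool, check, max_workers=8):
    """Return the proxies for which check(proxy) returns a true value.

    check is a hypothetical callable supplied by the caller; any
    exception it raises is treated as a dead proxy. The order of the
    input pool is preserved in the result.
    """
    def safe_check(proxy):
        try:
            return bool(check(proxy))
        except Exception:
            return False

    candidates = list(proxy_pool)
    if not candidates:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as candidates.
        results = list(executor.map(safe_check, candidates))
    return [proxy for proxy, alive in zip(candidates, results) if alive]
```

A `check` function for the pool above might wrap `requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)` and return `page.ok`; threads suit this job because the work is almost entirely waiting on the network.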
