
How to not get caught while web scraping?

In this article, we are going to discuss how to avoid getting caught while web scraping. Let's look at each of these techniques in detail:

Robots.txt

Before scraping a site, check its robots.txt file (for example, https://example.com/robots.txt). It lists which paths crawlers may visit and sometimes specifies a crawl delay. Respecting it is the simplest way to stay off a site's radar.
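As a sketch of how to check these rules programmatically, Python's standard library ships `urllib.robotparser`. The robots.txt snippet below is made up for illustration; against a live site you would use `set_url()` and `read()` instead of `parse()`:

```python
from urllib.robotparser import RobotFileParser

# Normally you would point the parser at the live file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an illustrative snippet directly.
rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/index.html"))    # True
print(rp.crawl_delay("*"))                                    # 10
```

If `crawl_delay()` returns a value, use it as your minimum pause between requests.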



IP Rotation

Sending many requests from a single IP address is the most common way scrapers get flagged. Rotating requests through a pool of proxy IPs spreads the traffic out so no single address draws attention.

Proxy Types:

- Datacenter proxies: cheap and fast, but easiest for sites to detect and block.
- Residential proxies: IPs assigned by ISPs to real households, harder to block.
- Mobile proxies: IPs from mobile carriers, hardest to block but the most expensive.

Example: 

Syntax: 



requests.get(url, proxies={"http": proxy, "https": proxy})

Program:




# Import the required modules
import requests

# https://ipecho.net/plain returns the IP address of the
# current session when a GET request is sent.
url = 'https://ipecho.net/plain'

# Create a pool of proxies
proxies = {
    'http://114.121.248.251:8080',
    'http://222.85.190.32:8090',
    'http://47.107.128.69:888',
    'http://41.65.146.38:8080',
    'http://190.63.184.11:8080',
    'http://45.7.135.34:999',
    'http://141.94.104.25:8080',
    'http://222.74.202.229:8080',
    'http://141.94.106.43:8080',
    'http://191.101.39.96:80'
}

# Iterate over the proxies and check whether each one is working.
for proxy in proxies:
    try:
        page = requests.get(url,
                            proxies={"http": proxy, "https": proxy})

        # Prints the proxy server's IP address if the proxy is alive.
        print("Status OK, Output:", page.text)
    except OSError as e:
        # The proxy returned a connection error.
        print(e)

User-Agent

The User-Agent header identifies the browser and operating system making the request. The default python-requests User-Agent is an instant giveaway, so send a real browser string and rotate it between requests.

Example: 




# Set a custom User-Agent header
import requests

# httpbin.org/user-agent echoes back the User-Agent it receives
url = 'https://httpbin.org/user-agent'

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) \
AppleWebKit/537.2 (KHTML, like Gecko) Chrome/22.0.1216.0 \
Safari/537.2'}
response = requests.get(url, headers=header)


# Use UserAgent from the fake_useragent module
import requests
from fake_useragent import UserAgent

ua = UserAgent()
header = {'User-Agent': str(ua.chrome)}
response = requests.get(url, headers=header)
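To actually rotate User-Agents rather than reuse a single one, you can pick from a pool on each request. A minimal sketch — the strings below are illustrative examples, and in practice you would keep the list current (or let fake_useragent supply fresh ones):

```python
import random

# Illustrative pool of desktop User-Agent strings (assumption:
# in production, keep these up to date).
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_header():
    """Pick a different User-Agent for each request."""
    return {'User-Agent': random.choice(user_agents)}

# Usage: response = requests.get(url, headers=random_header())
print(random_header()['User-Agent'])
```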

Referrer Header

The Referer header tells the server which page linked to the one being requested. Setting it to a search-engine results page makes the request look like organic traffic. (Note: the HTTP header is spelled "Referer" in the standard.)

Syntax:

requests.get(url, headers={'Referer': referrer_url})

Headless Browser

Some sites render their content with JavaScript or fingerprint the client, so plain HTTP requests are not enough. A headless browser runs a real browser engine without a visible window, so the site sees ordinary browser behaviour.

Example:




# Using the Selenium Chrome WebDriver to create a headless
# browser (Selenium 4 syntax; Selenium Manager locates the
# driver, so no executable_path is needed)

from selenium import webdriver

url = 'https://example.com'  # page to scrape

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get(url)
print(driver.page_source)
driver.quit()

Time Intervals

Firing requests in rapid succession is a clear bot signature. Pause between requests:

Example: 

import time

time.sleep(1)   # Sleeps for 1 second
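A fixed one-second pause is itself a recognisable pattern, so a common refinement is to randomise the delay. A small sketch — the 2–6 second default range is an assumption, tune it per site (and honour any Crawl-delay from robots.txt):

```python
import random
import time

def polite_sleep(low=2.0, high=6.0):
    """Sleep for a random interval so request timing looks human."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Between requests:
# for url in urls:
#     page = requests.get(url)
#     polite_sleep()

# Quick demo with a short range so it returns immediately:
print(f"slept {polite_sleep(0.05, 0.1):.2f}s")
```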

Captcha Solving

When a site suspects automated traffic, it may serve a CAPTCHA, which halts the scraper until the challenge is solved. There are a few CAPTCHA-solving services, such as:

- 2Captcha
- Anti Captcha
- Death By Captcha

Avoid Honeypot Traps

A honeypot is a link that is invisible to human visitors (for example, styled with display: none or coloured to match the background) but still present in the HTML. A client that follows it is almost certainly a bot, so skip hidden links while crawling.
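One way to skip such traps, sketched here with only the standard library. The inline-style check is a simplification: real pages may also hide links via CSS classes or external stylesheets, which this does not catch.

```python
from html.parser import HTMLParser

class HoneypotLinkFinder(HTMLParser):
    """Separates links hidden from human visitors (a common honeypot
    pattern: display:none or visibility:hidden) from visible ones."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        style = attrs.get('style', '').replace(' ', '').lower()
        href = attrs.get('href', '')
        if 'display:none' in style or 'visibility:hidden' in style:
            self.hidden_links.append(href)
        else:
            self.visible_links.append(href)

# Illustrative HTML containing one hidden trap link
html = """
<a href="/page1">Products</a>
<a href="/trap" style="display: none">secret</a>
<a href="/page2">About</a>
"""

finder = HoneypotLinkFinder()
finder.feed(html)
print("Follow:", finder.visible_links)  # ['/page1', '/page2']
print("Skip:  ", finder.hidden_links)   # ['/trap']
```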

Detect Website Changes

Sites change their layout over time, and some serve a different layout to suspected bots. Detect such changes and adapt your parser, rather than letting the scraper hammer the site with repeated failing requests, which itself looks suspicious.
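A lightweight way to notice layout changes is to fingerprint the page's tag skeleton and compare it against a stored baseline. A sketch — the regex-based tag extraction below is deliberately crude and only meant to illustrate the idea:

```python
import hashlib
import re

def page_fingerprint(html: str) -> str:
    """Hash of the page's tag skeleton: text changes won't trip it,
    structural (layout) changes will."""
    tags = re.findall(r"<\s*([a-zA-Z0-9]+)", html)
    return hashlib.sha256(" ".join(tags).encode()).hexdigest()

baseline = page_fingerprint("<html><body><div class='price'>10</div></body></html>")
latest   = page_fingerprint("<html><body><div class='price'>12</div></body></html>")
print(baseline == latest)   # True: only the text changed

changed  = page_fingerprint("<html><body><span>10</span></body></html>")
print(baseline == changed)  # False: the layout changed
```

Store the baseline fingerprint alongside your scraper and raise an alert when a fetched page no longer matches it.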

So these are the ways by which you can avoid getting caught during web scraping.

