Web scraping is a technique for fetching data from websites. Many websites do not let users save their data for personal use, and manually copy-pasting it is both tedious and time-consuming. Web scraping automates the data-extraction process. In this article, we will discuss how to download all images from a web page using Python.
Modules Needed
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python.
- requests: Requests allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python.
- os: The OS module in Python provides functions for interacting with the operating system. It comes under Python's standard utility modules and provides a portable way of using operating-system-dependent functionality.
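The two third-party modules can be installed with pip (os ships with Python's standard library):

```shell
pip install bs4 requests
```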
Approach
- Import modules.
- Get the HTML code of the page.
- Get the list of img tags from the HTML code using the findAll method in Beautiful Soup.
images = soup.findAll('img')
- Create a separate folder for the downloaded images using the mkdir method in os.
os.mkdir(folder_name)
- Iterate over all the images and get the source URL of each image.
- After getting the source URL, the last step is to download the image.
- Fetch the content of the image.
r = requests.get(image_link).content
- Download the image using file handling.
# Enter the file name with an extension like jpg, png, etc.
with open("File Name", "wb+") as f:
    f.write(r)
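One detail the steps above gloss over: the source URL taken from an img tag is often a relative path, so it may need to be joined with the page URL before it can be fetched. A minimal sketch using urllib.parse.urljoin (the URLs here are illustrative, not from the original article):

```python
from urllib.parse import urljoin

# hypothetical page URL and a relative src pulled from an img tag
page_url = "https://example.com/gallery/index.html"
image_link = "/static/cat.jpg"

# urljoin resolves the relative path against the page URL
full_url = urljoin(page_url, image_link)
print(full_url)  # https://example.com/static/cat.jpg
```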
Program:
from bs4 import BeautifulSoup
import requests
import os


# CREATE FOLDER
def folder_create(images):
    try:
        folder_name = input("Enter Folder Name:- ")
        # folder creation
        os.mkdir(folder_name)
    # if a folder already exists with that name, ask for another name
    except FileExistsError:
        print("Folder exists with that name!")
        folder_create(images)
        return
    # image downloading starts
    download_images(images, folder_name)


# DOWNLOAD ALL IMAGES FROM THAT URL
def download_images(images, folder_name):
    # initial count is zero
    count = 0
    # print the total number of images found on the page
    print(f"Total {len(images)} Image Found!")
    # proceed only if at least one image was found
    if len(images) != 0:
        for i, image in enumerate(images):
            # From the image tag, fetch the image source URL.
            # Attributes are tried in this order:
            # 1. data-srcset
            # 2. data-src
            # 3. data-fallback-src
            # 4. src
            # Here we use exception handling:
            try:
                # in the image tag, search for "data-srcset"
                image_link = image["data-srcset"]
            except KeyError:
                try:
                    # then search for "data-src"
                    image_link = image["data-src"]
                except KeyError:
                    try:
                        # then search for "data-fallback-src"
                        image_link = image["data-fallback-src"]
                    except KeyError:
                        try:
                            # finally, search for "src"
                            image_link = image["src"]
                        # if no source URL is found, skip this tag
                        except KeyError:
                            continue
            # After getting the image source URL,
            # try to get the content of the image
            try:
                r = requests.get(image_link).content
                try:
                    # if the content decodes as UTF-8 text, it is
                    # probably an error page rather than an image
                    r = str(r, 'utf-8')
                except UnicodeDecodeError:
                    # binary content: write the image to disk
                    with open(f"{folder_name}/images{i + 1}.jpg", "wb+") as f:
                        f.write(r)
                    # count the downloaded image
                    count += 1
            except requests.exceptions.RequestException:
                pass
    # it is possible that not all images were downloaded
    if count == len(images):
        print("All Images Downloaded!")
    else:
        print(f"Total {count} Images Downloaded Out of {len(images)}")


# MAIN FUNCTION
def main(url):
    # content of the URL
    r = requests.get(url)
    # parse the HTML code
    soup = BeautifulSoup(r.text, 'html.parser')
    # find all images on the page
    images = soup.findAll('img')
    # call the folder-creation function
    folder_create(images)


# take the URL as input and call the main function
url = input("Enter URL:- ")
main(url)
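The program above saves every file with a .jpg extension, even if the source was a PNG or GIF. A small helper (hypothetical, not part of the original program) could instead guess the extension from the image URL's path, falling back to .jpg when none is present:

```python
import os
from urllib.parse import urlparse

def guess_extension(image_url, default=".jpg"):
    # take only the path part of the URL (drops any ?query string)
    path = urlparse(image_url).path
    # split off the extension, e.g. "/static/logo.png" -> ".png"
    ext = os.path.splitext(path)[1]
    return ext if ext else default

print(guess_extension("https://example.com/static/logo.png"))  # .png
print(guess_extension("https://example.com/img?id=42"))        # .jpg
```

This extension could then be used in place of the hard-coded ".jpg" when building the file name passed to open().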
Output: