We can scrape the IMDb movie ratings and their details with the help of the BeautifulSoup library of Python.
Modules Needed:
Below is the list of modules required to scrape data from IMDb.
- requests: A third-party library for making HTTP requests to a specified URL. It is widely used for both REST APIs and web scraping. When one makes a request to a URI, it returns a response object.
- html5lib: A pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. (The code in this article passes "html.parser", Python's built-in parser, to BeautifulSoup.)
- bs4: Beautiful Soup is a Python library for parsing HTML and XML documents; it provides the BeautifulSoup object used below. Web scraping is the process of extracting data from a website with automated tools, which is faster than collecting it by hand.
- pandas: A library built on top of NumPy that provides data structures, such as the DataFrame, and operations for manipulating tabular data.
Approach:
Steps to implement web scraping in Python to extract IMDb movie details and their ratings:
- Import the required modules.
Python3
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
- Access the HTML content of the webpage by assigning the URL and creating a soup object.
Python3
# Downloading IMDb top 250 movies' data
url = 'https://www.imdb.com/chart/top/'  # assumed URL of the Top 250 chart
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
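In practice the request can fail or be rejected, so it may help to check the response before parsing it. The sketch below shows one possible way to do that; the `fetch_html` helper, the `HEADERS` value, and the 10-second timeout are illustrative assumptions and not part of the original code.

```python
import requests

# Hypothetical browser-like header; some sites reject the default
# python-requests User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}


def fetch_html(url, timeout=10):
    """Return the page body as text, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes.
    response.raise_for_status()
    return response.text


# Usage (performs a network request):
# soup = BeautifulSoup(fetch_html(url), "html.parser")
```

Failing early with `raise_for_status()` avoids parsing an error page and ending up with empty result lists later on.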
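- Extract the movie ratings and their details. Here, we extract data from the BeautifulSoup object using CSS selectors and HTML attributes such as title and data-value. The selectors assume the table-based layout of the chart page (td.titleColumn, td.posterColumn) and need updating if the page markup differs.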
Python3
movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]
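To see what these selectors return without downloading anything, they can be tried on a small inline HTML fragment. The fragment below is a made-up, single-row stand-in modeled on the class and attribute names used in the selectors above; it is an assumption for illustration, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table shaped like the markup the selectors expect.
html = """
<table>
  <tr>
    <td class="posterColumn"><span name="ir" data-value="9.2"></span></td>
    <td class="titleColumn">1. <a href="/title/example/" title="Director A (dir.), Actor A, Actor B">Movie A</a> <span>(1994)</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching tags.
cells = soup.select("td.titleColumn")
# attrs.get() reads an attribute value, or returns None if it is absent.
crew = [a.attrs.get("title") for a in soup.select("td.titleColumn a")]
ratings = [s.attrs.get("data-value")
           for s in soup.select("td.posterColumn span[name=ir]")]

print(" ".join(cells[0].get_text().split()))
print(crew)
print(ratings)
```

Note that attribute values such as `data-value` come back as strings, not numbers.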
- After extracting the movie details, create an empty list, store each movie's details in a dictionary, and append that dictionary to the list.
Python3
# Create an empty list for storing movie information.
# Note: the name `list` shadows Python's built-in list type.
list = []

# Iterate over movies to extract each movie's details
for index in range(len(movies)):
    # Separate the movie text into 'place', 'title' and 'year'
    movie_string = movies[index].get_text()
    # Collapse whitespace, e.g. "1. The Shawshank Redemption (1994)"
    movie = ' '.join(movie_string.split())
    # The place is the number before the first '.'
    place, rest = movie.split('.', 1)
    # The title is everything before the trailing "(year)"
    movie_title = rest.rsplit('(', 1)[0].strip()
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)
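The splitting of each row's text into place, title, and year can also be expressed as a single regular expression. The following is a minimal sketch, assuming the text has the form "place. title (year)"; the `parse_movie_line` helper, the `MOVIE_PATTERN` name, and the sample inputs are hypothetical and only serve to illustrate the idea.

```python
import re

# Matches strings such as "10. Fight Club (1999)".
MOVIE_PATTERN = re.compile(r"^(\d+)\.\s+(.*)\s+\((\d{4})\)$")


def parse_movie_line(text):
    """Split a chart row's text into place, title, and year."""
    # Collapse newlines and repeated spaces first.
    line = " ".join(text.split())
    match = MOVIE_PATTERN.match(line)
    if match is None:
        return None  # text does not follow the assumed format
    place, title, year = match.groups()
    return {"place": place, "movie_title": title, "year": year}


# Illustrative inputs; the places are made up.
print(parse_movie_line("\n 10.\n Fight Club\n (1999)\n"))
print(parse_movie_line("58. Dr. Strangelove (1964)"))
```

Because the pattern anchors on the leading number and the trailing year, two-digit places and titles that contain a period are handled without counting characters.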
- Now our list is filled with the top IMDb movies along with their details. Then display the list of movie details.
Python3
for movie in list:
    print(movie['place'], '-', movie['movie_title'], '(' + movie['year'] +
          ') -', 'Starring:', movie['star_cast'], movie['rating'])
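Since the ratings are read from an HTML attribute, they are stored as strings. If the list needs to be sorted or filtered by rating, converting them with `float` first avoids string comparison. The entries below are made-up sample values in the same shape as the dictionaries built above.

```python
# Illustrative entries; titles and ratings are placeholders.
sample = [
    {"place": "1", "movie_title": "Movie A", "rating": "9.2"},
    {"place": "2", "movie_title": "Movie B", "rating": "10.0"},
    {"place": "3", "movie_title": "Movie C", "rating": "8.9"},
]

# Comparing the raw strings would rank "10.0" below "8.9";
# converting to float gives the numeric order.
by_rating = sorted(sample, key=lambda m: float(m["rating"]), reverse=True)
titles = [m["movie_title"] for m in by_rating]
print(titles)
```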
- Using the following lines of code, the same data can be saved to a .csv file and used further as a dataset.
Python3
# Save the list as a DataFrame,
# then convert it into a .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv', index=False)
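To get a feel for the resulting file layout, the same list-of-dictionaries shape can be written with Python's standard-library `csv` module. This is a swapped-in alternative to pandas, shown only as a sketch; the `FIELDS` list and the sample row are assumptions, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

# Column order matching the keys of the dictionaries built earlier.
FIELDS = ["place", "movie_title", "rating", "year", "star_cast"]

# Made-up sample row for illustration.
rows = [
    {"place": "1", "movie_title": "Movie A", "rating": "9.2",
     "year": "1994", "star_cast": "Director A (dir.), Actor A"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()    # first line: the column names
writer.writerows(rows)  # one line per movie
csv_text = buffer.getvalue()
print(csv_text)
```

The star_cast value contains a comma, so the writer wraps it in double quotes to keep it in a single column.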
Implementation: Complete Code
Python3
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Downloading IMDb top 250 movies' data
url = 'https://www.imdb.com/chart/top/'  # assumed URL of the Top 250 chart
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

movies = soup.select('td.titleColumn')
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
           for b in soup.select('td.posterColumn span[name=ir]')]

# Create an empty list for storing movie information.
# Note: the name `list` shadows Python's built-in list type.
list = []

# Iterate over movies to extract each movie's details
for index in range(len(movies)):
    # Separate the movie text into 'place', 'title' and 'year'
    movie_string = movies[index].get_text()
    # Collapse whitespace, e.g. "1. The Shawshank Redemption (1994)"
    movie = ' '.join(movie_string.split())
    # The place is the number before the first '.'
    place, rest = movie.split('.', 1)
    # The title is everything before the trailing "(year)"
    movie_title = rest.rsplit('(', 1)[0].strip()
    year = re.search(r'\((.*?)\)', movie_string).group(1)
    data = {"place": place,
            "movie_title": movie_title,
            "rating": ratings[index],
            "year": year,
            "star_cast": crew[index],
            }
    list.append(data)

# Print each movie's details with its rating
for movie in list:
    print(movie['place'], '-', movie['movie_title'], '(' + movie['year'] +
          ') -', 'Starring:', movie['star_cast'], movie['rating'])

# Save the list as a DataFrame, then convert it into a .csv file
df = pd.DataFrame(list)
df.to_csv('imdb_top_250_movies.csv', index=False)
Output:
Along with the movie details printed in the terminal, a .csv file with the given name is saved in the current working directory. It contains one row per movie with the columns place, movie_title, rating, year, and star_cast.