Movie recommender based on plot summary using TF-IDF Vectorization and Cosine similarity
Movies can be recommended to users in several ways, most commonly through content-based filtering or collaborative filtering. Content-based filtering focuses on item similarity, i.e., how similar the movies themselves are, whereas collaborative filtering relates users who have similar viewing preferences.
Based on the plot of a movie the user has already watched, movies with similar plots can be recommended. This approach falls under content-based filtering because the recommendations depend only on the movies in the user's own past activity, not on the behavior of other users.
Dataset used: A Kaggle dataset that was scraped from Wikipedia and contains plot summaries of movies.
Code: Reading the dataset:
34886 #Length of the dataset (total number of rows/movies)
#Movies of various origins present in the dataset
array(['American', 'Assamese', 'Australian', 'Bangladeshi', 'Bengali', 'Bollywood', 'British', 'Canadian', 'Chinese', 'Egyptian', 'Filipino', 'Hong Kong', 'Japanese', 'Kannada', 'Malayalam', 'Malaysian', 'Maldivian', 'Marathi', 'Punjabi', 'Russian', 'South_Korean', 'Tamil', 'Telugu', 'Turkish'], dtype=object)
17377 #Number of movies of American origin
3670 #Number of movies of British origin
Of the columns in the dataset, only the movie name and the movie plot are required. We also restrict the data to American and British movies, which gives a subset of 21047 movies.
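The subsetting step can be sketched with pandas as follows. This is a minimal illustration on a toy frame, not the original listing: the column names `Title`, `Origin` and `Plot`, the sample rows, and the helper `select_origins` are all assumptions and should be adapted to the actual CSV header.

```python
import pandas as pd

# Toy stand-in for the Kaggle data. Column names and rows are assumptions
# made for illustration; adjust them to match the real dataset.
movies = pd.DataFrame(
    {
        "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "Origin": ["American", "British", "Tamil", "American"],
        "Plot": ["plot a", "plot b", "plot c", "plot d"],
    }
)


def select_origins(df, origins=("American", "British")):
    """Keep only the title and plot of movies from the given origins."""
    mask = df["Origin"].isin(origins)
    return df.loc[mask, ["Title", "Plot"]].reset_index(drop=True)


subset = select_origins(movies)
print(len(subset))       # number of rows in the subset
print(subset.head(10))   # first rows of the subset
```

Resetting the index keeps row positions contiguous, which is convenient later when a row number is used to look up a movie's vector.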
21047 #Number of rows in the new dataset
#First 10 rows of the new dataset
#Plot of the first movie
A bartender is working at a saloon, serving drinks to customers. After he fills a stereotypically Irish man's bucket with beer, Carrie Nation and her followers burst inside. They assault the Irish man, pulling his hat over his eyes and then dumping the beer over his head. The group then begin wrecking the bar, smashing the fixtures, mirrors, and breaking the cash register. The bartender then sprays seltzer water in Nation's face before a group of policemen appear and order everybody to leave.
Code: Applying natural language processing techniques to pre-process the movie plots:
Data pre-processing steps:
- Each plot summary is converted into tokens using the NLTK word tokenizer.
- The POS tags of the tokens are extracted using the NLTK POS tagger.
- Lemmatization is preferred over stemming because lemmatization performs a morphological analysis of words.
- Lemmatization removes the inflectional endings of tokens, using the NLTK WordNet lemmatizer.
- Common words (stop words) are removed so that the remaining tokens carry more weight. The English stop words are downloaded from the NLTK library and removed from the movie plots.
- A few common contractions are replaced with their expanded forms.
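A few of the steps above can be sketched without any downloads. Note that this is a simplified stand-in rather than the NLTK pipeline described in the list: a regular expression replaces the NLTK word tokenizer, the contraction and stop-word tables are small hand-written examples, and POS tagging and lemmatization are left out because they rely on NLTK's downloaded resources. The name `preprocess` is hypothetical.

```python
import re

# Illustrative tables only (assumptions); NLTK's stop-word list is much longer.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "doesn't": "does not",
    "it's": "it is",
}
STOP_WORDS = {"a", "an", "and", "at", "is", "it", "not", "of", "the", "to", "will"}


def preprocess(plot):
    """Lowercase, expand contractions, tokenize, and drop stop words."""
    text = plot.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    tokens = re.findall(r"[a-z]+", text)  # stand-in for the NLTK tokenizer
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The bartender can't stop the group, and it's wrecking the bar."))
```

Contractions are expanded before tokenizing so that the apostrophe does not split a word such as "can't" into fragments.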
TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY) Vectorization:
- Term Frequency (TF): The number of times a word appears in a document divided by the total number of words in the document. Every document has its own term frequencies.
- Inverse Document Frequency (IDF): The log of the total number of documents divided by the number of documents that contain the word. IDF gives a higher weight to words that are rare across the corpus.
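The two definitions above translate almost directly into code. The following is a minimal pure-Python sketch of those textbook formulas on a made-up three-document corpus; the function names and the sample tokens are assumptions chosen for illustration.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """TF: count of each word divided by the document's total word count."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def inverse_document_frequency(docs):
    """IDF: log(number of documents / number of documents containing the word)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def tf_idf(docs):
    """Multiply each word's TF in a document by its corpus-wide IDF."""
    idf = inverse_document_frequency(docs)
    return [
        {word: tf * idf[word] for word, tf in term_frequency(doc).items()}
        for doc in docs
    ]


docs = [
    ["wizard", "school", "magic"],
    ["wizard", "dragon"],
    ["train", "escape"],
]
weights = tf_idf(docs)
print(weights[0])  # "magic" outweighs "wizard", which occurs in two documents
```

scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from this plain version even though the idea is the same.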
scikit-learn provides a transformer called TfidfVectorizer, in the sklearn.feature_extraction.text module, for vectorizing text with TF-IDF scores.
The movie plots are transformed into vectors in a geometric space, so the angle between two vectors indicates how close the two plots are. Cosine similarity measures this closeness as the cosine of the angle between the two vectors: the smaller the angle, the closer the value is to 1 and the more similar the plots.
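As a small worked example of the measure, the cosine of the angle is the dot product of the two vectors divided by the product of their lengths. The sketch below uses made-up three-dimensional vectors, and returning 0.0 for an all-zero vector is a convention chosen here, not something prescribed by the article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


plot_a = [1.0, 2.0, 0.0]
plot_b = [2.0, 4.0, 0.0]  # same direction as plot_a, different length
plot_c = [0.0, 0.0, 3.0]  # shares no terms with plot_a

print(cosine_similarity(plot_a, plot_b))  # close to 1: same direction
print(cosine_similarity(plot_a, plot_c))  # 0: no overlap
```

Because only the direction matters, a long plot and a short plot that use the same words in the same proportions still score as highly similar.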
Code: Building a recommendation function that returns the top 10 similar movies:
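The original listing is not reproduced here. One way such a function might look, using scikit-learn's TfidfVectorizer and cosine_similarity, is sketched below. The names `build_recommender` and `recommend`, the toy titles, and the toy plots are all hypothetical, and raw plot strings are vectorized directly instead of the pre-processed plots.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_recommender(titles, plots, top_n=10):
    """Return a function mapping a title to its top_n most similar titles."""
    tfidf = TfidfVectorizer().fit_transform(plots)  # one TF-IDF row per plot
    similarity = cosine_similarity(tfidf)           # dense (n, n) matrix
    index_of = {title: i for i, title in enumerate(titles)}

    def recommend(title):
        query = index_of[title]
        row = similarity[query]
        # Rank all movies by similarity to the query, highest first.
        ranked = sorted(range(len(titles)), key=lambda i: row[i], reverse=True)
        return [titles[i] for i in ranked if i != query][:top_n]

    return recommend


# Hypothetical toy data, for illustration only.
titles = ["Wizard School", "Wizard Tournament", "Prison Train"]
plots = [
    "a young wizard attends a school of magic",
    "a young wizard enters a magic tournament",
    "two convicts escape prison on a runaway train",
]
recommend = build_recommender(titles, plots)
print(recommend("Wizard School"))
```

The full pairwise matrix is convenient for a toy corpus, but for roughly 21000 plots it holds several hundred million entries, so computing the similarity of one query row against the TF-IDF matrix at a time is likely the more practical choice.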
Code: Using the above function to get plot-based recommendations:
Recommendations for the movie "Harry Potter and the Chamber of Secrets" ["Harry Potter and the Sorcerer's Stone", "Harry Potter and the Philosopher's Stone", 'Harry Potter and the Deathly Hallows: Part I', 'Harry Potter and the Deathly Hallows: Part 1', 'Harry Potter and the Half-Blood Prince', 'Harry Potter and the Deathly Hallows: Part II', 'Harry Potter and the Deathly Hallows: Part 2', 'Harry Potter and the Order of the Phoenix', 'Harry Potter and the Goblet of Fire', 'Harry Potter and the Prisoner of Azkaban']
Recommendations for the movie "Ice Age" ['Ice Age: The Meltdown', 'Ice Age: Dawn of the Dinosaurs', 'The Wrong Man', 'Ice Age: Continental Drift', 'The Buttercup Chain', 'Ice Age: Collision Course', 'Runaway Train', 'Corrina, Corrina', 'Sid and Nancy', 'Zorro, the Gay Blade']
Recommendations for the movie "Blackmail" ['Checkpoint', 'Odds Against Tomorrow', 'The Beast with Five Fingers', 'Fruitvale Station', 'The Exile', 'The Black Swan', 'Small Town Gay Bar', 'Eye of the Cat', 'Blown Away', 'Brenda Starr, Reporter']