Recommender System using PySpark – Python
A recommender system is a type of information filtering system that provides personalized recommendations to users based on their preferences, interests, and past behaviors. Recommender systems come in a variety of forms, such as content-based, collaborative filtering, and hybrid systems. Content-based systems make recommendations for products based on how closely their characteristics match those of products the user has previously expressed interest in. Collaborative filtering systems recommend items based on the preferences of users who have similar interests to the user being recommended. Hybrid systems combine both content-based and collaborative filtering approaches to make recommendations.
We will implement this with the help of collaborative filtering. Collaborative filtering involves making predictions (filtering) about a user’s interests by compiling preferences or taste data from numerous users (collaborating). The essential premise is that, if two users A and B share the same opinion on one subject, A is more likely to share B’s opinion on a different subject x than to share the opinion of a randomly selected user.
Spark MLlib implements collaborative filtering using the Alternating Least Squares (ALS) algorithm. The MLlib implementation has the following parameters:
- numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure).
- rank is the number of latent factors in the model.
- iterations is the number of iterations of ALS to run.
- lambda specifies the regularization parameter in ALS.
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data.
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
In this tutorial, we will use a book-ratings dataset.
Step 1: Import the necessary libraries and functions and Setup Spark Session
Python3
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
spark = SparkSession.builder.appName('Recommender').getOrCreate()
spark
Output:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.3.1
Master: local[*]
AppName: Recommender
Step 2: Reading the data from the dataset
Python3
data = spark.read.csv('book_ratings.csv',
                      inferSchema=True, header=True)
data.show(5)
Output:
+-------+-------+------+
|book_id|user_id|rating|
+-------+-------+------+
| 1| 314| 5|
| 1| 439| 3|
| 1| 588| 5|
| 1| 1169| 4|
| 1| 1185| 4|
+-------+-------+------+
only showing top 5 rows
Describe the dataset
Output:
+-------+-----------------+------------------+------------------+
|summary| book_id| user_id| rating|
+-------+-----------------+------------------+------------------+
| count| 981756| 981756| 981756|
| mean|4943.275635697668|25616.759933221696|3.8565335989797873|
| stddev|2873.207414896143|15228.338825882149|0.9839408559619973|
| min| 1| 1| 1|
| max| 10000| 53424| 5|
+-------+-----------------+------------------+------------------+
Step 3: Splitting the data into training and testing sets
Python3
train_data, test_data = data.randomSplit([0.8, 0.2])
Step 4: Import the Alternating Least Squares (ALS) method and apply it
Python3
als = ALS(maxIter=5,
          regParam=0.01,
          userCol="user_id",
          itemCol="book_id",
          ratingCol="rating")
model = als.fit(train_data)
Step 5: Predictions
Python3
predictions = model.transform(test_data)
predictions.show()
Output:
+-------+-------+------+----------+
|book_id|user_id|rating|prediction|
+-------+-------+------+----------+
| 2| 6342| 3| 4.8064413|
| 1| 17984| 5| 4.9681554|
| 1| 38475| 4| 4.4078903|
| 2| 6630| 5| 4.344222|
| 1| 32055| 4| 3.990228|
| 1| 33697| 4| 3.7945805|
| 1| 18313| 5| 4.533183|
| 1| 5461| 3| 3.8614116|
| 1| 47800| 5| 4.914357|
| 2| 10751| 3| 4.160536|
| 1| 16377| 4| 5.304298|
| 1| 45493| 5| 3.998557|
| 2| 10509| 2| 1.8626969|
| 1| 33890| 3| 3.6022692|
| 1| 37284| 5| 4.8147345|
| 1| 1185| 4| 3.7463336|
| 1| 44397| 5| 5.0251017|
| 1| 46977| 4| 4.0746284|
| 1| 10944| 5| 4.343548|
| 2| 8167| 2| 3.705464|
+-------+-------+------+----------+
only showing top 20 rows
Evaluation
Python3
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))
Output:
Root-mean-square error = nan
Step 6: Recommendations
Now, we will recommend books to a single user, user1 (say, user_id 5461), with the help of our trained model.
Python3
user1 = test_data.filter(test_data['user_id'] == 5461).select(['book_id', 'user_id'])
user1.show()
Output:
+-------+-------+
|book_id|user_id|
+-------+-------+
| 1| 5461|
| 11| 5461|
| 19| 5461|
| 46| 5461|
| 60| 5461|
| 66| 5461|
| 93| 5461|
| 111| 5461|
| 121| 5461|
| 172| 5461|
| 194| 5461|
| 212| 5461|
| 222| 5461|
| 245| 5461|
| 264| 5461|
| 281| 5461|
| 301| 5461|
| 354| 5461|
| 388| 5461|
| 454| 5461|
+-------+-------+
only showing top 20 rows
Python3
recommendations = model.transform(user1)
recommendations.orderBy('prediction', ascending=False).show()
Output:
+-------+-------+----------+
|book_id|user_id|prediction|
+-------+-------+----------+
| 19| 5461| 5.3429904|
| 11| 5461| 4.830688|
| 66| 5461| 4.804107|
| 245| 5461| 4.705879|
| 388| 5461| 4.6276107|
| 1161| 5461| 4.612251|
| 60| 5461| 4.5895457|
| 1402| 5461| 4.5184|
| 1088| 5461| 4.454755|
| 5152| 5461| 4.415825|
| 121| 5461| 4.3423634|
| 93| 5461| 4.3357944|
| 1796| 5461| 4.30891|
| 172| 5461| 4.2679276|
| 454| 5461| 4.245925|
| 1211| 5461| 4.2431927|
| 731| 5461| 4.1873074|
| 1094| 5461| 4.1829815|
| 222| 5461| 4.182873|
| 264| 5461| 4.1469045|
+-------+-------+----------+
only showing top 20 rows
The output above shows the predicted rating for each book ID for the user with user_id 5461, sorted in descending order of prediction.
Step 7: Stop the Spark session
Last Updated: 09 May, 2023