
Recommender System using Pyspark – Python

A recommender system is a type of information filtering system that provides personalized recommendations to users based on their preferences, interests, and past behavior. Recommender systems come in several forms, such as content-based, collaborative filtering, and hybrid systems. Content-based systems recommend items whose characteristics closely match those of items the user has previously shown interest in. Collaborative filtering systems recommend items based on the preferences of users whose interests are similar to those of the user receiving the recommendations. Hybrid systems combine both content-based and collaborative filtering approaches.

We will implement this with the help of collaborative filtering. Collaborative filtering makes predictions (filtering) about a user’s interests by collecting preference or taste information from many users (collaborating). The underlying assumption is that if two users A and B share the same opinion on one issue, A is more likely to share B’s opinion on a different issue than the opinion of a randomly chosen user.



Recommender System using Pyspark 

Collaborative filtering is implemented in Spark’s machine learning library, MLlib, using the Alternating Least Squares (ALS) algorithm. The MLlib implementation exposes parameters such as maxIter (the number of iterations to run), regParam (the regularization strength), and the userCol, itemCol, and ratingCol column names, all of which we set in Step 4 below.
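Under the hood, ALS factorizes the sparse user-item rating matrix into a low-rank latent-factor vector for every user and every item; a predicted rating is simply the dot product of a user’s vector and an item’s vector. A toy illustration in plain Python (the factor values below are made up):

# Toy illustration with made-up numbers: ALS learns one latent-factor vector
# per user and one per item; a predicted rating is their dot product.
user_factors = [0.9, 0.1, 0.4]   # hypothetical factors for one user
item_factors = [0.8, 0.3, 0.5]   # hypothetical factors for one book
predicted_rating = sum(u * i for u, i in zip(user_factors, item_factors))
print(predicted_rating)          # 0.95 in this toy example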

In this example, we will use a book-ratings dataset (book_ratings.csv), where each row contains a book ID, a user ID, and a rating.



Step 1: Import the necessary libraries and set up the Spark session




# Importing the required PySpark libraries
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Setting up the Spark session
spark = SparkSession.builder.appName('Recommender').getOrCreate()
spark

Output:

SparkSession - in-memory
SparkContext
Version: v3.3.1
Master: local[*]
AppName: Recommender

Step 2: Read the data from the dataset




#CSV file can be downloaded from the link mentioned above.
data = spark.read.csv('book_ratings.csv',
                      inferSchema=True,header=True)
  
data.show(5)

Output:

+-------+-------+------+
|book_id|user_id|rating|
+-------+-------+------+
|      1|    314|     5|
|      1|    439|     3|
|      1|    588|     5|
|      1|   1169|     4|
|      1|   1185|     4|
+-------+-------+------+
only showing top 5 rows

Describe the dataset




data.describe().show()

Output:

+-------+-----------------+------------------+------------------+
|summary|          book_id|           user_id|            rating|
+-------+-----------------+------------------+------------------+
|  count|           981756|            981756|            981756|
|   mean|4943.275635697668|25616.759933221696|3.8565335989797873|
| stddev|2873.207414896143|15228.338825882149|0.9839408559619973|
|    min|                1|                 1|                 1|
|    max|            10000|             53424|                 5|
+-------+-----------------+------------------+------------------+

Step 3: Split the data into training and testing sets




# Dividing the data using random split into train_data and test_data 
# in 80% and 20% respectively
train_data, test_data = data.randomSplit([0.8, 0.2])
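If you need the same split on every run, randomSplit also accepts a seed; the value below is only illustrative:

# Optional: pass a seed for a reproducible split (seed value is illustrative)
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)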

Step 4: Import the Alternating Least Squares (ALS) method and apply it




# Build the recommendation model using ALS on the training data
als = ALS(maxIter=5,
          regParam=0.01,
          userCol="user_id",
          itemCol="book_id",
          ratingCol="rating")
  
#Fitting the model on the train_data
model = als.fit(train_data)
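In the snippet above, maxIter controls how many alternating least-squares passes are run and regParam controls the strength of the regularization; the values are taken directly from the example. If you want to tune them, MLlib’s cross-validation utilities (ParamGridBuilder and CrossValidator) can be used. The sketch below is only an illustration: the grid values, numFolds=3, and the coldStartStrategy="drop" setting (explained in the Evaluations step) are assumptions, not tuned results.

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Hypothetical tuning sketch; grid values are illustrative, not tuned results.
# coldStartStrategy="drop" removes rows ALS cannot score, so the RMSE stays
# defined during cross-validation.
als_cv = ALS(userCol="user_id", itemCol="book_id", ratingCol="rating",
             coldStartStrategy="drop")
param_grid = (ParamGridBuilder()
              .addGrid(als_cv.rank, [10, 20])
              .addGrid(als_cv.regParam, [0.01, 0.1])
              .build())
cv = CrossValidator(estimator=als_cv,
                    estimatorParamMaps=param_grid,
                    evaluator=RegressionEvaluator(metricName="rmse",
                                                  labelCol="rating",
                                                  predictionCol="prediction"),
                    numFolds=3)
cv_model = cv.fit(train_data)
best_model = cv_model.bestModel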

Step 5: Predictions




# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test_data)
  
#Displaying predictions calculated by the model
predictions.show()

Output:

+-------+-------+------+----------+
|book_id|user_id|rating|prediction|
+-------+-------+------+----------+
|      2|   6342|     3| 4.8064413|
|      1|  17984|     5| 4.9681554|
|      1|  38475|     4| 4.4078903|
|      2|   6630|     5|  4.344222|
|      1|  32055|     4|  3.990228|
|      1|  33697|     4| 3.7945805|
|      1|  18313|     5|  4.533183|
|      1|   5461|     3| 3.8614116|
|      1|  47800|     5|  4.914357|
|      2|  10751|     3|  4.160536|
|      1|  16377|     4|  5.304298|
|      1|  45493|     5|  3.998557|
|      2|  10509|     2| 1.8626969|
|      1|  33890|     3| 3.6022692|
|      1|  37284|     5| 4.8147345|
|      1|   1185|     4| 3.7463336|
|      1|  44397|     5| 5.0251017|
|      1|  46977|     4| 4.0746284|
|      1|  10944|     5|  4.343548|
|      2|   8167|     2|  3.705464|
+-------+-------+------+----------+
only showing top 20 rows

Evaluations




#Printing and calculating RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Output:

Root-mean-square error = nan
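The RMSE comes out as nan because the random split leaves some users or books in the test set that never appear in the training set; ALS cannot score these “cold-start” rows and returns NaN for them by default, which makes the overall RMSE NaN. A minimal fix, assuming the same column names as above, is to set coldStartStrategy="drop" on the ALS estimator so such rows are dropped before the metric is computed:

# Sketch: rebuild ALS with coldStartStrategy="drop" so that rows the model
# cannot score are removed before computing RMSE (yields a finite number).
als_drop = ALS(maxIter=5,
               regParam=0.01,
               userCol="user_id",
               itemCol="book_id",
               ratingCol="rating",
               coldStartStrategy="drop")
model_drop = als_drop.fit(train_data)
rmse_drop = evaluator.evaluate(model_drop.transform(test_data))
print("Root-mean-square error = " + str(rmse_drop))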

Step 6: Recommendations

Now we will recommend books to a single user, user1 (say, user_id 5461), with the help of our trained model.




# Filtering the test data for user_id 5461 and keeping the book IDs this user has rated
user1 = test_data.filter(test_data['user_id']==5461).select(['book_id','user_id'])
  
#Displaying user1 data
user1.show()

Output:

+-------+-------+
|book_id|user_id|
+-------+-------+
|      1|   5461|
|     11|   5461|
|     19|   5461|
|     46|   5461|
|     60|   5461|
|     66|   5461|
|     93|   5461|
|    111|   5461|
|    121|   5461|
|    172|   5461|
|    194|   5461|
|    212|   5461|
|    222|   5461|
|    245|   5461|
|    264|   5461|
|    281|   5461|
|    301|   5461|
|    354|   5461|
|    388|   5461|
|    454|   5461|
+-------+-------+
only showing top 20 rows




# Generating predictions for user1 with the model trained on the training data
recommendations = model.transform(user1)
  
#Displaying the predictions of books for user1
recommendations.orderBy('prediction',ascending=False).show()

Output:

+-------+-------+----------+
|book_id|user_id|prediction|
+-------+-------+----------+
|     19|   5461| 5.3429904|
|     11|   5461|  4.830688|
|     66|   5461|  4.804107|
|    245|   5461|  4.705879|
|    388|   5461| 4.6276107|
|   1161|   5461|  4.612251|
|     60|   5461| 4.5895457|
|   1402|   5461|    4.5184|
|   1088|   5461|  4.454755|
|   5152|   5461|  4.415825|
|    121|   5461| 4.3423634|
|     93|   5461| 4.3357944|
|   1796|   5461|   4.30891|
|    172|   5461| 4.2679276|
|    454|   5461|  4.245925|
|   1211|   5461| 4.2431927|
|    731|   5461| 4.1873074|
|   1094|   5461| 4.1829815|
|    222|   5461|  4.182873|
|    264|   5461| 4.1469045|
+-------+-------+----------+
only showing top 20 rows

The output above shows the predicted ratings of books for the user with user_id 5461, sorted in descending order; the top rows are the books the model would recommend first.
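The approach above only re-scores the books that already appear in user 5461’s test rows. If instead you want the model’s top-N suggestions over the whole catalogue, the trained ALS model also provides helper methods such as recommendForUserSubset and recommendForAllUsers; a minimal sketch (the number 10 is just an example):

# Sketch: top-10 book recommendations for user 5461 over all books
single_user = spark.createDataFrame([(5461,)], ["user_id"])
top10 = model.recommendForUserSubset(single_user, 10)
top10.show(truncate=False)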

Step 7: Stop the Spark session




spark.stop()

