In this article, we will build a Tinder-style match-making recommender system. Most social media platforms have their own recommendation algorithms. Our system will recommend profiles to a user based on shared interests, aiming to surface the profiles the user is most likely to find interesting and want to connect with. We will build the project from scratch, following the steps below.
Importing Libraries
We will import all the libraries in one place so that we don't have to import packages every time we use them. This practice saves time and keeps the imports easy to track.
- NumPy – a Python library for numerical computation and handling multidimensional ndarrays; it also offers a large collection of mathematical functions that operate on these arrays.
- Pandas – a Python library built on top of NumPy for efficient dataframe manipulation; it is also used for data cleaning, merging, reshaping, and aggregation.
- Matplotlib – used for plotting 2D and 3D visualizations; it supports a variety of output formats.
- Seaborn – built on top of Matplotlib; used for drawing attractive statistical plots.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid",
              {"grid.color": ".6",
               "grid.linestyle": ":"})

import category_encoders as ce
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
We will use the pandas read_csv() function to read our CSV file. You can download the dataset used in this article for demonstration purposes from here.
Python3
tinder_df = pd.read_csv("data.csv")
After executing this function, the dataset will be stored as a dataframe in the tinder_df variable. We can view the first five rows of the dataframe using tinder_df.head().
Exploratory Data Analysis of the Dataset
In exploratory data analysis (EDA), we try to extract essential pieces of information from the dataframe. EDA is one of the most time-consuming parts of a data science project: roughly 75% of our work will go into it. However, as we will see, the effort pays off in the end.
We will first check the dimensions of our dataset using the pandas shape attribute. It returns a tuple containing the number of rows and columns.
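As a quick sketch (on a small stand-in frame, since shape behaves identically on any DataFrame; note that shape is an attribute rather than a method):

```python
import pandas as pd

# a small stand-in frame for illustration; on the real dataset
# tinder_df.shape returns (2001, 22)
demo_df = pd.DataFrame({"age": [22, 35, 41], "sex": ["f", "m", "m"]})

print(demo_df.shape)  # → (3, 2): 3 rows, 2 columns
```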
Output:
(2001, 22)
Next, we will use the pandas info() function to inspect the dataset. It reports the dtype and non-null count of every column.
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 user_id 2001 non-null object
1 username 2001 non-null object
2 age 2001 non-null int64
3 status 2001 non-null object
4 sex 2001 non-null object
5 orientation 2001 non-null object
6 drinks 2001 non-null object
7 drugs 2001 non-null object
8 height 2001 non-null float64
9 job 2001 non-null object
10 location 2001 non-null object
11 pets 2001 non-null object
12 smokes 2001 non-null object
13 language 2001 non-null object
14 new_languages 2001 non-null object
15 body_profile 2001 non-null object
16 education_level 2001 non-null float64
17 dropped_out 2001 non-null object
18 bio 2001 non-null object
19 interests 2001 non-null object
20 other_interests 2001 non-null object
21 location_preference 2001 non-null object
dtypes: float64(2), int64(1), object(19)
memory usage: 344.0+ KB
The output shows that the dataset has 2 float64 columns, 1 int64 column, and 19 object columns. To see the number of unique elements in each column, we will use the pandas nunique() function.
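The call itself is simply tinder_df.nunique(); a minimal sketch on a small stand-in frame:

```python
import pandas as pd

# nunique() counts the distinct values in every column
demo_df = pd.DataFrame({"sex": ["f", "m", "m", "f"],
                        "drugs": ["never", "often", "never", "sometimes"]})

print(demo_df.nunique())  # sex → 2, drugs → 3
```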
Output:
user_id 2001
username 1995
age 52
status 4
sex 2
orientation 3
drinks 6
drugs 3
height 25
job 21
location 70
pets 15
smokes 5
language 575
new_languages 3
body_profile 12
education_level 5
dropped_out 2
bio 2001
interests 31
other_interests 31
location_preference 3
dtype: int64
Data Wrangling
In data wrangling, we process and transform the data to get the most useful and better-structured form out of it. To divide and summarize the dataset based on column categories, we will use the pandas groupby() method.
Python3
tinder_df.groupby(['sex', 'drugs'])['drugs'] \
    .count() \
    .reset_index(name='unique_drug_count')
Output:
sex drugs unique_drug_count
0 f never 711
1 f often 5
2 f sometimes 146
3 m never 875
4 m often 13
5 m sometimes 251
We can also group people based on their interest in learning new languages and whether they dropped out of college.
Python3
tinder_df.groupby(['new_languages', 'dropped_out'])['dropped_out'] \
    .count() \
    .reset_index(name='drop_out_people count')
Output:
new_languages dropped_out drop_out_people count
0 interested no 594
1 interested yes 39
2 not interested no 999
3 not interested yes 51
4 somewhat interested no 305
5 somewhat interested yes 13
Data Visualization
Data visualization is an important part of storytelling. We make plots with Python libraries to demonstrate the patterns that the columns reveal.
Python3
sns.histplot(tinder_df["age"], kde=True)
Output:
Histplot of age using seaborn
The age column has a long right tail, showing that it deviates from a normal distribution. Later, we will apply a transformation to this column to bring it closer to a normal distribution. Next, we will plot a histogram of the height column.
Python3
sns.histplot(tinder_df["height"], kde=True)
Output:
Histplot of the height column using seaborn
We can also plot a pie chart for numerical data to see the percentage contribution of each range. For example, we may want to know the percentage of Tinder users in each age range. We will use the pandas cut() function to create bins for the numerical data.
Python3
plt.figure(figsize=(6, 6))
bins = [18, 30, 40, 50, 60, 70]
categories = pd.cut(tinder_df["age"], bins,
                    labels=["18-30", "30-40",
                            "40-50", "50-60", "60-70"])
counts = categories.value_counts()
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show()
Output:
Pie chart for the percentage of age distribution
We can use the histplot function from Seaborn to create a graph that shows the count of people in each job.
Python3
plt.figure(figsize=(6, 6))
sns.histplot(x="job", data=tinder_df,
             color="coral")
plt.xticks(rotation=90)
plt.title("Distribution of job of each candidate",
          fontsize=14)
plt.xlabel("Job id", fontsize=12)
plt.ylabel("Count of people", fontsize=12)
plt.show()
Output:
Count of people in a particular job using histplot
Data Manipulation
In data manipulation, we transform the elements of the dataset to prepare it for modeling. We saw earlier that the numerical age column has a long right tail, i.e., it is right-skewed. Hence we will apply a log transformation to this column to make it approximately normally distributed.
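A minimal sketch of the log transformation, using a hypothetical right-skewed age sample in place of the real column:

```python
import numpy as np
import pandas as pd

# hypothetical right-skewed sample standing in for tinder_df["age"]
ages = pd.Series([19, 21, 22, 24, 25, 27, 30, 34, 41, 68])

# np.log compresses the long right tail, reducing the skewness
log_ages = np.log(ages)
print(ages.skew(), log_ages.skew())  # skewness drops after the transform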
To encode data from categorical object Dtype into numerical data we will use 3 types of encodings.
- One-Hot encoding – used when a column has multiple unordered categories.
- Label encoding – used when a column has very few categories.
- Binary encoding – similar to one-hot encoding, but it creates fewer new columns by encoding each category into binary digits.
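The three encodings can be illustrated on a toy column (the pet categories here are hypothetical; binary encoding itself comes from the category_encoders package used later, so it is only described in a comment):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

pets = pd.Series(["cat", "dog", "fish", "dog"])

# label encoding: one integer per category (classes are sorted alphabetically)
labels = LabelEncoder().fit_transform(pets)
print(labels)  # → [0 1 2 1]

# one-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(pets)
print(one_hot.shape)  # → (4, 3)

# binary encoding would instead write each category's integer code
# in binary, needing only 2 digit-columns for these 3 categories
```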
Several transformations are available for decreasing skewness, such as the inverse, square root, and log transformations. Which one to choose depends on the column's skewness.
We will handle each categorical variable and convert it into a corresponding numerical column.
Python3
tinder_df['language'].str.contains('english').unique()
Output:
array([ True])
Since there are 575 unique values in the language column and every row includes English, one-hot encoding this column would create a very sparse matrix. Instead, we will create a column that counts the number of languages a person knows, and then drop the language column.
Python3
tinder_df['num_languages'] = tinder_df['language'].str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)
To encode location preference, we will assign a weight to each preference: 'anywhere' gets the lowest weight (1.0) and 'same city' the highest (2.5).
Python3
place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}

tinder_df['location_preference'] = tinder_df['location_preference'] \
    .apply(lambda x: place_type_strength[x])
We can easily handle columns that have only two unique categorical values by label encoding.
Python3
two_unique_values_column = {
    'sex': {'f': 1, 'm': 0},
    'dropped_out': {'no': 0, 'yes': 1}
}

tinder_df.replace(two_unique_values_column, inplace=True)
We will divide the four distinct status values into two groups:
- single or available
- married or seeing someone
A higher weight is given to people who are single or available.
Python3
status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}

tinder_df['status'] = tinder_df['status'] \
    .apply(lambda x: status_type_strength[x])
Since orientation is nominal categorical data, we will label encode it. The column is then dropped, as it is not part of the final feature set used for recommendations.
Python3
orientation_encoder = LabelEncoder()
orientation_encoder.fit(tinder_df['orientation'])
tinder_df['orientation'] = orientation_encoder \
    .transform(tinder_df['orientation'])
tinder_df.drop("orientation", axis=1, inplace=True)
The drinks column has 6 unique values, but we can group them into three broader categories before encoding. We then label encode drinks and drugs with a single encoder fitted on both columns.
Python3
drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = tinder_df['drinks'] \
    .apply(lambda x: drinking_habit[x])

habit_encoder = LabelEncoder()
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))
tinder_df['drinks_encoded'] = \
    habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = \
    habit_encoder.transform(tinder_df['drugs'])
tinder_df.drop(["drinks", "drugs"], axis=1, inplace=True)
The location column has 70 unique values; one-hot encoding it directly would create 70 extra columns. Instead, we will use geographical knowledge to group the data into broader regions.
Python3
region_dict = {'southern_california': ['los angeles',
                                       'san diego', 'hacienda heights',
                                       'north hollywood', 'phoenix'],
               'new_york': ['brooklyn', 'new york']}

def get_region(city):
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    return "northern_california"

tinder_df['location'] = tinder_df['location'] \
    .str.split(', ').str[0].apply(get_region)

location_encoder = OneHotEncoder()
location_encoded = location_encoder \
    .fit_transform(tinder_df[['location']])
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']))
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)
tinder_df.drop(["location"], axis=1, inplace=True)
Job is an important part of individual identity, so we cannot drop this column, and it cannot be generalized into broader categories; here we will label encode it.
Python3
job_encoder = LabelEncoder()
job_encoder.fit(tinder_df['job'])
tinder_df['job_encoded'] = job_encoder \
    .transform(tinder_df['job'])
tinder_df.drop('job', axis=1, inplace=True)
The smokes column has 5 distinct values, which we will collapse into just two: either a person smokes or they do not.
Python3
smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}

tinder_df['smokes'] = tinder_df['smokes'] \
    .apply(lambda x: smokes[x])
For the pets column, we will do Binary encoding.
Python3
bin_enc = ce.BinaryEncoder(cols=['pets'])
pet_enc = bin_enc.fit_transform(tinder_df['pets'])
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)
tinder_df.drop("pets", axis=1, inplace=True)
For the new_languages and body_profile columns, we will simply do label encoding.
Python3
new_language_encoder = LabelEncoder()
new_language_encoder.fit(tinder_df['new_languages'])
tinder_df['new_languages'] = new_language_encoder.transform(
    tinder_df['new_languages'])

body_profile_encoder = LabelEncoder()
tinder_df["body_profile"] = body_profile_encoder \
    .fit_transform(tinder_df["body_profile"])
Data Modelling
In data modeling, we first use TfidfVectorizer from scikit-learn to convert the bio column (object dtype) into numerical features. Note that the output of TfidfVectorizer is a sparse matrix, so we will use SVD (Singular Value Decomposition) to reduce its dimensionality.
To find how well a new user matches our stored profiles, we will use cosine similarity between the user's feature vector and each stored profile's vector.
This is a content-based filtering algorithm: we use the user's profile information to recommend other profiles with similar characteristics. The algorithm recommends the profiles with the highest cosine similarity score to the user.
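At its core, the cosine similarity used below is just the dot product of two vectors divided by the product of their norms; a minimal illustration on two toy profile vectors:

```python
import numpy as np
from numpy.linalg import norm

# cos(u, v) = u·v / (|u| |v|): 1 means same direction, 0 means orthogonal
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 4.0, 0.0])

print(np.dot(u, v) / (norm(u) * norm(v)))  # ≈ 1.0: v is a scaled copy of u
```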
Python3
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(tinder_df['bio'])

# get_feature_names_out() returns the vocabulary terms in column order,
# which is what the DataFrame columns need
feature_names = tfidf.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names)

tinder_dfs = tinder_df.drop(["bio", "user_id",
                             "username"], axis=1)
tinder_dfs = pd.concat([tinder_dfs, tfidf_df], axis=1)

svd = TruncatedSVD(n_components=100)
svd_matrix = svd.fit_transform(tinder_dfs)
cosine_sim = cosine_similarity(svd_matrix)
Model Prediction
To get recommendations for a new user, we will define a recommend function.
Python3
def recommend(user_df, num_recommendations=5):
    user_svd_matrix = svd.transform(user_df)
    cosine_sim_new = cosine_similarity(user_svd_matrix, svd_matrix)
    sim_scores = list(enumerate(cosine_sim_new[0]))
    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1], reverse=True)
    sim_indices = [i[0] for i in
                   sim_scores[1:num_recommendations + 1]]
    return tinder_df['username'].iloc[sim_indices]
Next, we will take input from the user and convert it into a dataframe so that we can use this information to make new predictions.
Python3
user_df = {}
user_df['age'] = float(input("Enter age: "))
user_df['status'] = float(input("Enter status: "))
user_df['sex'] = float(input("Enter sex (0 for female, 1 for male): "))
user_df['height'] = float(input("Enter height in inches: "))
user_df['smokes'] = float(input("Enter smokes (0 for no, 1 for yes): "))
user_df['new_languages'] = float(
    input("Enter number of new languages learned: "))
user_df['body_profile'] = float(input("Enter body profile (0-1): "))
user_df['education_level'] = float(input("Enter education level (1-5): "))
user_df['dropped_out'] = float(
    input("Enter dropped out (0 for no, 1 for yes): "))
user_df['bio'] = [input("Enter bio: ")]
user_df['location_preference'] = float(
    input("Enter location preference (0-2): "))
user_df['num_languages'] = float(input("Enter number of languages known: "))
user_df['drinks_encoded'] = float(input("Enter drinks encoded (0-3): "))
user_df['drugs_encoded'] = float(input("Enter drugs encoded (0-2): "))
user_df['location_new_york'] = float(
    input("Enter location_new_york (0 or 1): "))
user_df['location_northern_california'] = float(
    input("Enter location_northern_california (0 or 1): "))
user_df['location_southern_california'] = float(
    input("Enter location_southern_california (0 or 1): "))
user_df['job_encoded'] = float(input("Enter job encoded (0-9): "))
user_df['pets_0'] = float(input("Enter pets_0 (0 or 1): "))
user_df['pets_1'] = float(input("Enter pets_1 (0 or 1): "))
user_df['pets_2'] = float(input("Enter pets_2 (0 or 1): "))
user_df['pets_3'] = float(input("Enter pets_3 (0 or 1): "))

tfidf_df = pd.DataFrame(tfidf.transform(
    user_df['bio']).toarray(), columns=feature_names)
user_df = pd.DataFrame(user_df, index=[0])
user_df.drop("bio", axis=1, inplace=True)
user_df = pd.concat([user_df, tfidf_df], axis=1)
Output:
Enter age: 22
Enter status: 1
Enter sex (0 for female, 1 for male): 1
Enter height in inches: 60
Enter smokes (0 for no, 1 for yes): 0
Enter number of new languages learned: 2
Enter body profile (0-1): 1
Enter education level (1-5): 4
Enter dropped out (0 for no, 1 for yes): 1
Enter bio: I am a foodie and traveller. But sometimes like to sit alone in a
corner and read a good fiction.
Enter location preference (0-2): 2
Enter number of languages known: 2
Enter drinks encoded (0-3): 0
Enter drugs encoded (0-2): 0
Enter location_new_york (0 or 1): 0
Enter location_northern_california (0 or 1): 1
Enter location_southern_california (0 or 1): 0
Enter job encoded (0-9): 4
Enter pets_0 (0 or 1): 0
Enter pets_1 (0 or 1): 0
Enter pets_2 (0 or 1): 0
Enter pets_3 (0 or 1): 0
Call the function to print the recommended users.
Python3
print(recommend(user_df))
Output:
23 Ronald Millwood
550 Terry Ostrov
1685 Thomas Moran
1044 Travis Pergande
241 Carol Valente
Name: username, dtype: object
This is a very basic content-based recommender system, but there are many deep-learning-based models that work really well on real-world datasets.