Predict Tinder Matches with Machine Learning


In this article, we are going to build a Tinder-style match-making recommender system. Most social media platforms have their own recommender algorithms. In our project, we will build a recommender that suggests profiles to a user based on similar interests, aiming to surface the profiles the user is most likely to find interesting and want to connect with. We will build this project from scratch, following these steps:

Importing Libraries

We will import all the libraries in one place so that we don't have to import packages every time we need them. This practice saves time and keeps the code organized.

  • NumPy – a Python library for numerical computation and handling multidimensional ndarrays; it also offers a very large collection of mathematical functions that operate on these arrays.
  • Pandas – a Python library built on top of NumPy for dataframe manipulation; it is also used for data cleaning, merging, reshaping, and aggregation.
  • Matplotlib – used for plotting 2D and 3D visualizations; it also supports a variety of output formats.
  • Seaborn – built on top of Matplotlib; it is used for making attractive statistical plots with less code.

Python3




import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
sns.set_style("darkgrid",
              {"grid.color": ".6",
               "grid.linestyle": ":"})
import category_encoders as ce
# LabelEncoder and OneHotEncoder are used in the
# encoding steps later in the article
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


We will use the pandas read_csv() function to read our CSV file. You can download the dataset used in this article for demonstration purposes from here.

Python3




# reading the dataset using pandas
tinder_df = pd.read_csv("data.csv")


After executing this function, the dataset will be stored as a dataframe in the tinder_df variable. We can view its first five rows using the head() method:
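
Python3


# preview the first five rows of the dataframe
tinder_df.head()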

Exploratory Data Analysis of the Dataset

In exploratory data analysis (EDA), we try to extract essential information from the dataframe. EDA is considered one of the most time-consuming parts of a data science project; roughly 75% of our work will go into exploring the dataset. However, as we will see, the effort pays off in the end.

We will first check the dimensions of our dataset using the pandas shape attribute. Its value is a tuple containing the number of rows and columns.

Python3




# shape of the dataset
print(tinder_df.shape)


Output:

(2001, 22)

Next, we will use the pandas info() method to see information about the dataset. It reports the Dtype and non-null count of every column.

Python3




# information about the dataset
tinder_df.info()


Output :

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              2001 non-null   object 
 1   username             2001 non-null   object 
 2   age                  2001 non-null   int64  
 3   status               2001 non-null   object 
 4   sex                  2001 non-null   object 
 5   orientation          2001 non-null   object 
 6   drinks               2001 non-null   object 
 7   drugs                2001 non-null   object 
 8   height               2001 non-null   float64
 9   job                  2001 non-null   object 
 10  location             2001 non-null   object 
 11  pets                 2001 non-null   object 
 12  smokes               2001 non-null   object 
 13  language             2001 non-null   object 
 14  new_languages        2001 non-null   object 
 15  body_profile         2001 non-null   object 
 16  education_level      2001 non-null   float64
 17  dropped_out          2001 non-null   object 
 18  bio                  2001 non-null   object 
 19  interests            2001 non-null   object 
 20  other_interests      2001 non-null   object 
 21  location_preference  2001 non-null   object 
dtypes: float64(2), int64(1), object(19)
memory usage: 344.0+ KB

The output shows that the dataset has 2 float64 columns, 1 int64 column, and 19 object columns. To see the number of unique elements in each column, we will use the pandas nunique() method.

Python3




# Number of unique element in the columns
tinder_df.nunique()


Output:

user_id                2001
username               1995
age                      52
status                    4
sex                       2
orientation               3
drinks                    6
drugs                     3
height                   25
job                      21
location                 70
pets                     15
smokes                    5
language                575
new_languages             3
body_profile             12
education_level           5
dropped_out               2
bio                    2001
interests                31
other_interests          31
location_preference       3
dtype: int64

Data Wrangling

In data wrangling, we process and transform the data to get the most useful and better-structured form out of it. To divide and summarize our dataset based on column categories, we will use the pandas groupby() method.

Python3




tinder_df.groupby(['sex', 'drugs'])['drugs'] \
    .count() \
    .reset_index(name='unique_drug_count')


Output:

    sex    drugs    unique_drug_count
0    f    never        711
1    f    often        5
2    f    sometimes    146
3    m    never        875
4    m    often        13
5    m    sometimes    251

We can also group people based on their interest in learning new languages and whether they dropped out of college.

Python3




tinder_df.groupby(['new_languages', 'dropped_out']) \
            ['dropped_out'].count(). \
            reset_index(name='drop_out_people count')


Output:

new_languages    dropped_out    drop_out_people count
0    interested            no                 594
1    interested            yes                 39
2    not interested        no                 999
3    not interested        yes                 51
4    somewhat interested    no                 305
5    somewhat interested    yes                 13

Data Visualization

Data visualization is an important part of storytelling. In data visualization, we build plots with Python libraries to reveal the patterns hidden in the columns.

Python3




# distribution of age
sns.histplot(tinder_df["age"], kde=True)


Output:

Histplot of age using seaborn

The age column has a long tail, which shows a deviation from a normal distribution. Later we will apply a transformation to this column to bring it closer to a normal distribution. Next, we will plot a histogram of the height column.

Python3




# Distribution of height
sns.histplot(tinder_df["height"], kde=True)


Output:

Histplot of the height column using seaborn

We can also plot a pie chart of numerical data to see the percentage contribution of each range. For example, we may want to know the percentage of Tinder users in each age bracket. We will use the pandas cut() function to bin the numerical data.

Python3




# Set the size of the figure to
# 6 inches by 6 inches
plt.figure(figsize=(6, 6))
  
# Divide the data into categories
bins = [18, 30, 40, 50, 60, 70]
  
# Use the `cut` function to assign
# each data point to a category
categories = pd.cut(tinder_df["age"], bins,
                    labels=["18-30", "30-40",
                            "40-50", "50-60", "60-70"])
  
# Count the number of data points in each category
counts = categories.value_counts()
  
# Plot the data as a pie chart
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show()


Output:

Pie chart of the percentage age distribution

We can use Seaborn's histplot function to create a graph that shows the count of people in each job.

Python3




plt.figure(figsize=(6, 6))
sns.histplot(x="job", data=tinder_df,
             color="coral")
  
# rotate x-axis labels vertically
plt.xticks(rotation=90)
plt.title("Distribution of job of each candidate",
          fontsize=14)
  
plt.xlabel("Job id", fontsize=12)
plt.ylabel("Count of people", fontsize=12)
  
plt.show()


Output:

Count of people in each job using histplot

Data Manipulation

In data manipulation, we transform the elements of the dataset to prepare it for modeling. We saw earlier that the numerical age column has a long right tail, i.e., it is right-skewed. Hence we will apply a log transformation to this column to bring its distribution closer to normal.

To encode the categorical (object Dtype) columns into numerical data, we will use three types of encoding:

  1. One-hot encoding – used for nominal columns with multiple categories; each category becomes its own binary column.
  2. Label encoding – used when a column has only a few categories; each category is mapped to an integer.
  3. Binary encoding – similar to one-hot encoding, but it creates fewer new columns by encoding each category into binary digits. A toy comparison of all three follows.
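
As a quick illustration of how these three encoders differ, here is a toy sketch on a made-up pet column (illustration only, not part of the Tinder pipeline):

Python3


import pandas as pd
from sklearn.preprocessing import LabelEncoder
import category_encoders as ce

toy = pd.DataFrame({'pet': ['cat', 'dog', 'fish', 'dog']})

# one-hot: one binary column per category (3 columns here)
print(pd.get_dummies(toy['pet']))

# label encoding: a single integer column (cat=0, dog=1, fish=2)
print(LabelEncoder().fit_transform(toy['pet']))

# binary encoding: categories as binary digits (2 columns here)
print(ce.BinaryEncoder(cols=['pet']).fit_transform(toy['pet']))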

There are several transformations available for decreasing skewness, such as the inverse, square-root, and log transformations. Which one is the right choice depends on the column and its skewness.
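
As a minimal sketch (the pipeline below keeps age on its original scale), a log transformation of the right-skewed age column could look like this:

Python3


# np.log1p compresses the long right tail of age
log_age = np.log1p(tinder_df['age'])

# re-plot to check the distribution is closer to normal
sns.histplot(log_age, kde=True)
plt.show()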

We will now go through each remaining column and convert it into a corresponding numerical feature.

Python3




# check if every row has a
# common language as english
tinder_df['language'].str.contains('english')\
    .unique()


Output:

array([ True])

Since there are 575 unique values in the language column and every row includes English, one-hot encoding this column would create a very sparse matrix. Instead, we will create a column that counts the number of languages a person knows, and then drop the language column.

Python3




# count the number of languages in each row
tinder_df['num_languages'] = tinder_df['language']\
    .str.count(',') + 1
tinder_df.drop(["language"], axis=1, inplace=True)


To encode location preference we will assign a number to each preference level: 'anywhere' gets the lowest weight of 1.0 and 'same city' the highest weight of 2.5.

Python3




place_type_strength = {
    'anywhere': 1.0,
    'same state': 2.0,
    'same city': 2.5
}
  
tinder_df['location_preference'] = \
    tinder_df['location_preference']\
    .apply(lambda x: place_type_strength[x])


We can easily handle columns that have only two unique categorical values by label encoding.

Python3




two_unique_values_column = {
    'sex': {'f': 1, 'm': 0},
    'dropped_out': {'no': 0, 'yes': 1}
}
  
tinder_df.replace(two_unique_values_column,
                  inplace=True)


We will divide the four distinct status values into two groups:

  1. single or available
  2. seeing someone or married

A higher weight is given to people who are single or available.

Python3




status_type_strength = {
    'single': 2.0,
    'available': 2.0,
    'seeing someone': 1.0,
    'married': 1.0
}
tinder_df['status'] = tinder_df['status']\
    .apply(lambda x:
           status_type_strength[x])


Orientation is nominal categorical data, so we encode it with a LabelEncoder. Note that the snippet then drops the column, so orientation does not appear in the final feature matrix.

Python3




# create a LabelEncoder object
orientation_encoder = LabelEncoder()
  
# fit the encoder on the orientation column
orientation_encoder.fit(tinder_df['orientation'])
  
# encode the orientation column using the fitted encoder
tinder_df['orientation'] = orientation_encoder.\
    transform(tinder_df['orientation'])
  
# drop the orientation column; it will not be
# part of the final feature matrix
tinder_df.drop("orientation", axis=1, inplace=True)


The drinks column has 6 unique values; however, we can group them into three broader categories. We then label-encode the drinks and drugs columns with a single shared encoder, so that the same habit gets the same code in both columns.

Python3




drinking_habit = {
    'socially': 'sometimes',
    'rarely': 'sometimes',
    'not at all': 'do not drink',
    'often': 'drinks often',
    'very often': 'drinks often',
    'desperately': 'drinks often'
}
tinder_df['drinks'] = tinder_df['drinks']\
    .apply(lambda x:
           drinking_habit[x])
# create a LabelEncoder object
habit_encoder = LabelEncoder()
  
# fit the encoder on the drinks and drugs columns
habit_encoder.fit(tinder_df[['drinks', 'drugs']]
                  .values.reshape(-1))
  
# encode the drinks and drugs columns
# using the fitted encoder
tinder_df['drinks_encoded'] = \
    habit_encoder.transform(tinder_df['drinks'])
tinder_df['drugs_encoded'] = \
    habit_encoder.transform(tinder_df['drugs'])
  
# Drop the existing drink and drugs column
tinder_df.drop(["drinks", "drugs"], axis=1,
               inplace=True)


The location column has 70 unique values; one-hot encoding it directly would create 70 new columns, so here we will use our geographical knowledge to group the data into broader regions.

Python3




region_dict = {'southern_california': ['los angeles',
                                       'san diego', 'hacienda heights',
                                       'north hollywood', 'phoenix'],
               'new_york': ['brooklyn', 'new york']}


def get_region(city):
    for region, cities in region_dict.items():
        if city.lower() in [c.lower() for c in cities]:
            return region
    return "northern_california"


tinder_df['location'] = tinder_df['location']\
    .str.split(', ').str[0].apply(get_region)

# perform one-hot encoding
location_encoder = OneHotEncoder()

# fit and transform the location column
location_encoded = location_encoder.fit_transform(
    tinder_df[['location']])

# create a new DataFrame with the encoded columns
location_encoded_df = pd.DataFrame(
    location_encoded.toarray(),
    columns=location_encoder.get_feature_names_out(['location']))

# concatenate the new DataFrame with the original DataFrame
tinder_df = pd.concat([tinder_df, location_encoded_df], axis=1)

# drop the existing location column
tinder_df.drop(["location"], axis=1, inplace=True)


Job is an important part of individual identity, so we cannot drop this column; we also cannot generalize the jobs into broader categories, so here we will label-encode it.

Python3




# create a LabelEncoder object
job_encoder = LabelEncoder()
  
# fit the encoder on the job column
job_encoder.fit(tinder_df['job'])
  
# encode the job column using the fitted encoder
tinder_df['job_encoded'] = job_encoder.\
    transform(tinder_df['job'])
  
# drop the original job column
tinder_df.drop('job', axis=1, inplace=True)


The smokes column has five distinct values; we will reduce them to a binary flag: either the person smokes or they do not.

Python3




smokes = {
    'no': 1.0,
    'sometimes': 0,
    'yes': 0,
    'when drinking': 0,
    'trying to quit': 0
}
tinder_df['smokes'] = tinder_df['smokes']\
    .apply(lambda x: smokes[x])


For the pets column, we will do Binary encoding.

Python3




bin_enc = ce.BinaryEncoder(cols=['pets'])
  
# fit and transform the pet column
pet_enc = bin_enc.fit_transform(tinder_df['pets'])
  
# add the encoded columns to the original dataframe
tinder_df = pd.concat([tinder_df, pet_enc], axis=1)
  
tinder_df.drop("pets",axis=1,inplace = True)


For the new_languages and body_profile columns, we will simply use label encoding.

Python3




# create a LabelEncoder object
new_lang_encoder = LabelEncoder()
  
# fit the encoder on the new_languages column
new_lang_encoder.fit(tinder_df['new_languages'])
  
# encode the new_languages column using the fitted encoder
tinder_df['new_languages'] = new_lang_encoder.transform(
    tinder_df['new_languages'])
  
# create an instance of LabelEncoder
le = LabelEncoder()
  
# encode the body_profile column
tinder_df["body_profile"] = le.fit_transform(tinder_df["body_profile"])


Data Modelling 

In data modeling, we first use TfidfVectorizer from scikit-learn to convert the bio text column into numerical features. Note that the output of the TfidfVectorizer is a sparse matrix, so we use SVD (Singular Value Decomposition) to reduce its dimensionality.

To find the match between a new user and the profiles we already have, we will compute the cosine similarity between the user and each stored profile.

This is a content-based filtering algorithm in which we are using the user’s profile information to recommend other profiles with similar characteristics. This algorithm recommends the profiles which have the highest cosine similarity score with the user.
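
To make the similarity score concrete, here is a tiny sketch with two hand-made vectors (illustration only; the real pipeline computes similarity on the SVD-reduced profile matrix):

Python3


import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 0.0, 2.0]])
b = np.array([[1.0, 1.0, 2.0]])

# cos(theta) = a.b / (|a| * |b|); 1.0 means identical direction
print(cosine_similarity(a, b))  # ~0.91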

Python3




# Initialize TfidfVectorizer object
tfidf = TfidfVectorizer(stop_words='english')
  
# Fit and transform the text data
tfidf_matrix = tfidf.fit_transform(tinder_df['bio'])
  
# Get the feature names from the TfidfVectorizer object
feature_names = tfidf.get_feature_names_out()
  
# Convert tfidf matrix to DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(),
                        columns=feature_names)
  
# Add non-text features to the tfidf_df dataframe
tinder_dfs = tinder_df.drop(["bio", "user_id",
                             "username"], axis=1)
tinder_dfs = pd.concat([tinder_dfs,
                        tfidf_df], axis=1)
# Apply SVD to the feature matrix
svd = TruncatedSVD(n_components=100)
svd_matrix = svd.fit_transform(tinder_dfs)
  
# Calculate the cosine similarity
# between all pairs of users
cosine_sim = cosine_similarity(svd_matrix)


Model Prediction

To get recommendations for a new user, we will define a recommend() function.

Python3




def recommend(user_df, num_recommendations=5):
  
    # Apply SVD to the feature
    # matrix of the user_df dataframe
    svd_matrixs = svd.transform(user_df)
  
    # Calculate the cosine similarity
    # between the user_df and training set users
    cosine_sim_new = cosine_similarity(svd_matrixs, svd_matrix)
  
    # Sort users by similarity; skip the closest match
    # (in case the query profile already exists in the
    # training data) and keep the next num_recommendations
    sim_scores = list(enumerate(cosine_sim_new[0]))
    sim_scores = sorted(sim_scores,
                        key=lambda x: x[1], reverse=True)
    sim_indices = [i[0] for i in
                   sim_scores[1:num_recommendations+1]]
  
    # Return the user_ids of the recommended users
    return tinder_df['username'].iloc[sim_indices]


Next, we will take input from the user and convert it into a dataframe so that we can use this information to make new predictions.

Python3




user_df = {}
  
# Get user input for numerical columns
user_df['age'] = float(input("Enter age: "))
user_df['status'] = float(input("Enter status: "))
user_df['sex'] = float(input("Enter sex (0 for female, 1 for male): "))
user_df['height'] = float(input("Enter height in inches: "))
user_df['smokes'] = float(input("Enter smokes (0 for no, 1 for yes): "))
user_df['new_languages'] = float(
    input("Enter number of new languages learned: "))
user_df['body_profile'] = float(input("Enter body profile (0-1)"))
user_df['education_level'] = float(input("Enter education level (1-5): "))
user_df['dropped_out'] = float(
    input("Enter dropped out (0 for no, 1 for yes): "))
user_df['bio'] = [input("Enter bio: ")]
user_df['location_preference'] = float(
    input("Enter location preference (0-2): "))
user_df['num_languages'] = float(
    input("Enter number of languages known: "))
user_df['drinks_encoded'] = float(input("Enter drinks encoded (0-3): "))
user_df['drugs_encoded'] = float(input("Enter drugs encoded (0-2): "))
  
# Get user input for one-hot encoded categorical columns
user_df['location_new_york'] = float(
    input("Enter location_new_york (0 or 1): "))
user_df['location_northern_california'] = float(
    input("Enter location_northern_california (0 or 1): "))
user_df['location_southern_california'] = float(
    input("Enter location_southern_california (0 or 1): "))
user_df['job_encoded'] = float(input("Enter job encoded (0-9): "))
user_df['pets_0'] = float(input("Enter pets_0 (0 or 1): "))
user_df['pets_1'] = float(input("Enter pets_1 (0 or 1): "))
user_df['pets_2'] = float(input("Enter pets_2 (0 or 1): "))
user_df['pets_3'] = float(input("Enter pets_3 (0 or 1): "))
  
# Convert the bio text to tfidf features
tfidf_df = pd.DataFrame(tfidf.transform(
    user_df['bio']).toarray(), columns=feature_names)
  
# Convert the user input
# dictionary to a Pandas DataFrame
user_df = pd.DataFrame(user_df, index=[0])
user_df.drop("bio", axis=1, inplace=True)
user_df = pd.concat([user_df, tfidf_df], axis=1)


Output:

Enter age: 22
Enter status: 1
Enter sex (0 for female, 1 for male): 1
Enter height in inches: 60
Enter smokes (0 for no, 1 for yes): 0
Enter number of new languages learned: 2
Enter body profile (0-1)1
Enter education level (1-5): 4
Enter dropped out (0 for no, 1 for yes): 1
Enter bio: I am a foodie and traveller. But sometimes like to sit alone in a 
corner and read a good fiction.
Enter location preference (0-2): 2
Enter number of languages known: 2
Enter drinks encoded (0-3): 0
Enter drugs encoded (0-2): 0
Enter location_new_york (0 or 1): 0
Enter location_northern_california (0 or 1): 1
Enter location_southern_california (0 or 1): 0
Enter job encoded (0-9): 4
Enter pets_0 (0 or 1): 0
Enter pets_1 (0 or 1): 0
Enter pets_2 (0 or 1): 0
Enter pets_3 (0 or 1): 0

Finally, we call the function to print the recommended users.

Python3




print(recommend(user_df))


Output:

23      Ronald Millwood
550        Terry Ostrov
1685       Thomas Moran
1044    Travis Pergande
241       Carol Valente
Name: username, dtype: object

This is a very basic content-based recommender system, but there are also models based on deep learning that work really well when applied to real-world datasets.



Last Updated : 06 May, 2023