
Music Recommendation System Using Machine Learning

Last Updated : 01 Nov, 2022

Have you ever watched, say, a funny video on YouTube and noticed that the next time you open the app your feed is full of similar funny videos? Ever wondered how? This is an application of Machine Learning: recommender systems built this way provide a personalized experience and increase customer engagement.

In this article, we will build a very basic recommender system that suggests songs based on the songs you listen to.

Importing Libraries & Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.

  • Pandas – This library loads the data into a 2-D DataFrame and provides many functions to perform analysis tasks in one go.
  • Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
  • Matplotlib/Seaborn – These libraries are used to draw visualizations.
  • Sklearn – This module contains multiple libraries with pre-implemented functions for everything from data preprocessing to model development and evaluation.

Python3

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
 
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
 
import warnings
warnings.filterwarnings('ignore')

                    

The dataset we are going to use contains data about songs released over a span of around 100 years. Along with general information about each song, it also provides audio measures such as loudness, acousticness, speechiness, and so on.

Python3

tracks = pd.read_csv('tracks_records.csv')
tracks.head()

                    

Output:

First five rows of the dataset

Data Cleaning

Data cleaning is one of the most important steps; without it the data is of little use. Raw data contains a lot of noise that must be removed, otherwise the observations made from it will be inaccurate, and any model built on it will perform poorly as well. Typical data-cleaning steps include outlier removal, null-value imputation, and fixing the skewness of the data.
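
None of these steps is specific to this dataset. As a quick illustration (a minimal sketch only, assuming a generic DataFrame df with a hypothetical numeric column 'col'), they might look like this:

Python3

import numpy as np
import pandas as pd


def clean_column(df, col):
    # Null-value imputation: fill missing entries with the column median.
    df[col] = df[col].fillna(df[col].median())

    # Outlier removal: keep values within 3 standard deviations of the mean.
    mean, std = df[col].mean(), df[col].std()
    df = df[(df[col] - mean).abs() <= 3 * std]

    # Fixing skewness: log1p compresses a long right tail (non-negative data only).
    if df[col].min() >= 0:
        df[col] = np.log1p(df[col])
    return df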

Python3

tracks.shape

                    

Output:

(586672, 19)

Python3

tracks.info()

                    

Output:

Basic information about the columns of the dataset

Now, let's check whether there are null values in the columns of our data frame.

Python3

tracks.isnull().sum()

                    

Output:

Number of null values in each column

The genre of music is a very important indicator of the type of music, which is why we will remove the rows where it is null. We could have imputed them as well, but since we have a huge dataset of around 600,000 rows, removing about 50,000 of them won't affect the analysis much (this depends on the case).

Python3

tracks.dropna(inplace = True)
tracks.isnull().sum().plot.bar()
plt.show()

                    

Output:

After removing rows containing null values

Now let’s remove some columns which we won’t be using to build our recommender system.

Python3

tracks = tracks.drop(['id', 'id_artists'], axis = 1)

                    

Exploratory Data Analysis

EDA is an approach to analyzing the data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations.

Our dataset contains around 14 numerical columns, and we cannot visualize such high-dimensional data directly. To solve this problem, t-SNE comes to the rescue. t-SNE is an algorithm that maps high-dimensional data to a lower-dimensional space using a non-linear method; its inner workings are beyond the scope of this article.

Python3

# t-SNE works on numeric data, so select the numeric columns
# and take the first 500 rows to keep the computation fast.
numeric_data = tracks.select_dtypes(include=np.number).head(500)

model = TSNE(n_components=2, random_state=0)
tsne_data = model.fit_transform(numeric_data)

plt.figure(figsize=(7, 7))
plt.scatter(tsne_data[:, 0], tsne_data[:, 1])
plt.show()

                    

Output:

Scatter plot of the output of t-SNE

Here we can observe some clusters.

Formation of clusters in 2-D space

As we know, multiple versions of the same song are often released, so we need to remove these duplicates. We are building a content-based recommender system whose core is the cosine similarity function, and without deduplication our system would simply recommend other versions of the same song, which is not what we want.

Python3

tracks['name'].nunique(), tracks.shape

                    

Output:

(408902, (536847, 17))

So, our concern was justified. Let's remove the duplicate rows based on the song names, keeping the most popular version of each.

Python3

tracks = tracks.sort_values(by=['popularity'], ascending=False)
tracks.drop_duplicates(subset=['name'], keep='first', inplace=True)

                    

Let’s visualize the number of songs released each year.

Python3

plt.figure(figsize = (10, 5))
sb.countplot(tracks['release_year'])
plt.axis('off')
plt.show()

                    

Output:

Countplot of the number of songs in subsequent years

Here we can see a boom in the music industry from the year 1900 to somewhere around 1990.

Python3

# Collect the names of all columns that store float values.
floats = []
for col in tracks.columns:
  if tracks[col].dtype == 'float':
    floats.append(col)
 
len(floats)

                    

Output:

10

There are a total of 10 such columns with float values in them. Let's draw their distribution plots to get insights into how the data is distributed.

Python3

plt.subplots(figsize = (15, 5))
for i, col in enumerate(floats):
  plt.subplot(2, 5, i + 1)
  sb.distplot(tracks[col])
plt.tight_layout()
plt.show()

                    

Output:

Distribution plot of the continuous features

Some of the features follow a roughly normal distribution, while others are noticeably skewed.
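
If we want to quantify that skew rather than eyeball it, pandas provides a skew() method; a quick check (a small sketch, reusing the floats list from above) could look like this:

Python3

# Numeric check of skewness for the float columns
# (values far from 0 indicate a strongly skewed distribution).
tracks[floats].skew().sort_values(ascending=False)

Next, we convert the textual 'genres' column into numeric count vectors with a CountVectorizer so that it can be used in the similarity computation later.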

Python3

%%capture
# Learn a vocabulary over the 'genres' strings so that each song's
# genres can later be converted into a count vector.
song_vectorizer = CountVectorizer()
song_vectorizer.fit(tracks['genres'])
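
The CountVectorizer simply turns each genres string into a bag-of-words count vector. As a quick illustration (a standalone sketch with made-up genre strings, not rows from this dataset):

Python3

# Standalone sketch with made-up genre strings to show what the
# vectorizer produces; the real vocabulary comes from tracks['genres'].
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer()
vectors = demo.fit_transform(["dance pop", "pop rock", "acoustic rock"])
print(demo.get_feature_names_out())   # ['acoustic' 'dance' 'pop' 'rock']
print(vectors.toarray())              # one count vector per genre string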

                    

As the dataset is quite large, the computation cost/time would be too high, so we will demonstrate the recommender system using only the 10,000 most popular songs.

Python3

tracks = tracks.sort_values(by=['popularity'], ascending=False).head(10000)

                    

Below is a helper function that computes the similarity between the input song and every song in the dataset.

Python3

def get_similarities(song_name, data):
   
  # Getting vector for the input song.
  text_array1 = song_vectorizer.transform(data[data['name']==song_name]['genres']).toarray()
  num_array1 = data[data['name']==song_name].select_dtypes(include=np.number).to_numpy()
   
  # We will store similarity for each row of the dataset.
  sim = []
  for idx, row in data.iterrows():
    name = row['name']
     
    # Getting vector for current song.
    text_array2 = song_vectorizer.transform(data[data['name']==name]['genres']).toarray()
    num_array2 = data[data['name']==name].select_dtypes(include=np.number).to_numpy()
 
    # Calculating similarities for text as well as numeric features
    text_sim = cosine_similarity(text_array1, text_array2)[0][0]
    num_sim = cosine_similarity(num_array1, num_array2)[0][0]
    sim.append(text_sim + num_sim)
     
  return sim
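
One design note: the loop above re-runs the vectorizer and select_dtypes for every row, which is slow. A faster variant, shown here as a sketch only with a hypothetical helper name get_similarities_fast (assuming the same column names and that the row order of the DataFrame does not change), precomputes both matrices once and compares the input song against all rows in a single call:

Python3

def get_similarities_fast(song_name, data):
  # Precompute the genre and numeric matrices once instead of once per row.
  text_matrix = song_vectorizer.transform(data['genres']).toarray()
  num_matrix = data.select_dtypes(include=np.number).to_numpy()

  # Positional index of the input song within `data`.
  pos = int(np.where(data['name'].values == song_name)[0][0])

  # One cosine_similarity call compares the input song against every row.
  text_sim = cosine_similarity(text_matrix[pos:pos + 1], text_matrix)[0]
  num_sim = cosine_similarity(num_matrix[pos:pos + 1], num_matrix)[0]
  return list(text_sim + num_sim)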

                    

To calculate the similarity between two vectors, we use the cosine similarity measure:

\text{sim}(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}
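
As a quick sanity check (a standalone sketch with two tiny made-up vectors), we can verify that sklearn's cosine_similarity matches the formula computed by hand with NumPy:

Python3

# Verify the cosine similarity formula on two small made-up vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.array([[1.0, 2.0, 0.0]])
Y = np.array([[2.0, 1.0, 1.0]])

manual = (X @ Y.T) / (np.linalg.norm(X) * np.linalg.norm(Y))
print(manual[0][0])                        # ~0.7303
print(cosine_similarity(X, Y)[0][0])       # same value

With the similarity helper in place, we can put everything together in the main recommendation function.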

Python3

def recommend_songs(song_name, data=tracks):
  # Base case: the song is not present in the dataset.
  if data[data['name'] == song_name].shape[0] == 0:
    print('This song is either not so popular or you have entered an invalid name.\nSome songs you may like:\n')

    for song in data.sample(n=5)['name'].values:
      print(song)
    return

  data['similarity_factor'] = get_similarities(song_name, data)

  data.sort_values(by=['similarity_factor', 'popularity'],
                   ascending=[False, False],
                   inplace=True)

  # The input song itself will be at the top of the list (its similarity
  # is the highest), so we skip the top rows and display the next few.
  display(data[['name', 'artists']][2:7])

                    

Now, it's time to see the recommender system at work. Let's see which songs our system will recommend if a user listens to the famous song 'Shape of You'.

Python3

recommend_songs('Shape of You')

                    

Output:

Recommended songs if you hear 'Shape of You'

Let’s try this on one more song.

Python3

recommend_songs('Love Someone')

                    

Output:

Recommended songs if you hear 'Love Someone'

Shown below is what happens if the song name entered is not present in the dataset.

Python3

recommend_songs('Love me like you do')

                    

Output:

If the input song name is not in the dataset

Conclusion

This model would require many changes before it could be used in any real-world music app or website, but it gives an overview of how recommendation systems are built and used.


