Open In App

Visualising ML DataSet Through Seaborn Plots and Matplotlib

Last Updated : 08 Dec, 2021
Improve
Improve
Like Article
Like
Save
Share
Report

Working on data can sometimes be a bit boring. Transforming a raw data into an understandable format is one of the most essential part of the whole process, then why to just stick around on numbers, when we can visualize our data into mind-blowing graphs which are up for grabs in python. This article will focus on exploring plots which could make your preprocessing journey, intriguing.

Seaborn and Matplotlib provide us with numerous alluring graphs through which one can easily analyze weak points, explore data with a deeper understanding and eventually end up getting a great insight into data and gaining the highest accuracy after training it through different algorithms.

Let’s Have A Glance Through Our Dataset : The Dataset (36 rows) contains 6 Features And 2 Classes (Survived = 1, Not Survived = 0 ) Based on which we’ll plot certain graphs. Link of the dataset – Click Here To Get Complete Dataset

1. KDE PLOT : Okay So after having a glance through the dataset we can have a question. Which Age Group Has Maximum No. Of People? To answer this question we need visuals where Our KDE Plot comes into the picture, it is simply a density plot. So let’s start with importing required libraries and use its functions to plot the graph.

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
 
# KDE plot
sns.kdeplot(dataset["Age"], color = "green",
            shade = True)
plt.show()
plt.figure()


Output :

2. So now we have a clear picture of how the Count Of People vs Age-Group is distributed,  here we can see that the age group 20-40 has maximum count so let’s check it.

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Checking the count of Age Group 20-40
dataset.Age[(dataset["Age"] >= 20) & (dataset["Age"] <= 40)].count()


Output :

26

3. Digging deeper into visuals, to know about the variation in Fair Vs Age, what is the relation between them, let’s have a look using a different kind of kdeplot simply now there’ll be bivariate densities, we will just add the Y Variable(Fair).

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
sns.kdeplot(dataset["Age"], dataset["Fare"], shade = True)
plt.show()
plt.figure()


Output :

4. After Studying this plot a bit, we see that the intensity of the color is maximum between the age group 20-30 and precisely these have a fair between 100-200, let’s check it 

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Checking The Variation Between Fare And Age
dataset.Age[((dataset["Fare"] >= 100) &
             (dataset["Fare"]<=200)) &
            ((dataset["Age"]>=20) &
             dataset["Age"]<=40)].count()


Output :

16

5. We can also add a histogram to kdeplot just by using distplot() module of seaborn :

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Histogram+Density Plot
sns.distplot(dataset["Age"], color = "green")
plt.show()
plt.figure()


Output :

6. Well. If one wants to know about the Male Vs Female Proportion, We can plot the same in KDE itself :

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Adding Two Plots In One
sns.kdeplot(dataset[dataset.Gender == 'Female']['Age'],
            color = "blue")
sns.kdeplot(dataset[dataset.Gender == 'Male']['Age'],
            color = "orange", shade = True)
plt.show()
plt.figure()


Output :

7. As We can see from the plot there is an increase in the count after Age 12 till Age 40, let’s check for the same

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# showing that there are more Male's Between Age Of 12-40
dataset.Gender[((dataset["Age"] >= 12) &
                (dataset["Age"] <= 40)) &
               (dataset["Gender"] == "Male")].count()
dataset.Gender[((dataset["Age"] >= 12) &
                (dataset["Age"] <= 40)) &
               (dataset["Gender"] == "Female")].count()


Output : 

17
15

8. VIOLIN PLOT : We have talked much about the features, now let’s talk about Survival Rate Dependency On Features. For This, We will use a classic Violin Plot, as the name suggests it portrays the same visuals as that of the musical waves of a violin. Basically A Violin Plot is used to visualize the distribution of the data and its probability density. 

What is the Relation Between Survival Rate And Age? Let’s Visually Analyze It :

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
sns.violinplot(x = 'Survived', y = 'Age', data = dataset,
               palette = {0 : "yellow", 1 : "orange"});
plt.show()
plt.figure()


Output : 

Explanation : The white dot we see in the plot is median and thick black bar in the center represents the interquartile 
range.The thin black line extended from it represents the upper (max) and lower (min) adjacent values in the data.
A Quick glance show’s us that between Age[10-20] The Survival Rate is A bit higher(Survived==1).

9. Let’s plot one more for the Survival Rate Vs Gender and Age

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
sns.violinplot(x = "Gender", y = "Age", hue = "Survived",
               data = dataset,
               palette = {0 : "yellow", 1 : "orange"})
plt.show()
plt.figure()


Here an additional attribute is hue, which refers to the binary value for Survived.

Output :

10. CATPLOT : In simple terms, catplot shows frequencies (or optionally fractions or percents) of the categories of one, two, or three categorical variables. 

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Plot a nested barplot to show survival for Siblings and Gender
g = sns.catplot(x = "Siblings", y = "Survived", hue = "Gender", data = dataset,
                height = 6, kind = "bar", palette = "muted")
g.despine(lef t= True)
g.set_ylabels("Survival Probability")
plt.show()


Here sns.despine is used to remove the top and right spines from the plot, let’s have a look at it.

Output : 

Here We get a clear picture of Gender Wise Survival Probability w.r.t No. Of Siblings.

11. Now, in The Dataset We See There Are Three Categories in Ticket, Which is based on Fare, Let’s Find About It (Referring This Plot I Added A Category Column For Tickets)

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
# Based On Fare There Are 3 Types Of Tickets
sns.catplot(x = "PassType", y = "Fare", data = dataset)
plt.show()
plt.figure()


Output : 

Using This we concluded that categories should be defined for tickets

12. Relation of the same with Survival Rate :

Python3




# importing the modules and dataset
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = pd.read_csv("Survival.csv")
  
sns.catplot(x="PassType", y="Fare", hue="Survived",kind="swarm",data=dataset)
plt.show()
plt.figure()


Output : 

From this, we get a clear insight for Survival Rate Vs Fare w.r.t Category of Tickets.



Like Article
Suggest improvement
Share your thoughts in the comments

Similar Reads