
Analyzing Decision Tree and K-means Clustering using Iris dataset

The Iris dataset is one of the best-known datasets in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Attribute Information:

  1. Sepal Length in cm
  2. Sepal Width in cm
  3. Petal Length in cm
  4. Petal Width in cm
  5. Class:
    • Iris Setosa
    • Iris Versicolour
    • Iris Virginica

Let’s perform exploratory data analysis (EDA) on the dataset to get our initial investigation right.



Importing Libraries and Dataset

Python libraries make it very easy for us to handle the data and perform typical and complex tasks with a single line of code.




import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree

from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

Now let’s load the dataset. We use seaborn’s copy (a pandas DataFrame, convenient for EDA) and attach the numeric target labels from sklearn.datasets, which the modelling steps below rely on.






iris = sns.load_dataset('iris')       # pandas DataFrame with a 'species' column
iris['target'] = load_iris().target   # numeric labels (0, 1, 2) for modelling
iris.head()

Output:




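Before plotting, it is worth taking a quick numerical look at the data. The snippet below is a small addition to the original walkthrough, using standard pandas calls on the DataFrame we just loaded:

# Summary statistics for the four measurement columns
print(iris.drop(columns='target').describe())

# The dataset is balanced: 50 rows per species
print(iris['species'].value_counts())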
iris_setosa = iris.loc[iris["species"] == "setosa"]
iris_virginica = iris.loc[iris["species"] == "virginica"]
iris_versicolor = iris.loc[iris["species"] == "versicolor"]

sns.FacetGrid(iris,
              hue="species",
              height=3).map(sns.histplot,
                            "petal_length", kde=True).add_legend()
sns.FacetGrid(iris,
              hue="species",
              height=3).map(sns.histplot,
                            "petal_width", kde=True).add_legend()
sns.FacetGrid(iris,
              hue="species",
              height=3).map(sns.histplot,
                            "sepal_length", kde=True).add_legend()
plt.show()

Output:

Distribution plots of petal length, petal width, and sepal length for the three classes in the dataset

Insights from EDA:

  1. The petal length feature separates the three classes most cleanly
  2. Hence petal length should receive the highest feature importance in a model; the pairplot sketched below makes this easy to verify
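To double-check this across every pair of features, a pairplot is a common next step. This is a small sketch added here (not part of the original listing), using seaborn's pairplot on the DataFrame loaded above:

# Pairwise scatter plots of the four features, coloured by species;
# the petal_length / petal_width panels show the clearest separation
sns.pairplot(iris.drop(columns='target'), hue='species', height=2)
plt.show()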

Decision Tree Algorithm with Iris Dataset

A Decision Tree is a supervised machine learning algorithm and one of the most popular models for classification and prediction tasks. It recursively splits the data on feature thresholds, so the fitted model reads as a set of if-then rules.




X = iris.iloc[:, :4]   # the four measurement columns
y = iris['target']     # numeric class labels 0, 1, 2
X_train, X_test,\
    y_train, y_test = train_test_split(X, y,
                                       test_size=0.33,
                                       random_state=42)
treemodel = DecisionTreeClassifier()
treemodel.fit(X_train, y_train)

Now let’s check the performance of the Decision tree model.




plt.figure(figsize=(15, 10))
tree.plot_tree(treemodel, filled=True)
ypred = treemodel.predict(X_test)
score = accuracy_score(y_test, ypred)   # (y_true, y_pred) argument order
print(score)
print(classification_report(y_test, ypred))

Output:

0.98
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.94      1.00      0.97        15
           2       1.00      0.94      0.97        16

    accuracy                           0.98        50
   macro avg       0.98      0.98      0.98        50
weighted avg       0.98      0.98      0.98        50
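To see exactly where the single error occurs, a confusion matrix is a natural follow-up. This short sketch (an addition to the original listing) uses sklearn's confusion_matrix on the same predictions:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes;
# off-diagonal entries are the misclassifications
print(confusion_matrix(y_test, ypred))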

Analyzing the Decision Tree formed by the model

One of the advantages of decision trees over other models is that they are highly interpretable and perform feature selection automatically, so the fitted tree itself can be analysed. Looking at the tree above, we can see that the top splits are made on the petal measurements, which matches the insight from EDA that petal length is the most discriminative feature.
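As a quick sanity check on this reading, we can print the learned rules and the feature importances of the fitted model; tree.export_text and the feature_importances_ attribute are standard sklearn APIs:

# Text rendering of the learned decision rules
print(tree.export_text(treemodel, feature_names=list(X.columns)))

# Share of total impurity reduction attributed to each feature;
# we expect the petal features to dominate
for name, imp in zip(X.columns, treemodel.feature_importances_):
    print(f'{name}: {imp:.3f}')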

KMeans Clustering with Iris Dataset

K-means clustering is an unsupervised machine learning algorithm. Given a number of clusters k, it repeatedly assigns each point to its nearest centroid and then moves each centroid to the mean of its assigned points, stopping when the assignments no longer change. The quantity it drives down in the process is the within-cluster sum of squares (WCSS), which sklearn exposes as inertia_.
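For intuition, here is a minimal NumPy sketch of a single k-means iteration (assignment step plus update step). It assumes Euclidean distance and that no cluster ends up empty; it is illustrative only, since sklearn's KMeans used below handles initialization and convergence for us:

import numpy as np

def kmeans_step(points, centroids):
    # Assignment step: index of the nearest centroid for every point
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids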




wcss = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i,
                    init='k-means++',
                    max_iter=300,
                    n_init=10,
                    random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# From the WCSS values above, the elbow method
# tells us how many clusters to use.
kmeans = KMeans(n_clusters=3,
                init='k-means++',
                max_iter=300,
                n_init=10,
                random_state=0)
y_kmeans = kmeans.fit_predict(X)

In the above code we have used the elbow method to pick k: we fit k-means for k = 1 to 10 and record the WCSS (inertia) each time. When WCSS is plotted against k, the curve bends sharply (the "elbow") at k = 3, which matches the three species in the dataset.
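The elbow plot itself is not part of the original listing; a minimal sketch to produce it from the wcss list computed above:

# WCSS drops quickly up to k = 3 and flattens afterwards
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow method')
plt.show()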

Visualizing the Clusters




# Visualising the clusters on the first two features
# (the clustering itself used all four features);
# note that matching cluster indices to species names
# is an assumption, since k-means labels are arbitrary
cols = iris.columns
plt.scatter(X.loc[y_kmeans == 0, cols[0]],
            X.loc[y_kmeans == 0, cols[1]],
            s=100, c='purple',
            label='Iris-setosa')
plt.scatter(X.loc[y_kmeans == 1, cols[0]],
            X.loc[y_kmeans == 1, cols[1]],
            s=100, c='orange',
            label='Iris-versicolour')
plt.scatter(X.loc[y_kmeans == 2, cols[0]],
            X.loc[y_kmeans == 2, cols[1]],
            s=100, c='green',
            label='Iris-virginica')

# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            s=100, c='red',
            label='Centroids')

plt.xlabel(cols[0])
plt.ylabel(cols[1])
plt.legend()
plt.show()

Output:

Clusters obtained by using the K-means algorithm

 

Accuracy and Performance of the Model

Now let’s check the performance of the model.




# Rows: true species labels (0, 1, 2); columns: cluster indices
pd.crosstab(iris['target'], y_kmeans)

Output:

 

Since k-means is an unsupervised algorithm, we have no held-out test data on which to check the model's performance. From the crosstab, the Setosa class is clustered perfectly, while Versicolor has only 2 misclassified samples. Virginica overlaps with Versicolor, hence there are 14 misclassifications.
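Because the true species labels happen to be available here, we can still quantify how well the clusters agree with them. One common option, sketched below as an addition to the original article, is the adjusted Rand index from sklearn.metrics:

from sklearn.metrics import adjusted_rand_score

# ARI is 1.0 for a perfect match, near 0.0 for random assignments,
# and invariant to how the cluster indices are numbered
print(adjusted_rand_score(iris['target'], y_kmeans))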

