Prerequisites: Decision Tree, DecisionTreeClassifier, sklearn, numpy, pandas
Decision Tree is one of the most powerful and popular machine learning algorithms. The decision tree algorithm falls under the category of supervised learning algorithms. It works for both continuous and categorical output variables.
In this article, we are going to implement a decision tree algorithm on the Balance Scale Weight & Distance Database available from the UCI Machine Learning Repository.
Dataset Description :
Title : Balance Scale Weight & Distance Database
Number of Instances : 625 (49 balanced, 288 left, 288 right)
Number of Attributes : 4 (numeric) + class name = 5
Attribute Information :
 Class Name (Target variable): 3
 L [balance scale tip to the left]
 B [balance scale be balanced]
 R [balance scale tip to the right]
 46.08 percent are L
 07.84 percent are B
 46.08 percent are R
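The percentages above follow directly from the instance counts in the description. As a quick check, the arithmetic can be reproduced in a few lines of plain Python (the variable names are only illustrative):

```python
# Class counts as listed in the dataset description.
counts = {"L": 288, "B": 49, "R": 288}
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print(total)   # 625
print(shares)  # {'L': 46.08, 'B': 7.84, 'R': 46.08}
```

Class B accounts for fewer than 8 percent of the rows, so the dataset is noticeably imbalanced.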
Used Python Packages :

sklearn :
 In Python, sklearn is a machine learning package that includes many ML algorithms.
 Here, we are using some of its modules like train_test_split, DecisionTreeClassifier and accuracy_score.

NumPy :
 It is a numerical Python module that provides fast math functions for calculations.
 It is used to read data into NumPy arrays and to manipulate them.

Pandas :
 Used to read and write different file formats.
 Data manipulation can be done easily with dataframes.
Installation of the packages :
In Python, sklearn (scikit-learn) is the package that contains the modules required to implement the machine learning algorithm used here. You can install it with the commands given below.
using pip :
pip install -U scikit-learn
Before using the above command, make sure you have the scipy and numpy packages installed.
If you don't have pip, you can install it using
python get-pip.py
using conda :
conda install scikit-learn
Assumptions we make while using Decision tree :
 At the beginning, we consider the whole training set as the root.
 Attributes are assumed to be categorical for information gain; for the Gini index, attributes are assumed to be continuous.
 Records are distributed recursively on the basis of attribute values.
 We use statistical methods to order attributes as the root or as internal nodes.
Pseudocode :
 1. Find the best attribute and place it at the root node of the tree.
 2. Split the training set into subsets so that every record within a subset has the same value for the chosen attribute.
 3. Find the leaf nodes in all branches by repeating steps 1 and 2 on each subset.
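The three steps above can be sketched as a small recursive function. This is only a minimal illustration in plain Python, assuming categorical attributes and information gain as the selection measure; the function names and the toy data are hypothetical. It is not how sklearn's DecisionTreeClassifier is implemented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Return the attribute index whose split gives the highest information gain."""
    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    # Stop when the node is pure or no attributes remain: return the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)        # step 1
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in sorted(set(row[attr] for row in rows)):   # step 2
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        children[value] = build_tree(                      # step 3
            [rows[i] for i in keep], [labels[i] for i in keep], remaining)
    return {"attribute": attr, "children": children}


# Hypothetical toy data: (left_weight, right_weight) and the resulting tip direction.
rows = [(1, 1), (1, 2), (2, 1), (2, 2)]
labels = ["B", "R", "L", "B"]
tree = build_tree(rows, labels, attributes=[0, 1])
print(tree)
```

On this toy data both attributes give the same gain, so the first one is placed at the root and the second one is used one level below.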
While implementing the decision tree we will go through the following two phases:
 Building Phase
   Preprocess the dataset.
   Split the dataset into train and test sets using the Python sklearn package.
   Train the classifier.
 Operational Phase
   Make predictions.
   Calculate the accuracy.
Data Import :
Data Slicing :
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
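To make the slicing and the 70/30 split concrete, here is a rough sketch in plain Python. It uses the first five rows of the dataset as they appear in the program output, with lists in place of NumPy arrays. The holdout_split helper is hypothetical and only imitates the idea of train_test_split (shuffle with a fixed seed, then cut off a test share). It will not reproduce sklearn's exact shuffling.

```python
import math
import random

# First five rows of the dataset: class label first, then the 4 attributes.
rows = [["B", 1, 1, 1, 1], ["R", 1, 1, 1, 2], ["R", 1, 1, 1, 3],
        ["R", 1, 1, 1, 4], ["R", 1, 1, 1, 5]]

# Same column selection as balance_data.values[:, 1:5] and balance_data.values[:, 0].
X = [row[1:5] for row in rows]
Y = [row[0] for row in rows]


def holdout_split(X, Y, test_size, seed):
    """Shuffle the row indices with a fixed seed, then cut off a test share."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(X))   # test share, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [Y[i] for i in train_idx], [Y[i] for i in test_idx])


X_train, X_test, y_train, y_test = holdout_split(X, Y, test_size=0.3, seed=100)
print(len(X_train), len(X_test))   # 3 2
print(math.ceil(0.3 * 625))        # 188
```

For the full dataset of 625 rows, a test share of 0.3 rounded up gives 188 test rows, which agrees with the total support of 188 in the classification reports.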
Terms used in code :
The Gini index and information gain are both used to decide which of the n attributes of the dataset is placed at the root node or at an internal node.
Gini index
Entropy
Information Gain
Accuracy score
Confusion Matrix
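The first three terms can be illustrated with a short computation. The sketch below uses the standard textbook definitions (Gini index as 1 minus the sum of squared class proportions, entropy as the negative sum of p * log2(p), and information gain as parent entropy minus the size-weighted entropy of the children). The function names and the example node are hypothetical.

```python
from collections import Counter
from math import log2


def gini_index(labels):
    """Gini index: 1 - sum(p_i ** 2) over the class proportions p_i."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 4 'L' and 4 'R' samples, split into two children.
parent = ["L"] * 4 + ["R"] * 4
left = ["L"] * 3 + ["R"]
right = ["L"] + ["R"] * 3

print(gini_index(parent))                                  # 0.5
print(entropy(parent))                                     # 1.0
print(round(information_gain(parent, [left, right]), 3))   # 0.189
```

A perfectly mixed two-class node has the highest impurity (Gini index 0.5, entropy 1 bit). The split reduces the entropy by roughly 0.19 bits, which is its information gain.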
Below is the Python code for the decision tree.
# Run this program on your local Python
# interpreter, provided you have installed
# the required libraries.

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


# Function importing the dataset
def importdata():
    # Adjust this path to the location of your copy of the dataset
    balance_data = pd.read_csv(
        'databases/balancescale/balancescale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data


# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable (column 0) from the attributes
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test


# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(
        criterion="gini", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini


# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy


# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set with the given classifier
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred


# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))


# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)


# Calling main function
if __name__ == "__main__":
    main()

Data Information:
Dataset Length: 625
Dataset Shape: (625, 5)
Dataset:
   0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5

Results Using Gini Index:
Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
Accuracy : 73.4042553191
Report :
             precision    recall  f1-score   support
          B       0.00      0.00      0.00        13
          L       0.73      0.79      0.76        85
          R       0.74      0.79      0.76        90
avg / total       0.68      0.73      0.71       188

Results Using Entropy:
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
Accuracy : 70.7446808511
Report :
             precision    recall  f1-score   support
          B       0.00      0.00      0.00        13
          L       0.71      0.74      0.72        85
          R       0.71      0.78      0.74        90
avg / total       0.66      0.71      0.68       188
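To see how the accuracy score and the per-class figures relate to the confusion matrix, the numbers reported for the Gini tree can be recomputed by hand. The sketch below assumes the usual sklearn layout, with rows as true classes and columns as predicted classes in the sorted order B, L, R. The variable names are illustrative.

```python
# Confusion matrix reported for the Gini tree.
# Rows are true classes, columns are predicted classes, in the order B, L, R.
labels = ["B", "L", "R"]
matrix = [[0, 6, 7],
          [0, 67, 18],
          [0, 19, 71]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))   # diagonal entries
accuracy = correct / total
print(total, correct, round(100 * accuracy, 2))   # 188 138 73.4

report = {}
for i, label in enumerate(labels):
    predicted = sum(row[i] for row in matrix)   # column sum: times the class was predicted
    support = sum(matrix[i])                    # row sum: true samples of the class
    precision = matrix[i][i] / predicted if predicted else 0.0
    recall = matrix[i][i] / support
    report[label] = (round(precision, 2), round(recall, 2), support)

print(report)
# {'B': (0.0, 0.0, 13), 'L': (0.73, 0.79, 85), 'R': (0.74, 0.79, 90)}
```

The first column of the matrix is all zeros, meaning the tree never predicts class B. That is why precision and recall for B are both 0.00 in the report.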
