Decision Tree Introduction with example

Below are some assumptions that we make while using a decision tree:

As you can see from the above image, a Decision Tree works on the Sum of Product (SOP) form, which is also known as Disjunctive Normal Form. In the above image, we are predicting whether a person uses a computer in their daily life.

In a Decision Tree, the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection. We have two popular attribute selection measures:



  1. Information Gain
  2. Gini Index

1. Information Gain
When we use a node in a decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is a measure of this change in entropy.
Definition: Suppose S is a set of instances, A is an attribute, Sv is the subset of S with A = v, and Values(A) is the set of all possible values of A, then

    Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv),   summed over all v in Values(A)

Entropy
Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content.
Definition: Suppose S is a set of instances and pi is the proportion of instances in S that belong to class i, then

    Entropy(S) = - Σ pi * log2(pi),   summed over all classes i

Example:

For the set X = {a, a, a, b, b, b, b, b}
Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy H(X) = - [(3/8) * log2(3/8) + (5/8) * log2(5/8)]
             = - [0.375 * (-1.415) + 0.625 * (-0.678)]
             = - (-0.53 - 0.424)
             = 0.954
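
The same calculation can be reproduced in a few lines of Python. This is only a minimal sketch; the helper name entropy and the label list are illustrative, not part of the original example.

import math

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels.
    total = len(labels)
    probabilities = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probabilities)

# The set X = {a, a, a, b, b, b, b, b} from the example above.
X = ["a"] * 3 + ["b"] * 5
print(round(entropy(X), 3))   # prints 0.954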

Building Decision Tree using Information Gain
The essentials:

  1. Start with all training instances associated with the root node.
  2. Use information gain to choose the attribute to split on at each node.
  3. Recursively build each subtree on the subset of training instances that follows that branch.
  4. Stop splitting a branch when its subset of instances is pure (all instances belong to the same class).

Example:
Now, let's build a Decision Tree for the following data using Information Gain.

Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Here, we have 3 features and 2 output classes.
To build a decision tree using Information Gain, we take each feature and calculate the information gain obtained by splitting on it. The full training set contains 2 instances of class I and 2 instances of class II, so its entropy is 1.

Split on feature X
The branch X = 1 contains {I, I, II} and the branch X = 0 contains {II}.
Information Gain(X) = 1 - [(3/4) * 0.918 + (1/4) * 0] ≈ 0.311

Split on feature Y
The branch Y = 1 contains {I, I} and the branch Y = 0 contains {II, II}; both branches are pure.
Information Gain(Y) = 1 - [(2/4) * 0 + (2/4) * 0] = 1



Split on feature Z
The branch Z = 1 contains {I, II} and the branch Z = 0 contains {I, II}; both branches are still mixed.
Information Gain(Z) = 1 - [(2/4) * 1 + (2/4) * 1] = 0

From the above calculations we can see that the information gain is maximum when we split on feature Y, so feature Y is the best-suited feature for the root node. Moreover, when the dataset is split by feature Y, each child contains a pure subset of the target variable, so we do not need to split the dataset any further.
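
These three values can be checked with a short, self-contained Python sketch. The helper names entropy and information_gain below are illustrative, not from the original article; the sketch simply applies the Gain(S, A) formula to the table above.

import math

def entropy(labels):
    # Shannon entropy (base 2) of a list of class labels.
    total = len(labels)
    probabilities = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probabilities)

def information_gain(rows, feature_index):
    # Entropy of the parent minus the weighted entropy of the children
    # obtained by splitting on the feature at feature_index.
    # The class label is assumed to be the last column of each row.
    parent_labels = [row[-1] for row in rows]
    gain = entropy(parent_labels)
    for value in set(row[feature_index] for row in rows):
        child_labels = [row[-1] for row in rows if row[feature_index] == value]
        gain -= len(child_labels) / len(rows) * entropy(child_labels)
    return gain

# Training set from the table above: columns X, Y, Z and the class C.
data = [
    [1, 1, 1, "I"],
    [1, 1, 0, "I"],
    [0, 0, 1, "II"],
    [1, 0, 0, "II"],
]

for name, index in [("X", 0), ("Y", 1), ("Z", 2)]:
    print(name, round(information_gain(data, index), 3))
# prints: X 0.311, Y 1.0, Z 0.0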

The final tree for the above dataset would look like this:

          Y
        /   \
    Y = 1   Y = 0
      |       |
   Class I  Class II
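
As a cross-check, scikit-learn's DecisionTreeClassifier should build the same tree when trained with the entropy criterion on this data. This is a hedged sketch that assumes scikit-learn is installed; the library is not mentioned in the original article.

from sklearn.tree import DecisionTreeClassifier

# Feature columns are assumed to be ordered [X, Y, Z].
features = [[1, 1, 1],
            [1, 1, 0],
            [0, 0, 1],
            [1, 0, 0]]
labels = ["I", "I", "II", "II"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(features, labels)

# Index of the feature used at the root node: 0 = X, 1 = Y, 2 = Z.
print(clf.tree_.feature[0])   # prints 1, i.e. feature Y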

2. Gini Index
The Gini Index measures how often a randomly chosen instance would be incorrectly classified. For a set of instances it is computed as Gini = 1 - Σ (pj)^2, where pj is the proportion of instances belonging to class j, and the attribute with the lowest Gini index is preferred. Consider a training set of 16 instances (8 positive, 8 negative) and the following candidate splits on attributes A, B, C and D:

      A         B         C         D
   >= 5     >= 3.0    >= 4.2    >= 1.4
    < 5      < 3.0     < 4.2     < 1.4

Calculating Gini Index for Var A:
Value >= 5: 12 instances
Attribute A >= 5 & class = positive: 5/12
Attribute A >= 5 & class = negative: 7/12
Gini(5, 7) = 1 - [(5/12)^2 + (7/12)^2] = 0.486
Value < 5: 4 instances
Attribute A < 5 & class = positive: 3/4
Attribute A < 5 & class = negative: 1/4
Gini(3, 1) = 1 - [(3/4)^2 + (1/4)^2] = 0.375
By weighting and summing each of the Gini indices:
Gini(A) = (12/16) * 0.486 + (4/16) * 0.375 ≈ 0.458

Calculating Gini Index for Var B:
Value >= 3: 12 instances
Attribute B >= 3 & class = positive: 8/12
Attribute B >= 3 & class = negative: 4/12
Gini(8, 4) = 1 - [(8/12)^2 + (4/12)^2] = 0.444
Value < 3: 4 instances
Attribute B < 3 & class = positive: 0/4
Attribute B < 3 & class = negative: 4/4
Gini(0, 4) = 1 - [(0/4)^2 + (4/4)^2] = 0
By weighting and summing each of the Gini indices:
Gini(B) = (12/16) * 0.444 + (4/16) * 0 ≈ 0.333

Using the same approach, we can calculate the Gini index for the C and D attributes.
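
Since the arithmetic is identical for every attribute, all four indices can also be computed with a short Python sketch. The helper names gini and weighted_gini are illustrative; the (positive, negative) counts per branch are the ones summarised in the tables below.

def gini(pos, neg):
    # Gini impurity of a node with pos positive and neg negative instances.
    total = pos + neg
    return 1 - (pos / total) ** 2 - (neg / total) ** 2

def weighted_gini(branches):
    # Weighted Gini index of a split, given (pos, neg) counts per branch.
    total = sum(pos + neg for pos, neg in branches)
    return sum((pos + neg) / total * gini(pos, neg) for pos, neg in branches)

splits = {
    "A": [(5, 7), (3, 1)],   # A >= 5.0  /  A < 5.0
    "B": [(8, 4), (0, 4)],   # B >= 3.0  /  B < 3.0
    "C": [(0, 6), (8, 2)],   # C >= 4.2  /  C < 4.2
    "D": [(0, 5), (8, 3)],   # D >= 1.4  /  D < 1.4
}

for name, branches in splits.items():
    print(name, round(weighted_gini(branches), 3))
# prints: A 0.458, B 0.333, C 0.2, D 0.273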

              Positive    Negative
For A | >= 5.0     5           7
      | < 5.0      3           1
Gini Index of A = 0.45825

              Positive    Negative
For B | >= 3.0     8           4
      | < 3.0      0           4
Gini Index of B ≈ 0.333

              Positive    Negative
For C | >= 4.2     0           6
      | < 4.2      8           2
Gini Index of C = 0.2

              Positive    Negative
For D | >= 1.4     0           5
      | < 1.4      8           3
Gini Index of D = 0.273

Attribute C has the lowest Gini index, so it is the best-suited attribute for the split.
 

Reference: dataaspirant



