
Iterative Dichotomiser 3 (ID3) Algorithm From Scratch

In the realm of machine learning and data mining, decision trees stand as versatile tools for classification and prediction tasks. The ID3 (Iterative Dichotomiser 3) algorithm is one of the foundational pillars upon which decision tree learning is built. Developed by Ross Quinlan in the 1980s, ID3 remains a fundamental algorithm: it is the direct ancestor of C4.5 and shares its core ideas with related tree-based methods such as CART (Classification and Regression Trees).

Introduction to Decision Trees

Decision trees are machine learning models that recursively partition the input data according to feature values in order to reach a decision. Every internal node represents a test on a feature, and every branch denotes a possible outcome of that test. The tree structure makes the model simple to interpret and visualize. Every leaf node produces a class label or prediction. At each step of construction, the best feature is chosen so as to maximize information gain (equivalently, to minimize impurity). Decision trees are versatile and can be used for both classification and regression tasks. Although they can overfit, this is frequently mitigated by techniques such as pruning.



Decision Trees

Before delving into the intricacies of the ID3 algorithm, let’s grasp the essence of decision trees. Picture a tree-like structure where each internal node represents a test on an attribute, each branch signifies an outcome of that test, and each leaf node denotes a class label or a decision. Decision trees mimic human decision-making processes by recursively splitting data based on different attributes to create a flowchart-like structure for classification or regression.

ID3 Algorithm

The Iterative Dichotomiser 3 (ID3) algorithm is a well-known decision tree method in machine learning. It recursively builds a tree by choosing, at each node, the attribute that best partitions the data according to information gain. The goal is to make the resulting subsets as homogeneous as possible: ID3 grows the tree by repeatedly selecting the feature that yields the greatest reduction in entropy (uncertainty). The process continues until a stopping criterion is met, such as a minimum subset size or a maximum tree depth. Although ID3 is a foundational method, later algorithms such as C4.5 and CART have addressed several of its limitations.



How ID3 Works

The ID3 algorithm is specifically designed for building decision trees from a given dataset. Its primary objective is to construct a tree that best explains the relationship between attributes in the data and their corresponding class labels.

1. Selecting the Best Attribute: at each node, ID3 evaluates every candidate attribute and picks the one with the highest information gain as the splitting criterion.

2. Creating Tree Nodes: the chosen attribute becomes an internal node, and one branch is created per attribute value (or per side of a threshold for numeric features), with the data partitioned accordingly.

3. Stopping Criteria: recursion stops when a subset is pure (all instances share one class), no attributes remain, or no split yields positive information gain; the node then becomes a leaf labelled with the (majority) class.

4. Handling Missing Values: classic ID3 has no native support for missing values, so in practice they are imputed or filtered out before training (C4.5 later added built-in handling).

5. Tree Pruning: after (or during) construction, branches that add little predictive value can be pruned to reduce overfitting. A compact sketch of how the first three steps fit together appears right after this list.
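To see how steps 1-3 interact before diving into the full implementation, here is a compact, self-contained sketch of the classic ID3 recursion for purely categorical attributes. The names (id3_outline, class_entropy) and the dictionary-based tree representation are illustrative choices for this sketch only; missing-value handling and pruning are intentionally omitted, and the binary, threshold-based implementation developed later in this article does not depend on it.

from collections import Counter
import math

def class_entropy(labels):
    # Entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def id3_outline(rows, labels, attributes):
    # Stopping criteria: pure subset, or no attributes left to split on
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # majority class

    # Selecting the best attribute: the one with the highest information gain
    def info_gain(attr):
        remainder = 0.0
        for v in set(row[attr] for row in rows):
            subset = [l for row, l in zip(rows, labels) if row[attr] == v]
            remainder += (len(subset) / len(rows)) * class_entropy(subset)
        return class_entropy(labels) - remainder

    best = max(attributes, key=info_gain)

    # Creating tree nodes: one branch per observed value of the chosen attribute
    tree = {best: {}}
    for v in set(row[best] for row in rows):
        keep = [(row, l) for row, l in zip(rows, labels) if row[best] == v]
        sub_rows = [row for row, _ in keep]
        sub_labels = [l for _, l in keep]
        tree[best][v] = id3_outline(sub_rows, sub_labels,
                                    [a for a in attributes if a != best])
    return tree

Here each row is assumed to be a dictionary mapping attribute names to values, and the returned tree is a nested dictionary whose leaves are class labels.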

Mathematical Concepts of ID3 Algorithm

Now let’s examine the formulas linked to the main theoretical ideas in the ID3 algorithm:

1. Entropy

Entropy is a measure of disorder or uncertainty in a set of data. ID3 uses entropy to quantify the impurity of a dataset; the objective is to reduce entropy by dividing the data into subsets that are as homogeneous as possible.

For a set S with classes {c1, c2, …, cn}, the entropy is calculated as:

Entropy(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)

where p_i is the proportion of instances of class c_i in the set.
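As a quick worked example (the counts are illustrative, borrowed from the well-known play-tennis dataset), a set S of 14 instances with 9 positive and 5 negative examples has entropy:

Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940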

2. Information Gain

Information Gain measures how well a given attribute reduces uncertainty. At each step, ID3 splits the data on the attribute that maximizes Information Gain, computed as the difference between the entropy before the split and the weighted entropy after the split.

For a set S and an attribute A with possible values Values(A), the Information Gain is:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

where |S_v| is the size of the subset of S for which attribute A has value v, and |S| is the size of S.
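Continuing the same illustrative numbers, suppose an attribute Wind splits the 14 instances into 8 "Weak" examples (6 positive, 2 negative, entropy ≈ 0.811) and 6 "Strong" examples (3 positive, 3 negative, entropy = 1.0). Then:

Gain(S, Wind) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.0) \approx 0.048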

3. Gain Ratio

Gain Ratio is a refinement of Information Gain that accounts for attributes with many distinct values. Plain Information Gain is biased toward such attributes (in the extreme, an identifier-like attribute splits the data into singletons and appears maximally informative), and Gain Ratio corrects for this by normalizing the gain.
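The usual definition, as popularized by C4.5, normalizes the Information Gain by the split information of the attribute:

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}

SplitInformation(S, A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}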

Iterative Dichotomiser 3 (ID3) Implementation using Python

Let’s create a simplified version of the ID3 algorithm from scratch using Python.

Importing Libraries

Importing the necessary libraries:

from collections import Counter
import numpy as np


Defining Node Class

class Node:
    def __init__(self, feature=None, value=None, results=None, true_branch=None, false_branch=None):
        self.feature = feature  # Feature to split on
        self.value = value      # Value of the feature to split on
        self.results = results  # Stores class labels if node is a leaf node
        self.true_branch = true_branch  # Branch for values that are True for the feature
        self.false_branch = false_branch  # Branch for values that are False for the feature


The provided Python code defines a class called Node for constructing nodes in a decision tree. Each node encapsulates information crucial for decision-making within the tree. The feature attribute signifies the feature used for splitting, while value stores the specific value of that feature for the split. In the case of a leaf node, results holds class labels. The node also has branches, with true_branch representing the path for values evaluating to True for the feature, and false_branch for values evaluating to False. This class forms a fundamental building block for creating decision trees, enabling the representation of decision points and outcomes in a hierarchical structure.
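As a small illustration (the values here are hypothetical, chosen to mirror the toy dataset used later in this article), a leaf node and an internal node could be created like this:

leaf_no = Node(results=0)              # leaf predicting class 0
leaf_yes = Node(results=1)             # leaf predicting class 1
root = Node(feature=0, value=0,        # test: is feature 0 <= 0 ?
            true_branch=leaf_no,       # samples with feature 0 <= 0
            false_branch=leaf_yes)     # samples with feature 0 > 0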

Entropy Calculation Function

def entropy(data):
    # data is expected to be a 1-D array of non-negative integer class labels
    counts = np.bincount(data)
    probabilities = counts / len(data)
    # Sum -p * log2(p) over the non-zero class probabilities
    entropy = -np.sum([p * np.log2(p) for p in probabilities if p > 0])
    return entropy


The entropy function calculates the entropy of a given dataset using the information-entropy formula defined above. It first computes the count of occurrences of each unique label with np.bincount, converts these counts into probabilities, and then evaluates -Σ p_i log2(p_i). The list comprehension skips zero probabilities so the logarithm is never taken of zero, avoiding mathematical errors. The result is the entropy of the input labels, reflecting their degree of disorder or uncertainty.
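A quick sanity check of this function (assuming numpy has been imported as above): a perfectly balanced binary label array has entropy 1 bit, while a pure array has entropy 0.

print(entropy(np.array([1, 1, 0, 0])))  # 1.0  -> maximum uncertainty for two classes
print(entropy(np.array([1, 1, 1, 1])))  # -0.0 -> a pure set has (effectively) zero entropy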

Splitting Data Function

def split_data(X, y, feature, value):
    true_indices = np.where(X[:, feature] <= value)[0]
    false_indices = np.where(X[:, feature] > value)[0]
    true_X, true_y = X[true_indices], y[true_indices]
    false_X, false_y = X[false_indices], y[false_indices]
    return true_X, true_y, false_X, false_y


The split_data function divides a dataset into two subsets based on a specified feature and threshold value. It uses NumPy to identify indices where the feature values satisfy the condition (<= value for the true branch and > value for the false branch). Then, it extracts the corresponding subsets for features (true_X and false_X) and labels (true_y and false_y). The function returns these subsets, enabling the partitioning of data for further use in constructing a decision tree.
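For instance, applying split_data to the toy dataset used at the end of this article, with feature 0 and threshold 0, separates the two classes completely:

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

true_X, true_y, false_X, false_y = split_data(X, y, feature=0, value=0)
print(true_y)   # [0 0] -> rows where feature 0 <= 0
print(false_y)  # [1 1] -> rows where feature 0 > 0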

Building the Tree Function

def build_tree(X, y):
    # Base case: all samples share one label, so create a leaf node
    if len(set(y)) == 1:
        return Node(results=y[0])
 
    best_gain = 0
    best_criteria = None
    best_sets = None
    n_features = X.shape[1]
 
    current_entropy = entropy(y)
 
    for feature in range(n_features):
        feature_values = set(X[:, feature])
        for value in feature_values:
            true_X, true_y, false_X, false_y = split_data(X, y, feature, value)
            true_entropy = entropy(true_y)
            false_entropy = entropy(false_y)
            p = len(true_y) / len(y)
            gain = current_entropy - p * true_entropy - (1 - p) * false_entropy
 
            if gain > best_gain:
                best_gain = gain
                best_criteria = (feature, value)
                best_sets = (true_X, true_y, false_X, false_y)
 
    if best_gain > 0:
        true_branch = build_tree(best_sets[0], best_sets[1])
        false_branch = build_tree(best_sets[2], best_sets[3])
        return Node(feature=best_criteria[0], value=best_criteria[1], true_branch=true_branch, false_branch=false_branch)
 
    # Fallback leaf: no split yields positive gain, so predict the majority class
    return Node(results=Counter(y).most_common(1)[0][0])


The build_tree function recursively constructs a decision tree using the ID3 algorithm. It first checks if the labels in the current subset are homogenous; if so, it creates a leaf node with the corresponding class label. Otherwise, it iterates through all features and values, calculating information gain for each split and identifying the one with the highest gain. The function then recursively calls itself to build the true and false branches using the best split criteria. The resulting decision tree is constructed and returned. The process continues until further splits do not yield positive information gain, resulting in the creation of leaf nodes.
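To inspect the structure that build_tree produces, a small helper like the following can be handy. This print_tree function is an illustrative addition rather than part of the original implementation; it relies only on the Node attributes defined earlier.

def print_tree(node, indent=""):
    # Leaf node: print the stored class label
    if node.results is not None:
        print(f"{indent}Predict: {node.results}")
        return
    # Internal node: print the splitting rule, then recurse into both branches
    print(f"{indent}Is feature[{node.feature}] <= {node.value}?")
    print(f"{indent}--> True branch:")
    print_tree(node.true_branch, indent + "    ")
    print(f"{indent}--> False branch:")
    print_tree(node.false_branch, indent + "    ")

For the toy dataset built below, this prints a single split on feature 0 followed by two leaves.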

Prediction Function

def predict(tree, sample):
    if tree.results is not None:
        return tree.results
    else:
        branch = tree.false_branch
        if sample[tree.feature] <= tree.value:
            branch = tree.true_branch
        return predict(branch, sample)


The predict function uses a trained decision tree to predict the class label for a given sample. It recursively traverses the tree: if the current node is a leaf (indicated by a non-None results attribute), it returns the stored class label. Otherwise, it compares the sample's value for the node's splitting feature against the node's threshold to choose the true or false branch, and calls itself on that branch until a leaf is reached, yielding the final prediction for the input sample.

Dataset and Tree Building

X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
 
# Building the tree
decision_tree = build_tree(X, y)


The code creates a dataset X with binary features and their corresponding labels y. Then, it constructs a decision tree using the build_tree function, which recursively builds the tree using the ID3 algorithm based on the provided dataset. The resulting decision_tree is the root node of the constructed decision tree.

Prediction

sample = np.array([1, 0])
prediction = predict(decision_tree, sample)
print(f"Prediction for sample {sample}: {prediction}")


Output:

Prediction for sample [1 0]: 1


Advantages and Limitations of ID3

Advantages

- Simple to understand and implement; the resulting tree is easy to interpret and visualize.
- Requires little data preparation and handles categorical attributes naturally.
- Trains quickly on small to medium datasets, and the greedy entropy-based splits usually produce compact trees.

Limitations

- Prone to overfitting, especially on noisy data, unless the tree depth is limited or the tree is pruned.
- Information Gain is biased toward attributes with many distinct values.
- Classic ID3 does not natively handle continuous attributes or missing values (extensions such as C4.5 address this), and its greedy search can miss globally better trees.

Conclusion

The ID3 algorithm laid the groundwork for decision tree learning, providing a robust framework for understanding attribute selection and recursive partitioning. Despite its limitations, ID3’s simplicity and interpretability have paved the way for more sophisticated algorithms that address its drawbacks while retaining its essence.

As machine learning continues to evolve, the ID3 algorithm remains a crucial piece in the mosaic of tree-based methods, serving as a stepping stone for developing more advanced and accurate models in the quest for efficient data analysis and pattern recognition.

