
Decision Tree Algorithms

Last Updated : 11 Nov, 2023

Decision trees are a type of machine-learning algorithm that can be used for both classification and regression tasks. They work by learning simple decision rules inferred from the data features. These rules can then be used to predict the value of the target variable for new data samples.

Decision trees are represented as tree structures, where each internal node represents a feature, each branch represents a decision rule, and each leaf node represents a prediction. The algorithm works by recursively splitting the data into smaller and smaller subsets based on the feature values. At each node, the algorithm chooses the feature that best splits the data into groups with different target values.

Understanding Decision Trees

Decision trees are a flexible and comprehensible machine learning approach for both classification and regression. The model is a tree-like structure in which each internal node represents a test on a feature and each leaf node represents the final outcome, such as a class label for classification or a numerical value for regression.

The tree is built recursively, beginning at the root node and repeatedly selecting the most informative feature so that the data is divided into subsets that are as pure as possible with respect to the target variable. This procedure continues until a stopping condition is met, typically when a specified depth is reached or a node contains fewer than a minimum number of data points. Because they are easy to visualize and understand, decision trees are a good tool for explaining the logic behind predictions.

They are prone to overfitting, however, which produces overly complicated trees; pruning methods are used to mitigate this. Decision trees also provide the foundation for ensemble techniques such as Random Forests and Gradient Boosting, which aggregate many trees to increase prediction accuracy. In short, decision trees are an essential machine learning tool, valued for their versatility, interpretability, and ease of use.

Components of a Decision Tree

Before we dive into the types of Decision Tree Algorithms, we need to know about the following important terms:

  • Root Node: It is the topmost node in the tree, which represents the complete dataset. It is the starting point of the decision-making process.
  • Internal Node: A node that represents a test on an input feature. Internal nodes branch to other internal nodes or to leaf nodes.
  • Leaf/Terminal Node: A node without any child nodes that indicates a class label or a numerical value.
  • Parent Node: The node that divides into one or more child nodes.
  • Child Node: The nodes that emerge when a parent node is split.

Working of the Decision Tree Algorithm

Whether used for classification or regression, the decision tree algorithm is a flexible and easily interpreted machine learning technique. It builds a tree-like structure in which internal nodes represent decisions or tests on feature values and leaf nodes represent the final results.

Here’s a detailed breakdown of how the decision tree algorithm works:

  • The process starts at the root node with the full dataset. The algorithm chooses a feature and a threshold that divide the data into distinct classes or values as effectively as possible; depending on the task (classification or regression), the feature and threshold are selected to maximize information gain or minimize impurity.
  • The data is then separated into subsets according to the outcome of the feature test. For example, when the feature “Age” is used with a threshold of 30, the data is divided into two subsets: records with Age less than or equal to 30, and records with Age greater than 30.
  • The splitting procedure is repeated for every subset, producing child nodes. This recursion continues until a stopping condition is satisfied; common stopping criteria include a minimum number of data points in a node, a predetermined maximum tree depth, or no additional information gain from further splits (a minimal sketch of this recursive splitting follows this list).
  • When a stopping condition is met, the node becomes a leaf node, which represents the final decision or prediction. In classification, each leaf node is assigned the class label that is most common within its subset; in regression, the leaf node usually stores the mean or median of the target variable within the subset.
  • The resulting tree structure is easy to interpret: a decision path from the root to a leaf node can be read as a set of rules, which makes the model’s reasoning intuitive to understand.
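
To make the recursive splitting concrete, here is a minimal, illustrative sketch in plain Python and pandas, assuming a DataFrame X of numeric features and a Series y of class labels; the helper names (gini, best_split, build_tree) and the toy data are our own for illustration, not part of any library.

Python3

import numpy as np
import pandas as pd

def gini(labels):
    # Impurity of a set of class labels: 1 - sum(p_i^2)
    p = labels.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

def best_split(X, y):
    # Try every feature and every midpoint threshold, and keep the split
    # with the lowest weighted impurity of the two children.
    best = None
    for feature in X.columns:
        values = np.sort(X[feature].unique())
        for threshold in (values[:-1] + values[1:]) / 2:
            left, right = y[X[feature] <= threshold], y[X[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best

def build_tree(X, y, depth=0, max_depth=3, min_samples=5):
    # Stop and create a leaf when the node is pure, too small, or too deep.
    if y.nunique() == 1 or len(y) < min_samples or depth == max_depth:
        return {"leaf": y.mode()[0]}
    split = best_split(X, y)
    if split is None:  # no usable split left
        return {"leaf": y.mode()[0]}
    feature, threshold, _ = split
    mask = X[feature] <= threshold
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

# Example usage on a tiny toy dataset
data = pd.DataFrame({"Age": [22, 25, 47, 52, 46, 56, 55, 60],
                     "Outcome": [0, 0, 1, 1, 1, 1, 0, 1]})
print(build_tree(data[["Age"]], data["Outcome"]))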

Understanding the Key Mathematical Concepts Behind Decision Trees

To comprehend decision trees fully, it’s essential to delve into the underlying mathematical concepts that drive their decision-making process. At the heart of decision trees lie two fundamental metrics: entropy and Gini impurity. These metrics measure the impurity or disorder within a dataset and are pivotal in determining the optimal feature for splitting the data.

Entropy: Entropy, denoted by H(D) for a dataset D, measures its impurity or disorder. In the context of decision trees, entropy represents the uncertainty associated with the class labels of the data points. If a dataset is perfectly pure (all data points belong to the same class), the entropy is 0. If the classes are evenly distributed, the entropy is at its maximum.

Mathematically, entropy is calculated using the formula:

H(D) = -\Sigma^n _{i=1}\;p_{i}\; log_{2}(p_{i})

Where pi ​represents the proportion of data points belonging to class i in the dataset D. The base 2 logarithm is used to calculate entropy, resulting in entropy values measured in bits.
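
As a quick worked example of the formula, the snippet below computes entropy for a few small label sets; the entropy helper is a simple illustration written for this article, not a library function. An evenly split set yields 1 bit, while a pure set yields 0.

Python3

import numpy as np
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the classes present in D
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p))) if len(p) > 1 else 0.0

print(entropy(["yes", "yes", "no", "no"]))             # 1.0 -> classes evenly split
print(entropy(["yes", "yes", "yes", "yes"]))           # 0.0 -> perfectly pure
print(round(entropy(["yes", "yes", "yes", "no"]), 3))  # 0.811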

Information Gain: Information gain is a metric used to determine the effectiveness of a feature in reducing entropy. It quantifies the reduction in uncertainty (entropy) achieved by splitting the data based on a specific feature. Features with higher information gain are preferred for node splitting in decision trees.

Mathematically, information gain is calculated as follows:

Information\; Gain = H(D) - \Sigma^V_{v=1} \frac{|D_{v}|}{|D|}H (D_{v})

Where V is the number of values (unique outcomes) the feature can take, Dv ​represents the subset of data points for which the feature has the vth value, and ∣D∣ denotes the total number of data points in dataset D.
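
The sketch below evaluates information gain for a toy split of ten labels into two subsets; the helper functions are illustrative and self-contained, not taken from any library.

Python3

import numpy as np

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p))) if len(p) > 1 else 0.0

def information_gain(parent, subsets):
    # H(parent) minus the size-weighted entropy of the child subsets
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Toy example: splitting 10 labels on some feature produces two subsets
parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left   = [1, 1, 1, 1, 0]   # e.g. feature value <= threshold
right  = [1, 0, 0, 0, 0]   # e.g. feature value >  threshold
print(round(information_gain(parent, [left, right]), 3))  # 0.278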

Gini Impurity: Gini impurity, often used in algorithms like CART (Classification and Regression Trees), measures the probability of misclassifying a randomly chosen element if it were randomly labeled according to the class distribution in the dataset. Gini impurity is computationally efficient and works well for binary splits.

Mathematically, Gini impurity for a dataset D is calculated as:

Gini(D) = 1 - \Sigma^n _{i=1}\; p_{i}^2

Where pi represents the proportion of data points belonging to class i in dataset D. Lower Gini impurity values indicate a purer dataset.
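
For comparison with entropy, here is a small illustrative sketch that evaluates the Gini formula on similar label sets; for two classes the maximum value is 0.5, reached when the classes are evenly mixed.

Python3

import numpy as np

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1 - np.sum(p ** 2))

print(gini([1, 1, 0, 0]))            # 0.5 -> evenly mixed
print(gini([1, 1, 1, 1]))            # 0.0 -> pure
print(round(gini([1, 1, 1, 0]), 3))  # 0.375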

Types of Decision Tree Algorithms

The different decision tree algorithms are listed below:

  • ID3 (Iterative Dichotomiser 3)
  • C4.5
  • CART (Classification and Regression Trees)
  • CHAID (Chi-Square Automatic Interaction Detection)
  • MARS (Multivariate Adaptive Regression Splines)

ID3 (Iterative Dichotomiser 3)

ID3 (Iterative Dichotomiser 3) is a decision tree algorithm used for classification tasks. Created by Ross Quinlan in 1986, it is one of the earliest and most widely used decision tree algorithms. ID3 builds a decision tree from a given dataset using a greedy, top-down methodology.

It works by greedily choosing the feature that maximizes the information gain at each node. ID3 calculates entropy and information gain for each feature and selects the feature with the highest information gain for splitting.

ID3 uses entropy to measure the uncertainty or disorder in a dataset. Entropy, denoted by H(D) for dataset D, is calculated using the formula:

H(D) = -\Sigma^n _{i=1}\;p_{i}\; log_{2}(p_{i})

Information gain quantifies the reduction in entropy achieved by splitting the data based on a particular feature. Features with higher information gain are preferred for splitting. Information gain is calculated as follows:

Information\; Gain = H(D) - \Sigma^V_{v=1} \frac{|D_{v}|}{|D|}H (D_{v})

The ID3 algorithm recursively partitions the dataset at every decision tree node according to the chosen attribute. This process continues until either there are no more attributes to split on or all the examples in a node belong to the same class.

After the decision tree is constructed, it may be pruned to improve generalization and reduce overfitting. Pruning removes nodes that do not contribute meaningfully to the accuracy of the tree.

A couple of the ID3 algorithm’s drawbacks are that it tends to overfit the training set and cannot directly handle continuous attributes. Owing to these drawbacks, other decision tree algorithms that address some of these problems have been developed, including C4.5 and CART.

Entropy, information gain, and recursive partitioning are three key principles in the ID3 algorithm, which is a fundamental technique for creating decision trees. Mastering these ideas is crucial to learning about decision tree algorithms in machine learning.
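
Scikit-Learn does not ship an exact ID3 implementation (its trees are binary, CART-style trees), but setting criterion='entropy' makes the splits information-gain based, which is the closest built-in equivalent; the snippet below is a minimal sketch on a toy dataset.

Python3

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion='entropy' selects splits by information gain, as ID3 does,
# although the resulting tree itself is a binary CART-style tree.
id3_like = DecisionTreeClassifier(criterion="entropy", random_state=42)
id3_like.fit(X, y)
print(id3_like.get_depth(), id3_like.get_n_leaves())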

C4.5

C4.5 is a decision tree algorithm created by Ross Quinlan as an enhancement to ID3, and it is a popular approach for building decision trees in machine learning and data mining applications. C4.5 addresses several drawbacks of ID3, including its inability to handle continuous attributes directly and its tendency to overfit the training data.

To address the bias of information gain towards attributes with many values, C4.5 uses a modification called the gain ratio. It is computed by dividing the information gain by the split information (also called intrinsic information), a measure of how much data is needed to describe an attribute’s values.

Gain\; Ratio = \frac{Information\; Gain}{Split\; Information}

Where Split Information represents the entropy of the feature itself. The feature with the highest gain ratio is chosen for splitting.
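
As a small worked example, the sketch below computes the gain ratio for the same kind of toy split used earlier; the helper functions are illustrative, and the split information is simply the entropy of the subset sizes.

Python3

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(-p * np.log2(p))) if len(p) > 1 else 0.0

def gain_ratio(parent, subsets):
    n = len(parent)
    weights = np.array([len(s) / n for s in subsets])
    info_gain = entropy(parent) - sum(w * entropy(s) for w, s in zip(weights, subsets))
    # Split information: entropy of the partition sizes, penalising many-valued splits
    split_info = float(np.sum(-weights * np.log2(weights)))
    return info_gain / split_info

parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
left, right = [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]
print(round(gain_ratio(parent, [left, right]), 3))  # 0.278 / 1.0 = 0.278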

When dealing with continuous attributes, C4.5 first sorts the attribute’s values and then considers the midpoint between each pair of adjacent values as a potential split point. It then calculates the information gain or gain ratio for each candidate and selects the split point with the highest value.
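
A quick illustrative sketch of that candidate-generation step, assuming the attribute values are held in a NumPy array: sort the distinct values and take the midpoints of adjacent pairs.

Python3

import numpy as np

ages = np.array([22, 35, 35, 27, 41, 30])    # hypothetical continuous attribute
values = np.unique(ages)                     # sorted distinct values
candidates = (values[:-1] + values[1:]) / 2  # midpoints between neighbours
print(candidates)                            # [24.5 28.5 32.5 38. ]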

C4.5 can also produce rules from the decision tree by turning every path from the root to a leaf into a rule. These rules can then be used to make predictions on new data.

C4.5 is an effective technique for building decision trees that can handle both discrete and continuous attributes and generate rules from the tree. Its use of the gain ratio and of pruning improves accuracy and helps prevent overfitting. Nevertheless, it can still be sensitive to noisy data and may not perform well on datasets with a large number of features.

CART (Classification and Regression Trees)

CART is a decision tree algorithm that can be used for both classification and regression tasks. It works by finding splits that minimize the Gini impurity, a measure of impurity in the data. CART uses Gini impurity for classification. When selecting a feature to split, it calculates the Gini impurity for each possible split and chooses the one with the lowest impurity.

Gini impurity measures the likelihood of incorrectly classifying an element selected at random if it were labeled at random according to the distribution of labels in the set.

  • Gini Impurity (for Classification): CART uses Gini impurity as the criterion to measure the impurity or purity of a dataset. Gini impurity, denoted by Gini(D) for dataset D, is calculated using the formula:
    Gini(D) = 1 - \Sigma^n _{i=1}\; p^2_{i}
    In addition to classification, CART can build regression trees for continuous target variables. In that case, the algorithm chooses splits that minimize the variance of the target variable within each subset.
  • Mean Squared Error (for Regression): For regression tasks, CART uses mean squared error (MSE) to evaluate splits. MSE measures the average squared difference between the predicted and actual values, and the split with the lowest MSE is chosen (see the regression sketch after this list).
    MSE(D) = \frac{1}{|D|}\Sigma^{|D|} _{i=1}(y_{i} - \overline{y})^2
    Where y_{i} represents the target values, \overline{y} is the mean of the target values in dataset D, and |D| is the number of data points in D.
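
Since the implementation section later in the article uses DecisionTreeClassifier, here is the regression counterpart as a short sketch: a DecisionTreeRegressor with the squared-error criterion fitted on synthetic data (the data and parameter values are illustrative, not tuned).

Python3

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve
rng = np.random.RandomState(42)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# criterion='squared_error' chooses the split that minimises MSE in each node
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, random_state=42)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # predicted value near sin(2.5) ~ 0.6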

CART uses a greedy strategy, recursively splitting the dataset on the feature that yields the purest child nodes at each step. For each attribute it examines all potential split points and selects the one that produces the lowest Gini impurity (for classification) or the lowest error (for regression) in the resulting subsets.

Once the decision tree is constructed, CART reduces overfitting with a method known as cost-complexity pruning. A complexity parameter is added to the impurity measure, and the algorithm selects the subtree that minimizes the total cost, i.e. the sum of the impurity and the complexity penalty.

CART builds binary trees: every internal node has exactly two child nodes. This simplifies the splitting procedure and makes the resulting trees easier to interpret.

CHAID (Chi-Square Automatic Interaction Detection)

CHAID is a decision tree algorithm that uses chi-square tests to determine the best splits for categorical variables. It works by recursively splitting the data into smaller and smaller subsets until each subset contains only data points of the same class or within a certain range of values. The algorithm selects the feature to split on at each node based on the chi-squared test of independence, which is a statistical test that measures the relationship between two variables. In CHAID, the algorithm selects the feature that has the highest chi-squared statistic, which means that it has the strongest relationship with the target variable. It is particularly useful for analyzing large datasets with many categorical variables.

To perform the Chi-Square test, one must compute the Chi-Square statistic, which may be found using the following formula:

X^2 = \Sigma \frac{(O_{i} - E_{i})^2}{E_{i}}

Where Oi represents the observed frequency and Ei represents the expected frequency in each category. The observed distribution is compared to the expected distribution using the Chi-Square statistic to see if there is a significant difference.
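
As a worked illustration of the statistic, the snippet below computes it for a small, made-up contingency table of a categorical feature against the target, using SciPy's chi2_contingency.

Python3

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = feature categories, columns = target classes
observed = np.array([
    [30, 10],   # category A: 30 negatives, 10 positives
    [15, 25],   # category B: 15 negatives, 25 positives
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p_value, 4))  # larger chi2 / smaller p => stronger association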

CHAID can be used for both classification and regression tasks. In classification tasks, the algorithm predicts the class label of a new data point by traversing the decision tree from the root node to a leaf node. The class label of the leaf node is then assigned to the new data point.

In regression tasks, CHAID predicts the value of the target variable for a new data point by averaging the values of the target variable at the leaf node where the new data point falls.

MARS (Multivariate Adaptive Regression Splines)

MARS is an extension of CART that uses splines to model non-linear relationships between variables. It is a regression algorithm that uses forward stepwise selection to construct a piecewise linear model: a model in which the output is a linear function of the inputs, but the slope of that function can change at different points in the input space.

Basis Functions: MARS represents the relationship between the predictors and the response variable using basis functions, which are piecewise linear functions. Each basis function is a simple linear function defined over a particular range of a predictor variable.

In MARS, basis functions come in mirrored pairs of hinge functions, described as:

h(x) = \max(0, x - t) \;\;\;\; and \;\;\;\; h(x) = \max(0, t - x)

Where x is a predictor variable and t is the knot location: the first function is zero for x \leq t and grows linearly beyond the knot, while the second is its mirror image.

Knots: The points where the piecewise linear basis functions connect are known as knots. MARS automatically chooses and positions knots based on the distribution of the data and the need to capture non-linearities.

MARS starts with a model containing only an intercept and then uses forward stepwise selection to add basis functions. At each step, the algorithm adds the pair of hinge functions that reduces the residual sum of squares the most, and it keeps adding terms until the model reaches a specified level of complexity. This makes MARS particularly useful for modeling complex, non-linear relationships in data.
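
A minimal sketch of the hinge idea, assuming a single predictor and a hypothetical knot at t = 30: each basis function is zero on one side of the knot and linear on the other, and a MARS model is a weighted sum of such terms plus an intercept (the coefficients below are made up for illustration).

Python3

import numpy as np

def hinge_pos(x, t):
    # (x - t)+ : zero until the knot, then increases linearly
    return np.maximum(0.0, x - t)

def hinge_neg(x, t):
    # (t - x)+ : decreases linearly until the knot, then zero
    return np.maximum(0.0, t - x)

x = np.array([10, 20, 30, 40, 50], dtype=float)
t = 30.0  # hypothetical knot position

# A toy piecewise-linear model: intercept plus one mirrored pair of hinges
y_hat = 5.0 + 0.8 * hinge_pos(x, t) - 0.3 * hinge_neg(x, t)
print(y_hat)  # [-1.  2.  5. 13. 21.]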

Implementation of Decision Tree Algorithms

Scikit-Learn, a powerful open-source library in Python, provides a simple and efficient way to implement decision tree algorithms.

Importing necessary libraries

We import the necessary libraries:

Python3

#importing libraries
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

  • DecisionTreeClassifier from sklearn.tree: This is the class that allows us to create classification decision tree models.
  • pandas as pd: Used for data manipulation.
  • train_test_split from sklearn.model_selection: Used to split the dataset into training and testing sets.
  • accuracy_score from sklearn.metrics: This is used to evaluate the classification model.

Dataset Loading and Splitting

We load the diabetes prediction dataset from a CSV file.

Python3

#loading dataset
data = pd.read_csv('diabetes.csv')
X = data.drop('Outcome', axis=1)
y = data['Outcome']
print(data.head())


Output:

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6
1            1       85             66             29        0  26.6
2            8      183             64              0        0  23.3
3            1       89             66             23       94  28.1
4            0      137             40             35      168  43.1

   DiabetesPedigreeFunction  Age  Outcome
0                     0.627   50        1
1                     0.351   31        0
2                     0.672   32        1
3                     0.167   21        0
4                     2.288   33        1

Splitting dataset

Python3

#splitting dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


The dataset is split into 8 feature columns (BMI, insulin level, age, etc.) and the target variable (Outcome, which indicates whether the patient has diabetes). We then split the data into training and testing sets using train_test_split, with 80% of the data used for training and 20% for testing; random_state ensures reproducibility.

Model Training

Python3

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)


Next, we train the model. A DecisionTreeClassifier object clf is created, and the fit method trains the classifier on the training data (X_train and y_train).

Python3

predictions = clf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy * 100:.2f}")


Output:

Accuracy: 74.68

We make predictions on the test data using the trained model and calculate the accuracy score to evaluate the model’s performance. The model is used to predict the labels for the test set (X_test) using the predict method. The accuracy of the model is then calculated by comparing the predicted labels with the actual labels (y_test) using the accuracy_score function.
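
Because a fitted decision tree is just a collection of if/else rules, it can also be inspected directly. The snippet below, which reuses clf and X from the earlier snippets, prints the learned rules with scikit-learn's export_text utility; limiting max_depth keeps the output readable.

Python3

from sklearn.tree import export_text

# Print the learned decision rules; limiting the depth keeps the output readable
rules = export_text(clf, feature_names=list(X.columns), max_depth=2)
print(rules)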

Use Cases and Importance of Decision Tree Algorithms

Decision tree algorithms are widely used in a variety of machine learning applications, including:

  • Fraud detection: Decision tree algorithms can be used to identify fraudulent transactions and other types of anomalous behavior.
  • Risk assessment: Decision tree algorithms can be used to assess the risk of different events, such as loan defaults or customer churn.
  • Medical diagnosis: Decision tree algorithms can be used to help doctors diagnose diseases and other medical conditions.
  • Marketing: Decision tree algorithms can be used to segment customers and target them with personalized marketing campaigns.

Decision tree algorithms are important because they are relatively simple to understand and interpret. They are also very versatile and can be used for a wide range of machine learning tasks.

Advantages of Decision Tree Algorithms

  • Easy to understand and interpret: Decision trees can be easily understood and interpreted by humans, even those without a machine learning background. This makes them a good choice for applications where it is important to be able to explain the model’s predictions.
  • Versatile: Decision tree algorithms can be used for a wide range of machine learning tasks, including classification, regression, and anomaly detection.
  • Robust to noise: Decision tree algorithms are relatively robust to noise in the data. This is because they make predictions based on the overall trend of the data, rather than individual data points.

Limitations and Considerations

  • Overfitting: Decision trees can be prone to overfitting, capturing noise in the data. Techniques like pruning, setting a minimum number of samples per leaf, or limiting the tree depth can mitigate this issue, as shown in the sketch after this list.
  • Bias: Split criteria based on information gain can be biased towards features with many levels. Using algorithms like C4.5, which rely on the gain ratio, or other adjusted split criteria can address this bias.
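
The overfitting controls mentioned above map directly onto scikit-learn parameters. The sketch below reuses X_train and y_train from the implementation section; the specific values are illustrative, not tuned.

Python3

from sklearn.tree import DecisionTreeClassifier

# Illustrative settings: limit depth, require a minimum leaf size,
# and apply cost-complexity pruning via ccp_alpha
pruned_clf = DecisionTreeClassifier(
    max_depth=4,          # cap the depth of the tree
    min_samples_leaf=10,  # each leaf must cover at least 10 samples
    ccp_alpha=0.01,       # cost-complexity pruning strength
    random_state=42,
)
pruned_clf.fit(X_train, y_train)
print(pruned_clf.get_depth(), pruned_clf.get_n_leaves())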

Conclusion

Decision tree algorithms, with their intuitive nature and interpretability, are invaluable tools in machine learning. They are powerful and versatile, can be applied to a wide range of tasks, and are relatively simple to understand and interpret. Their main weakness is a tendency to overfit, which can be mitigated with techniques such as pruning and regularization.


