
The Effects of the Depth and Number of Trees in a Random Forest

Last Updated : 03 Apr, 2024

Random forests, powerful ensembles of decision trees, benefit from tuning key parameters like tree depth and number of trees for optimal prediction and data modeling.

In this article, we will be discussing the effects of the depth and the number of trees in a random forest model.

Random Forest

Random forests are powerful machine learning algorithms known for their accuracy and versatility. They work by combining multiple decision trees, creating a more robust model than any single tree. Two key parameters influence a random forest’s performance: the number of trees (n_estimators) and the maximum depth of those trees (max_depth). Let’s look at how each affects the model.

Understanding the Impact of Depth and Number of Trees in Random Forests

  • Number of Trees (n_estimators): More trees generally lead to better accuracy. Each tree is trained on a different bootstrap sample with a different random subset of features, so averaging their predictions reduces variance and gives a more robust model. However, there is a point of diminishing returns: beyond it the improvement becomes negligible while the computational cost keeps growing.
  • Tree Depth (max_depth): Deeper trees can capture more complex relationships in the data. But excessively deep trees can lead to overfitting, where the model memorizes the training data instead of learning general patterns.
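
The variance-reduction argument above can be illustrated with a small simulation. This is only a sketch under a simplifying assumption that does not hold for real forests: each "tree" is modelled as the true value plus independent Gaussian noise, whereas the trees in a random forest are correlated, so the real benefit levels off sooner. The function name variance_of_average is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_of_average(n_trees, n_trials=2000, noise_std=1.0):
    """Variance of the mean of n_trees independent noisy predictions."""
    # Each row is one trial; each column is one simulated tree whose
    # prediction is the true value (0.0) plus independent noise.
    preds = rng.normal(loc=0.0, scale=noise_std, size=(n_trials, n_trees))
    return preds.mean(axis=1).var()

variances = {n: variance_of_average(n) for n in (1, 10, 100)}
for n, v in variances.items():
    print(f"{n:>3} trees: variance of averaged prediction = {v:.4f}")
```

Under this idealized assumption the variance shrinks roughly in proportion to 1/n, which is why the first few dozen trees help far more than the next few hundred.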

Let’s discuss some of the ways in which the max_depth parameter may affect our model:

  • Overfitting and Underfitting: If the trees are too shallow, the model may underfit: it cannot capture the underlying patterns in the data and performs poorly on both training and unseen data. If ‘max_depth’ is too high, the model may overfit: it captures noise in the training data and generalizes poorly to test data.
  • Increase in Model Complexity: The max_depth parameter controls model complexity. As the depth increases the model becomes more complex, which in some cases leads to overfitting. Very deep trees are also harder to interpret.
  • Increase in Computational Complexity: As the depth increases, each tree contains more nodes, so training and prediction need more processing time and memory.
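
The three effects above can be observed on a small experiment. The sketch below assumes synthetic, noisy data from scikit-learn's make_classification (not the wine quality dataset used later) and uses the average node count per tree as a rough proxy for model size and cost. The depth values and the results dictionary are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy data (an assumption; not the wine quality dataset)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for depth in (2, 8, None):  # None lets every tree grow fully
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_tr, y_tr)
    # Average number of nodes per tree: a rough proxy for size and cost
    avg_nodes = sum(t.tree_.node_count for t in rf.estimators_) / len(rf.estimators_)
    results[depth] = {
        "train": rf.score(X_tr, y_tr),
        "test": rf.score(X_te, y_te),
        "avg_nodes": avg_nodes,
    }
    print(f"max_depth={depth}: train={results[depth]['train']:.3f} "
          f"test={results[depth]['test']:.3f} avg_nodes={avg_nodes:.0f}")
```

A large gap between training and test accuracy is the usual sign of overfitting, while low accuracy on both suggests underfitting.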

These are the three most important ways in which tree depth can affect a random forest’s performance. Selecting the right value for the ‘max_depth’ parameter is therefore an essential task in any project, so that the model neither underfits nor overfits the data. Let’s get into a code example to understand how the depth of the trees affects the performance of the model.

Implementation: Effect of Depth in a Random Forest

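The depth of each tree in the forest is limited by the max_depth parameter, which sets the maximum allowed length of the path from the root node to any leaf node. In scikit-learn the default is None, which lets each tree grow until its leaves are pure or contain fewer than min_samples_split samples. The value of ‘max_depth’ must be chosen carefully, since it can change how the model performs.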
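
To see that max_depth acts as an upper bound on each individual tree, the fitted trees can be inspected directly. This is a minimal sketch on synthetic data (an assumption made only to keep the example self-contained). It relies on the estimators_ attribute of a fitted forest and the get_depth() method of each decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, used only for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
forest.fit(X, y)

# Each entry of estimators_ is a fitted DecisionTreeClassifier
depths = [tree.get_depth() for tree in forest.estimators_]
print("deepest tree:", max(depths), "| shallowest tree:", min(depths))
```

Individual trees may stop earlier than the limit, but none can exceed it.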

As an example, we will use the wine quality dataset, in which the quality of a wine is predicted from features such as ‘fixed acidity’, ‘volatile acidity’, ‘citric acid’, ‘residual sugar’, and several others.

We will train two random forest models with different depths on the same data and measure the accuracy score of each.

Let’s go through the step-by-step procedure now:

Importing Necessary Libraries

Python3
# Importing Necessary Libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


Reading and Preprocessing the Data

Python3
# Reading the Wine Quality csv file; index_col=0 makes the first column
# ('type') the index, so it is not used as a feature
df = pd.read_csv('winequalityN.csv', index_col=0)

# Dropping the null value rows
df = df.dropna()

# Printing the first five rows of the dataset
df.head()

Output:

       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
type
white            7.0              0.27         0.36            20.7      0.045                 45.0                 170.0   1.0010  3.00       0.45      8.8        6
white            6.3              0.30         0.34             1.6      0.049                 14.0                 132.0   0.9940  3.30       0.49      9.5        6
white            8.1              0.28         0.40             6.9      0.050                 30.0                  97.0   0.9951  3.26       0.44     10.1        6
white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6
white            7.2              0.23         0.32             8.5      0.058                 47.0                 186.0   0.9956  3.19       0.40      9.9        6

Splitting the Data into Training and Test Datasets

Python3
# Splitting the Data into Feature and Target Variables
X = df.drop('quality', axis=1)
y = df['quality']

# Creating training and test dataset
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

Creating Two different Models to Demonstrate the Effect of Depth

After splitting the dataset, we will be creating two different random forest models with ‘max_depth’ of 5 and 15 respectively and then fitting the training data to both the models.

Fitting the model

Python3
# Creating the First Random Forest Model with max_depth of 5
model_1 = RandomForestClassifier(max_depth=5, random_state=42)

# Creating the Second Random Forest Model with max_depth of 15
model_2 = RandomForestClassifier(max_depth=15, random_state=42)

# Fitting the training data to the first model
model_1.fit(x_train, y_train)

# Fitting the training data to the second model
model_2.fit(x_train, y_train)

Predicting Outcomes With Both the Models and Printing Accuracies

Finally, we can predict the test dataset with the trained models and compare their accuracies to see how the different depth values affect prediction quality.

For this we call the ‘.predict()’ method of both models on the test data and then use ‘accuracy_score’ to measure how well each model performed.

Python3
# Predicting the wine quality for the test data
model_1_pred = model_1.predict(x_test)
model_2_pred = model_2.predict(x_test)

# Checking how well both models perform (true labels first, then predictions)
model_1_acc = accuracy_score(y_test, model_1_pred)
model_2_acc = accuracy_score(y_test, model_2_pred)

# Printing the accuracies of both the models
print(f'Accuracy of First model: {model_1_acc}\nAccuracy of second model: {model_2_acc}')

Output:

Accuracy of First model: 0.5699922660479505
Accuracy of second model: 0.6890951276102089

Here we can see that increasing ‘max_depth’ from 5 to 15 raises the test accuracy from about 0.57 to about 0.69, which suggests that the shallower model was underfitting this dataset.

This does not mean that deeper is always better: beyond some depth the gain flattens or reverses as the trees start fitting noise. Methods such as hyperparameter tuning can be used to find the value of the parameter that performs best on held-out data.
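
The hyperparameter tuning mentioned above could look like the following sketch, which uses scikit-learn's GridSearchCV with 5-fold cross-validation. The synthetic data and the candidate depth values are assumptions chosen for illustration. With the wine data you would pass x_train and y_train instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data (an assumption)
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           flip_y=0.1, random_state=0)

# Candidate depths to compare; None means "no depth limit"
param_grid = {"max_depth": [2, 5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# Mean cross-validated accuracy for every candidate depth
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(f"max_depth={params['max_depth']}: mean CV accuracy = {score:.3f}")
print("best max_depth:", search.best_params_["max_depth"])
```

Because the choice is made by cross-validation on the training data, the test set stays untouched for the final evaluation.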

Implementation: Effect of Number of Trees in Random Forests

The number of trees in a random forest is the total number of decision trees whose predictions the forest combines. Let’s compare in code how increasing the number of trees affects the predictive ability of the random forest.

For this example we will reuse the dataset and the train/test split from above. The following steps check how the number of trees affects the model.

Creating Models With Different Number of Trees

Here we create two new models named ‘model_3’ and ‘model_4’, since ‘model_1’ and ‘model_2’ were created previously.

The third model contains 5 trees while the fourth contains 500. Both keep the default max_depth of None, so their trees are grown fully. We then fit the training data to the third and the fourth model.

Python3
# Creating two different models with different number of trees
model_3 = RandomForestClassifier(n_estimators=5, random_state=42)
model_4 = RandomForestClassifier(n_estimators=500, random_state=42)

# Training the models with the training data
model_3.fit(x_train, y_train)
model_4.fit(x_train, y_train)

Predicting Outcomes

Python3
# Predicting the test data with the help of trained models
model_3_pred = model_3.predict(x_test)
model_4_pred = model_4.predict(x_test)

# Measuring the accuracy score of the third and the fourth model
model_3_acc = accuracy_score(y_test, model_3_pred)
model_4_acc = accuracy_score(y_test, model_4_pred)

print(f'Accuracy of Third model: {model_3_acc}\nAccuracy of Fourth model: {model_4_acc}')

Output:

Accuracy of Third model: 0.6241299303944315
Accuracy of Fourth model: 0.7045630317092034

Here we can see that increasing the number of trees from 5 to 500 raises the test accuracy from about 0.62 to about 0.70, a gain of roughly 8 percentage points.
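
Comparing only two settings does not show where the returns start to diminish. One way to explore this, sketched below on synthetic data (an assumption; the tree counts are arbitrary), is to grow a single forest step by step with warm_start=True, which keeps the trees already fitted and only adds new ones on each call to fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data (an assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# warm_start=True reuses the trees from the previous fit and adds more
rf = RandomForestClassifier(warm_start=True, random_state=0)

scores = {}
for n in (5, 25, 50, 100, 200):
    rf.set_params(n_estimators=n)
    rf.fit(X_tr, y_tr)
    scores[n] = rf.score(X_te, y_te)
    print(f"{n:>3} trees: test accuracy = {scores[n]:.3f}")
```

Plotting or printing the scores against the number of trees typically shows a curve that rises quickly and then flattens, which helps pick a value that balances accuracy and training time.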

Conclusion

To conclude, the ‘max_depth’ parameter, which limits the depth of each tree in the random forest, can cause underfitting if set too low and overfitting if set too high, and larger values also increase the computational cost. Chosen well, it can noticeably improve the model.

Increasing the number of trees, in contrast, generally improves accuracy by reducing variance and does not by itself increase the risk of overfitting. The gains diminish as more trees are added, however, while training time, prediction time, and memory use keep growing.


