Random Forest Regression in Python

Random Forest Regression is a versatile machine-learning technique for predicting numerical values. It combines the predictions of multiple decision trees to reduce overfitting and improve accuracy. Python’s machine-learning libraries make it easy to implement and optimize this approach.

Ensemble Learning

Ensemble learning is a machine learning technique that combines the predictions from multiple models to create a more accurate and stable prediction. It is an approach that leverages the collective intelligence of multiple models to improve the overall performance of the learning system.

Types of Ensemble Methods

There are various types of ensemble learning methods, including:

  1. Bagging (Bootstrap Aggregating): This method trains multiple models on random subsets of the training data and combines their predictions, typically by averaging (a short bagging sketch follows this list).
  2. Boosting: This method involves training a sequence of models, where each subsequent model focuses on the errors made by the previous model. The predictions are combined using a weighted voting scheme.
  3. Stacking: This method involves using the predictions from one set of models as input features for another model. The final prediction is made by the second-level model.
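To make the bagging idea concrete, here is a minimal sketch using scikit-learn's BaggingRegressor on a synthetic dataset (the data and parameter values are illustrative assumptions, not part of this article's example):

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Toy regression data, used here only to illustrate bagging
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Each base model (a decision tree by default) is trained on a bootstrap
# sample of the rows; predict() averages the individual models' outputs
bagging = BaggingRegressor(n_estimators=50, random_state=0)
bagging.fit(X_demo, y_demo)
print(bagging.predict(X_demo[:3]))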

Random Forest

A random forest is an ensemble learning method that combines the predictions from multiple decision trees to produce a more accurate and stable prediction. It is a type of supervised learning algorithm that can be used for both classification and regression tasks.

Every individual decision tree has high variance, but when we combine them in parallel the resulting variance is low: each tree is trained on a different sample of the data, so the output depends on many trees rather than any single one. In the case of a classification problem, the final output is chosen by majority voting. In the case of a regression problem, the final output is the mean of all the trees' outputs. This part is called Aggregation.
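As a minimal illustration of the aggregation step, with made-up outputs from three trees:

import numpy as np

# Hypothetical predictions from three individual trees for one sample
tree_preds = np.array([105000.0, 98000.0, 110000.0])

# Regression aggregation: the forest's final output is the mean of the tree outputs
print(tree_preds.mean())  # 104333.33...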

[Figure: Random Forest Regression Model Working]

What is Random Forest Regression?

Random Forest Regression in machine learning is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap and Aggregation, commonly known as bagging. The basic idea behind this is to combine multiple decision trees in determining the final output rather than relying on individual decision trees. 

Random Forest uses multiple decision trees as base learning models. We randomly perform row sampling and feature sampling from the dataset, forming a different sample dataset for every model. This part is called Bootstrap.
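A minimal sketch of the bootstrap row-sampling step, assuming a 10-row dataset:

import numpy as np

rng = np.random.default_rng(0)
n_rows = 10

# Sample row indices with replacement; each tree gets its own bootstrap sample
bootstrap_idx = rng.choice(n_rows, size=n_rows, replace=True)
print(bootstrap_idx)

# Rows never drawn are "out of bag" for this tree (used later for the OOB score)
print(np.setdiff1d(np.arange(n_rows), bootstrap_idx))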

We approach the Random Forest regression technique like any other machine learning technique: prepare the data, fit the model, evaluate it, and visualize the results.

Random Forest Regression in Python

We will use this sampling technique in the example below, a step-by-step implementation of Random Forest Regression on a dataset that can be downloaded here: https://bit.ly/417n3N5

Python's libraries make it easy to handle the data and to perform both routine and complex tasks with a single line of code.

Step-1: Import Libraries

Here we import all the libraries required.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
 
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
 
warnings.filterwarnings('ignore')

Step-2: Import Dataset

Now let’s load the dataset into a pandas DataFrame, for better data handling and to leverage its handy functions that perform complex tasks in one go.

df = pd.read_csv('Salaries.csv')
print(df)

Output:

            Position  Level   Salary
0   Business Analyst      1    45000
1  Junior Consultant      2    50000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000


Here the .info() method provides a quick overview of the structure, data types, and memory usage of the dataset.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Position  10 non-null     object
 1   Level     10 non-null     int64 
 2   Salary    10 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 372.0+ bytes

Step-3: Data Preparation

Here the code extracts two subsets of the dataset and stores them in separate variables: the feature matrix X (the Level column) and the target vector y (the Salary column).

# Extract the feature matrix and target vector from the DataFrame
X = df.iloc[:, 1:2].values  # Features: the Level column (kept 2-D for scikit-learn)
y = df.iloc[:, 2].values    # Target: the Salary column

Step-4: Random Forest Regressor Model

The code encodes the categorical Position column numerically (shown for completeness) and then trains a Random Forest Regression model on the Level feature and Salary target prepared in Step-3; note that the target column itself must never be included among the features.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Check for and handle categorical variables; the Position column is encoded
# here for completeness, but the model below is fitted on the Level feature
# only, so the Salary target never leaks into the features
label_encoder = LabelEncoder()
x_categorical = df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)

# Fitting Random Forest Regression to the dataset
regressor = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)

# Fit the regressor with the X (Level) and y (Salary) prepared in Step-3
regressor.fit(X, y)

Step-5: Make Predictions and Evaluate

The code evaluates the trained Random Forest Regression model:

Out-of-Bag (OOB) Score in Random Forest

The out-of-bag (OOB) score is a validation technique used mainly with bagging algorithms. Each tree is trained on a bootstrap sample of the data, so some rows are left out of that tree's sample; these out-of-bag rows act as a built-in validation set, and each row is scored using only the trees that never saw it during training.

The main advantage of the OOB score is that this validation data is never seen by the trees being evaluated, so it gives an honest estimate of how the bagging algorithm performs on unseen data.

To get the OOB score of a Random Forest, set the oob_score parameter to True when constructing the model (as done in Step-4 above).

# Evaluating the model
from sklearn.metrics import mean_squared_error, r2_score
 
# Access the OOB Score
oob_score = regressor.oob_score_
print(f'Out-of-Bag Score: {oob_score}')
 
# Making predictions on the training data (or on new data)
predictions = regressor.predict(X)
 
# Evaluating the model
mse = mean_squared_error(y, predictions)
print(f'Mean Squared Error: {mse}')
 
r2 = r2_score(y, predictions)
print(f'R-squared: {r2}')

Output:

Out-of-Bag Score: 0.644879832593859
Mean Squared Error: 2647325000.0
R-squared: 0.9671801245316117
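For a regressor, oob_score_ is the R² of the out-of-bag predictions against the training targets. Those per-sample OOB predictions are also exposed, so the relationship can be checked directly (assuming every row was out of bag for at least one tree, which holds for the run above):

# The per-sample OOB predictions that underlie the OOB score
print(regressor.oob_prediction_)

# For a regressor, oob_score_ is the R² of these predictions against y
print(r2_score(y, regressor.oob_prediction_))  # matches regressor.oob_score_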

Step-6: Visualization

Now let’s visualize the results obtained by using the Random Forest Regression model on our salaries dataset.

import numpy as np

# Build a dense grid over the Level range for a smooth prediction curve
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(len(X_grid), 1)

plt.scatter(X, y, color='blue')  # Plot the real data points
plt.plot(X_grid, regressor.predict(X_grid), color='green')  # Plot the predicted values

plt.title("Random Forest Regression Results")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

Output:

[Plot: "Random Forest Regression Results", actual salaries (blue points) and the forest's predicted curve (green) against Position level]

Step-7: Visualizing a Single Decision Tree from the Random Forest Model

The code visualizes one of the decision trees from the trained Random Forest model, plotting the selected tree to display the decision-making process of a single tree within the ensemble.

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
 
# Assuming regressor is your trained Random Forest model
# Pick one tree from the forest, e.g., the first tree (index 0)
tree_to_plot = regressor.estimators_[0]
 
# Plot the decision tree
plt.figure(figsize=(20, 10))
plot_tree(tree_to_plot, feature_names=['Level'], filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree from Random Forest")
plt.show()

Output:

[Plot: "Decision Tree from Random Forest", the first tree's split thresholds on Level with the predicted salary at each leaf]

Applications of Random Forest Regression

Random Forest Regression is applied to a wide range of real-world problems, including:

  • Predicting continuous values such as house prices and salaries.
  • Forecasting sales and demand.
  • Estimating customer churn and other business metrics.
  • Modeling noisy numerical measurements such as sensor readings.

Advantages of Random Forest Regression

  • It captures non-linear relationships between features and the target well.
  • Averaging many trees reduces overfitting compared to a single decision tree.
  • It is robust to outliers and noise in the data.
  • It provides feature importance estimates and a built-in out-of-bag performance estimate.

Disadvantages of Random Forest Regression

  • The ensemble is less interpretable than a single decision tree.
  • Training and prediction become more expensive as the number of trees grows.
  • It cannot extrapolate beyond the range of target values seen during training.

Conclusion

Random Forest Regression has become a powerful tool for continuous prediction tasks, with advantages over traditional decision trees. Its capability to handle high-dimensional data, capture complex relationships, and reduce overfitting has made it a popular choice for a variety of applications. Python’s scikit-learn library enables the implementation, optimization, and evaluation of Random Forest Regression models, making it an accessible and effective technique for machine learning practitioners.

Frequently Asked Questions (FAQs)

1. What is Random Forest Regression in Python?

Random Forest Regression in Python is an ensemble learning method that uses multiple decision trees to make predictions. It is a powerful and versatile algorithm that is well suited for regression tasks.

2. What is the use of random forest regression?

Random Forest Regression can be used to predict a variety of target variables, including prices, sales, customer churn, and more. It is a robust algorithm that is not easily overfitted, making it a good choice for real-world applications.

3. What is the difference between random forest and regression?

Regression refers to the general task of predicting a continuous value, often with a single model such as linear regression. Random Forest is an ensemble learning method for that task: instead of relying on one model, it combines the predictions of many decision trees (see the sketch below).
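To make the contrast concrete, here is a minimal sketch comparing a single linear regression model with a forest on the X and y prepared earlier in the article:

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# A single model: one straight line fitted to the data
linear = LinearRegression().fit(X, y)

# An ensemble: the average of many decision trees
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Compare predictions for an intermediate position level of 6.5
print(linear.predict([[6.5]]), forest.predict([[6.5]]))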

4. How do you tune the hyperparameters of Random Forest Regression?

There are several methods for tuning the hyperparameters of Random Forest Regression (a grid-search sketch follows this list), such as:

  • Grid search: Grid search involves systematically trying different combinations of hyperparameter values to find the best combination.
  • Random search: Random search randomly samples different combinations of hyperparameter values to find a good combination.
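As a minimal sketch of grid search, run here on the X and y prepared earlier in the article (the candidate values below are illustrative assumptions, not recommendations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# A small, hypothetical grid of candidate values
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
}

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                              # 3-fold cross-validation
    scoring='neg_mean_squared_error',  # lower MSE is better
)
grid.fit(X, y)
print(grid.best_params_)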

5. Why is random forest better than regression?

Random Forest is often more accurate and robust than a single regression model when the data contains non-linear relationships, and its averaging makes it less prone to overfitting, so it tends to generalize well to new data. For simple linear relationships, however, a plain regression model can perform just as well and is easier to interpret.

