Random Forest Approach for Regression in R Programming

Random forest is a supervised learning algorithm. It builds multiple decision trees, collectively known as a forest, and combines them to get a more accurate and stable prediction. The approach is similar to the ensemble technique called bagging: multiple trees are grown on bootstrap samples of the training data, and the correlation between the trees is reduced by considering only a random subset of the predictors at each split. This improves on the performance of a single decision tree and helps avoid overfitting. In this article, let's learn to use the random forest approach for regression in R programming.

Features of Random Forest

  • Aggregates many decision trees: A random forest is a collection of decision trees, so it does not rely on a single tree or a single feature. It combines the predictions of all the trees (for regression, by averaging them).
  • Prevents overfitting: Each tree is trained on a random sample of the data, and this added randomness usually gives the forest better accuracy than a single decision tree.
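
To make the aggregation idea concrete, here is a minimal sketch of plain bagging done by hand. It assumes the rpart package (a recommended package that ships with most R installations) and the built-in airquality dataset; the number of trees and the seed are arbitrary illustrative choices. Note that this is only bagging: unlike a real random forest, it does not sample a random subset of predictors at each split.

```r
library(rpart)  # assumption: rpart is installed (it ships with most R distributions)

set.seed(42)                      # arbitrary seed, for reproducibility only
aq <- na.omit(airquality)         # keep complete rows only
n_trees <- 25                     # illustrative choice, not a recommendation

# Fit one regression tree per bootstrap sample and predict on every row
preds <- sapply(seq_len(n_trees), function(i) {
  idx  <- sample(nrow(aq), replace = TRUE)   # bootstrap sample of row indices
  tree <- rpart(Ozone ~ ., data = aq[idx, ])
  predict(tree, newdata = aq)
})

# Aggregate: the bagged prediction is the average over all trees
bagged <- rowMeans(preds)
print(head(bagged))
```

Each column of preds holds the predictions of one tree, so averaging across columns is the "combine the predictions" step described above.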

Advantages of Random Forest

  • Efficient: Random forests run efficiently on large datasets.
  • Highly accurate: Because each tree is trained on a different random sample and the predictions are combined, a random forest usually predicts more accurately than a single decision tree.
  • Efficient estimates of the test error: The observations left out of each bootstrap sample (the out-of-bag observations) provide an estimate of the test error without a separate validation set. The forest also makes use of all predictive features and can maintain accuracy when some data is missing, provided the missing values are handled (for example, omitted or imputed).
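
The test-error estimate mentioned above is usually read from the out-of-bag (OOB) observations, that is, the rows a given tree did not see in its bootstrap sample. The sketch below assumes the randomForest package is installed and uses the built-in airquality dataset with incomplete rows dropped; the seed is arbitrary. For a regression forest, the fitted object stores OOB predictions in `predicted` and the OOB mean squared error after each tree in `mse`.

```r
library(randomForest)  # assumption: the package is already installed

set.seed(1)                        # arbitrary seed, for reproducibility only
aq <- na.omit(airquality)          # drop rows with missing values
rf <- randomForest(Ozone ~ ., data = aq)

# OOB predictions: each row is predicted only by trees that did not train on it
oob_pred <- rf$predicted

# Mean squared error computed from the OOB predictions
oob_mse <- mean((aq$Ozone - oob_pred)^2)
print(oob_mse)

# OOB error reported by the forest after the last tree, for comparison
print(tail(rf$mse, 1))
```

No separate test set is needed here, which is the sense in which the estimate is efficient.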

Disadvantages of Random Forest

  • Sensitive to the number of levels: When the data contains categorical variables with different numbers of levels, a random forest tends to be biased in favor of the variables with more levels.
  • Requires a lot of memory: Training a large set of trees may require more memory or parallelized computation.

Implementation of Random Forest Approach for Regression in R

The randomForest package in R is used to create random forests. The forest it builds is a collection of decision trees. The function randomForest() creates the forest and returns an object that can be printed, plotted, and analyzed.

Syntax:
randomForest(formula, data)

Parameters:
formula: represents formula describing the model to be fitted
data: represents data frame containing the variables in the model

To learn about the other optional parameters, use the command help("randomForest").
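
As a quick illustration of the syntax with a few optional parameters, the sketch below fits a regression forest on the built-in mtcars dataset. The dataset, the seed, and the parameter values are illustrative assumptions; ntree (number of trees), mtry (number of variables tried at each split), and importance (whether to compute variable importance) are documented arguments of randomForest().

```r
library(randomForest)  # assumption: the package is already installed

set.seed(123)  # arbitrary seed, for reproducibility only

# formula: mpg is the response, all other columns are predictors
# data:    the data frame that holds those columns
fit <- randomForest(mpg ~ ., data = mtcars,
                    ntree = 200,        # grow 200 trees instead of the default 500
                    mtry = 3,           # try 3 variables at each split
                    importance = TRUE)  # also compute variable importance

print(fit)
```

Because the response mpg is numeric, randomForest() fits a regression forest; a factor response would produce a classification forest instead.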



Example:

  • Step 1: Installing the required packages.

    # Install the required package for function
    install.packages("randomForest")


  • Step 2: Loading the required package.

    # Load the library
    library(randomForest)


  • Step 3: In this example, let's use the airquality dataset that ships with R. Print the first few rows of the dataset.

    # Print the first six rows of the dataset
    print(head(airquality))


    Output:

       Ozone Solar.R Wind Temp Month Day
    1    41     190  7.4   67     5   1
    2    36     118  8.0   72     5   2
    3    12     149 12.6   74     5   3
    4    18     313 11.5   62     5   4
    5    NA      NA 14.3   56     5   5
    6    28      NA 14.9   66     5   6
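
The output above already shows NA entries in the Ozone and Solar.R columns. Since randomForest() does not fit on rows with missing values unless told how to handle them, it can help to count them first. The following is a small sketch using base R only; the variable names are illustrative.

```r
# Count missing values in each column of the built-in airquality dataset
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Number of rows with no missing values at all;
# these are the rows that remain when na.omit is applied
n_complete <- sum(complete.cases(airquality))
print(n_complete)
print(nrow(airquality))
```

Comparing the last two numbers shows how many rows are dropped when missing values are simply omitted.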
    
  • Step 4: Create random forest for regression

    # Create random forest for regression
    ozone.rf <- randomForest(Ozone ~ ., data = airquality, mtry = 3,
                             importance = TRUE, na.action = na.omit)
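
The call above fixes mtry = 3, the number of variables tried at each split. One rough way to check that choice is to fit a forest for every possible mtry and compare the out-of-bag error. The sketch below does this with a plain loop; the seed and the variable names are illustrative assumptions, and the exact numbers will vary from run to run with a different seed.

```r
library(randomForest)  # assumption: the package is already installed

aq <- na.omit(airquality)   # same effect as na.action = na.omit
p  <- ncol(aq) - 1          # number of predictors (all columns except Ozone)

# Fit one forest per mtry value and keep the final out-of-bag MSE
oob_by_mtry <- sapply(seq_len(p), function(m) {
  set.seed(7)               # same arbitrary seed for each fit
  fit <- randomForest(Ozone ~ ., data = aq, mtry = m)
  tail(fit$mse, 1)          # OOB mean squared error after the last tree
})
names(oob_by_mtry) <- paste0("mtry=", seq_len(p))
print(round(oob_by_mtry, 1))
```

Smaller values of mtry make the trees less correlated, while larger values make each individual tree stronger, so the best setting depends on the data.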


  • Step 5: Print the regression model

    # Print regression model
    print(ozone.rf)


    Output:

    Call:
     randomForest(formula = Ozone ~ ., data = airquality, mtry = 3,      importance = TRUE, na.action = na.omit) 
                   Type of random forest: regression
                         Number of trees: 500
    No. of variables tried at each split: 3
    
              Mean of squared residuals: 296.4822
                        % Var explained: 72.98
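
Because the model in Step 4 was fitted with importance = TRUE, the fitted object also carries variable importance scores, which the printed summary does not show. The sketch below refits the same model so that it runs on its own (the seed is an arbitrary assumption, so the numbers may differ slightly from the output above) and lists the predictors from most to least important.

```r
library(randomForest)  # assumption: the package is already installed

set.seed(10)  # arbitrary seed, for reproducibility only
ozone.rf <- randomForest(Ozone ~ ., data = airquality, mtry = 3,
                         importance = TRUE, na.action = na.omit)

# One row per predictor; for regression the columns are
# %IncMSE (permutation-based) and IncNodePurity (based on node splits)
imp <- importance(ozone.rf)

# Sort predictors by the permutation-based measure
print(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ])
```

The same scores can be drawn with varImpPlot(ozone.rf).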
    
  • Step 6: Plot the error against the number of trees

    # Open a PNG device so the plot is saved to a file
    png(file = "randomForestRegression.png")
      
    # Plot the error vs the number of trees graph
    plot(ozone.rf)
      
    # Close the device to finish writing the file
    dev.off()


    Output:

    graph-output
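
For a regression forest, the curve drawn by plot() is the out-of-bag mean squared error after each tree, which the fitted object stores in its mse component. The sketch below refits the model so that it runs on its own (the seed is an arbitrary assumption) and reads the same values directly, which can be handy for checking whether the error has levelled off.

```r
library(randomForest)  # assumption: the package is already installed

set.seed(10)  # arbitrary seed, for reproducibility only
ozone.rf <- randomForest(Ozone ~ ., data = airquality, mtry = 3,
                         importance = TRUE, na.action = na.omit)

# One OOB error value per tree count, i.e. the y-values of the plotted curve
err <- ozone.rf$mse
print(length(err))      # equals the number of trees in the forest

# Tree count with the lowest OOB error, and that error
print(which.min(err))
print(min(err))
```

If the error is still clearly falling at the right edge of the curve, growing more trees through the ntree argument may help.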



