
Feature Engineering in R Programming


Feature engineering is the process of transforming raw data into features that can be used in a machine-learning model. In R programming, feature engineering can be done using a variety of built-in functions and packages.

One common approach to feature engineering is to use the dplyr package to manipulate and summarize data. This package provides functions such as “select()” to select specific columns from a data frame, “filter()” to filter rows based on certain criteria, and “group_by()” to group data by one or more variables and perform aggregate calculations.
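As a minimal sketch of this workflow (the sales data frame and its column names below are made up purely for illustration):

R

# Hypothetical sales data used only for illustration
library(dplyr)

sales = data.frame(store   = c("A", "A", "B", "B"),
                   product = c("tea", "coffee", "tea", "coffee"),
                   revenue = c(120, 200, 90, 310))

# select() keeps specific columns, filter() keeps matching rows,
# and group_by() + summarise() computes per-group aggregates
store_summary = sales %>%
  select(store, revenue) %>%
  filter(revenue > 100) %>%
  group_by(store) %>%
  summarise(total_revenue = sum(revenue))

store_summary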

Another popular package for feature engineering in R is the tidyr package, which provides functions to reshape and restructure data, such as gather() to convert wide-format data to long format and spread() to convert long-format data back to wide format.
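A small example of reshaping with these two functions (the scores data frame is invented for illustration; newer tidyr versions prefer pivot_longer()/pivot_wider(), but gather()/spread() still work):

R

# Hypothetical wide-format data: one score column per year
library(tidyr)

scores_wide = data.frame(student = c("X", "Y"),
                         y2021 = c(70, 80),
                         y2022 = c(75, 85))

# gather() reshapes wide data into long format (one row per student-year)
scores_long = gather(scores_wide, key = "year", value = "score", -student)

# spread() reverses the operation, turning long data back into wide format
scores_wide_again = spread(scores_long, key = "year", value = "score")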

Additionally, you can use base R functions such as aggregate(), apply(), and tapply() to perform calculations on groups of data, or as.factor() to convert variables into factors.
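A quick sketch of these base R helpers on a toy data frame (all names below are illustrative):

R

# Hypothetical grouped data used only for illustration
df = data.frame(group = c("a", "a", "b", "b"),
                value = c(1, 2, 3, 4))

# aggregate(): mean of value within each group
aggregate(value ~ group, data = df, FUN = mean)

# apply(): row sums of a small numeric matrix
m = matrix(1:6, nrow = 2)
apply(m, 1, sum)

# tapply(): mean of value for each level of group
tapply(df$value, df$group, mean)

# as.factor(): convert a character variable into a factor
df$group = as.factor(df$group)
str(df)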

You can also use packages such as caret, which provides various functions to preprocess data, including normalization, scaling, encoding, and feature selection.
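For example, caret's preProcess() can learn centering and scaling parameters and apply them with predict(); the numeric data frame below is assumed for illustration (the same pattern is used on the BigMart data later in this article):

R

# Hypothetical numeric features used only for illustration
library(caret)

num_data = data.frame(height = c(150, 160, 170, 180),
                      weight = c(55, 60, 72, 90))

# preProcess() learns the centering/scaling parameters from the data
prep = preProcess(num_data, method = c("center", "scale"))

# predict() applies them, giving zero-mean, unit-variance features
num_data_scaled = predict(prep, num_data)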

It’s important to note that feature engineering is an iterative process: it requires a good understanding of the data and of the problem you are trying to solve, and it is a crucial step in building a good machine-learning model.

Feature engineering is one of the most important techniques used in creating machine learning models. The term covers the many operations performed on variables (features) to make them fit the algorithm. It helps increase the accuracy of the model, thereby improving its predictions. Feature-engineered machine learning models perform better on data than basic machine learning models.

The main aspects of feature engineering are as follows:

  1. Feature Scaling: It is done to bring the features onto the same scale, so that distance-based calculations (e.g. Euclidean distance) treat them comparably.
  2. Feature Transformation: It is done to normalize the distribution of a feature by applying a function to it, such as a logarithm (scaling and transformation are both sketched in the snippet after this list).
  3. Feature Construction: It is done to create new features from the original descriptors in order to improve the accuracy of the predictive model.
  4. Feature Reduction: It is done to reduce the number of features, which can improve the statistical distribution of the data and the accuracy of the predictive model.
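A minimal sketch of the first two aspects using base R (the vector x is invented for illustration):

R

# Hypothetical skewed feature used only for illustration
x = c(2, 15, 80, 300, 1200)

# Feature Scaling: scale() centres the values and puts them on unit variance,
# so distance-based calculations treat all features comparably
x_scaled = scale(x)

# Feature Transformation: a log transform reduces the skew of the feature
x_log = log(x + 1)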

Theory

The Feature Construction method helps create new features in the data, thereby increasing model accuracy and overall prediction quality. It is of two types:

  1. Binning: Bins are created for continuous variables.
  2. Encoding: Numerical variables or features are formed from categorical variables.

Binning

Binning converts continuous variables into categorical ones by grouping their values into bins. There are two types of binning: Unsupervised and Supervised.

  • Unsupervised Binning includes Automatic and Manual binning. In Automatic Binning, the bins are created without human intervention; in Manual Binning, we specify where the bins should be created (a minimal manual-binning sketch with cut() follows after this list).
  • Supervised Binning creates bins for the continuous variable while also taking the target variable into consideration.
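The manual-binning sketch below uses base R's cut(); the price vector and break points are assumed purely for illustration:

R

# Hypothetical continuous variable (e.g. item prices) used only for illustration
price = c(35, 120, 55, 250, 90, 180)

# Manual binning: we choose the break points ourselves
price_bin = cut(price,
                breaks = c(0, 100, 200, Inf),
                labels = c("low", "medium", "high"))

table(price_bin)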

Encoding

Encoding is the process of creating numerical variables or features from categorical variables. It is widely used in industry and in almost every model-building process. It is of two types: Label Encoding and One-hot Encoding.

  • Label Encoding involves assigning each label a unique integer, usually based on alphabetical ordering. It is the most popular and widely used encoding.
  • One-hot Encoding involves creating an additional feature for every unique value of the categorical variable, i.e. each unique value in the category becomes a new indicator column (both encodings are sketched right after this list).
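A short sketch of both encodings on a made-up categorical variable (the BigMart code later in the article uses ifelse() and caret's dummyVars() for the same purpose):

R

# Hypothetical categorical feature used only for illustration
size = c("Small", "Medium", "High", "Small")

# Label Encoding: each level gets a unique integer (alphabetical order by default)
size_label = as.integer(factor(size))

# One-hot Encoding: model.matrix() creates one indicator column per level
size_onehot = model.matrix(~ size - 1, data = data.frame(size = size))
size_onehot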

Implementation in R

The Dataset

The BigMart dataset consists of 1559 products across 10 stores in different cities, with certain attributes defined for each product and store. It contains 12 features:

  • Item_Identifier: a unique product ID assigned to every distinct item.
  • Item_Weight: the weight of the product.
  • Item_Fat_Content: whether the product is low fat or not.
  • Item_Visibility: the percentage of the total display area of all products in a store allocated to this particular product.
  • Item_Type: the food category to which the item belongs.
  • Item_MRP: the Maximum Retail Price (list price) of the product.
  • Outlet_Identifier: a unique store ID, an alphanumeric string of length 6.
  • Outlet_Establishment_Year: the year in which the store was established.
  • Outlet_Size: the size of the store in terms of ground area covered.
  • Outlet_Location_Type: the size of the city in which the store is located.
  • Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
  • Item_Outlet_Sales: sales of the product in the particular store (the target variable).

R




# Loading the data.table package (provides fread())
library(data.table)

# Loading data
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")

# Structure of the training data
str(train)


 
 

Output:

 

Output

Performing Feature Engineering on the dataset

 

Here we apply the Feature Construction method to the dataset, which includes 12 features for 1559 products across 10 stores in different cities.

 

R




# Loading packages
library(data.table) # used for reading and manipulation of data
library(dplyr)      # used for data manipulation and joining
library(ggplot2)    # used for plotting
library(caret)      # used for modeling
library(e1071)      # used for removing skewness
library(corrplot)   # used for making correlation plot
library(xgboost)    # used for building XGBoost model
library(cowplot)    # used for combining multiple plots
 
# Importing datasets
train = fread("Train_UWu5bXk.csv")
test = fread("Test_u94Q5KV.csv")
 
# Structure of dataset
str(train)
 
# Combining datasets
# add Item_Outlet_Sales to test data so its columns match train
test[, Item_Outlet_Sales := NA]
combi = rbind(train, test)
   
# Missing Value Treatment
missing_index = which(is.na(combi$Item_Weight))
for(i in missing_index){
  item = combi$Item_Identifier[i]
  combi$Item_Weight[i] = mean(combi$Item_Weight
                         [combi$Item_Identifier == item], 
                         na.rm = T)
}
 
# Feature Engineering
# Feature Transformation
# Replacing 0 in Item_Visibility with mean
zero_index = which(combi$Item_Visibility == 0)
for(i in zero_index){
  item = combi$Item_Identifier[i]
  combi$Item_Visibility[i] = mean(
    combi$Item_Visibility[combi$Item_Identifier == item],
    na.rm = T
  )
}
 
# Feature Construction
# Create a new feature 'Item_Type_new'
perishable = c("Breads", "Breakfast", "Dairy",
               "Fruits and Vegetables", "Meat", "Seafood")
non_perishable = c("Baking Goods", "Canned", "Frozen Foods",
                   "Hard Drinks", "Health and Hygiene",
                   "Household", "Soft Drinks")
 
combi[,Item_Type_new := ifelse(Item_Type %in% perishable, "perishable",
                               ifelse(Item_Type %in% non_perishable,
                                      "non_perishable", "not_sure"))]
 
 
combi[,Item_category := substr(combi$Item_Identifier, 1, 2)]
 
combi$Item_Fat_Content[combi$Item_category == "NC"] = "Non-Edible"
 
# Years of operation of Outlets
combi[,Outlet_Years := 2013 - Outlet_Establishment_Year]
combi$Outlet_Establishment_Year = as.factor(combi$Outlet_Establishment_Year)
 
# Price per unit weight
combi[,price_per_unit_wt := Item_MRP/Item_Weight]
 
# Label Encoding
combi[,Outlet_Size_num := ifelse(Outlet_Size == "Small", 0,
                                 ifelse(Outlet_Size == "Medium", 1, 2))]
 
combi[,Outlet_Location_Type_num := ifelse(Outlet_Location_Type == "Tier 3", 0,
                                   ifelse(Outlet_Location_Type == "Tier 2", 1, 2))]
 
combi[, c("Outlet_Size", "Outlet_Location_Type") := NULL]
 
# One-hot Encoding
ohe = dummyVars("~.", data = combi[,-c("Item_Identifier",
                                       "Outlet_Establishment_Year",
                                       "Item_Type")], fullRank = T)
ohe_df = data.table(predict(ohe, combi[,-c("Item_Identifier",
                                           "Outlet_Establishment_Year",
                                           "Item_Type")]))
 
combi = cbind(combi[,"Item_Identifier"], ohe_df)
 
# Removing Skewness
skewness(combi$Item_Visibility)
skewness(combi$price_per_unit_wt)
 
combi[,Item_Visibility := log(Item_Visibility + 1)]
combi[,price_per_unit_wt := log(price_per_unit_wt + 1)]
 
# Scaling and Centering data
# index of numeric features
num_vars = which(sapply(combi, is.numeric))
num_vars_names = names(num_vars)
 
combi_numeric = combi[,setdiff(num_vars_names,
                               "Item_Outlet_Sales"), with = F]
 
prep_num = preProcess(combi_numeric, method=c("center", "scale"))
combi_numeric_norm = predict(prep_num, combi_numeric)
 
# Transforming Features
combi[,setdiff(num_vars_names, "Item_Outlet_Sales") := NULL]
 
combi = cbind(combi, combi_numeric_norm)
 
# Splitting data
train = combi[1:nrow(train)]
test = combi[(nrow(train) + 1):nrow(combi)]
 
# Removing Item_Outlet_Sales
test[,Item_Outlet_Sales := NULL]
 
# Model Building - xgboost
para_list = list(
        objective = "reg:linear",
        eta=0.01,
        gamma = 1,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.5
        )
 
# D Matrix
d_train = xgb.DMatrix(data = as.matrix(train[,-c("Item_Identifier",
                                                 "Item_Outlet_Sales")]),
                      label= train$Item_Outlet_Sales)
d_test = xgb.DMatrix(data = as.matrix(test[,-c("Item_Identifier")]))
 
# K-fold cross validation
set.seed(123) # Setting seed
xgb_cv = xgb.cv(params = para_list,
               data = d_train,
               nrounds = 1000,
               nfold = 5,
               print_every_n = 10,
               early_stopping_rounds = 30,
               maximize = F)
 
# Training model
model_xgb = xgb.train(data = d_train,
                      params = para_list,
                      nrounds = 428)
 
model_xgb
 
# Variable Importance Plot
variable_imp = xgb.importance(feature_names = setdiff(names(train),
                              c("Item_Identifier", "Item_Outlet_Sales")),
                              model = model_xgb)
 
xgb.plot.importance(variable_imp)


 
 

Output:

 

  • Model model_xgb: 

Output

 

The XGBoost model is built on 21 features with objective = "reg:linear", eta = 0.01, gamma = 1, max_depth = 6, subsample = 0.8, and colsample_bytree = 0.5.

 

  • Variable Importance plot:

Output

 

In the plot, price_per_unit_wt is the second most important feature for the predictive model, and Outlet_Years is the sixth. The constructed features Item_category and Item_Type_new also played a major role in improving the model and thus its accuracy. This is why feature engineering is such an important step in building an efficient, scalable, and accurate predictive model.

 



Last Updated : 20 Mar, 2023