Feature Engineering in R Programming
Feature engineering is one of the most important techniques in building machine learning models. It is an umbrella term for the many operations performed on variables (features) to make them suitable for an algorithm. Done well, it can increase the accuracy of a model and thereby improve its predictions; models built on engineered features often perform better than models built on the raw data alone. The main aspects of feature engineering are:
- Feature Scaling: It brings the features onto the same scale, which matters for methods that rely on distances between observations (e.g. Euclidean distance).
- Feature Transformation: It normalizes the data (feature) by applying a function, such as a logarithm.
- Feature Construction: It is done to create new features based on original descriptors to improve the accuracy of the predictive model.
- Feature Reduction: It reduces the number of features, with the aim of improving the statistical distribution and accuracy of the predictive model.
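As a rough illustration of the scaling and transformation items above, the following base R sketch standardizes two numeric columns, rescales one to the range [0, 1], and applies a log transformation. The data frame and the helper `min_max()` are invented for this example and are not part of the BigMart analysis.

```r
# Toy data, invented for illustration only
df <- data.frame(
  weight = c(10, 5, 8, 20, 12),
  price  = c(250, 50, 40, 180, 95)
)

# Feature scaling (standardization): each column gets mean 0 and sd 1
scaled <- as.data.frame(scale(df))

# Feature scaling (min-max): hypothetical helper mapping values to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
df$price_01 <- min_max(df$price)

# Feature transformation: log1p(x) computes log(1 + x),
# which compresses large values and reduces right skew
df$price_log <- log1p(df$price)

print(scaled)
print(df)
```

Standardization is the usual choice when features feed a distance-based method, while the log transformation is one of several functions that can be used to reshape a skewed feature.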
The Feature Construction method creates new features in the data with the aim of increasing model accuracy and improving overall predictions. It is of two types:
- Binning: Bins are created for continuous variables.
- Encoding: Numerical variables or features are formed from categorical variables.
Binning converts a continuous variable into a categorical variable by grouping its values into bins. There are two types of binning: Unsupervised and Supervised.
- Unsupervised Binning involves Automatic and Manual binning. In Automatic Binning, the bins are created automatically, without human intervention. In Manual Binning, we specify where the bin boundaries should be placed.
- Supervised Binning involves creating bins for the continuous variable while also taking the target variable into consideration.
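The unsupervised variants described above can be sketched with base R's `cut()`. The price vector and the manual break points below are invented for illustration; supervised binning is not shown because it needs a target variable and usually an additional package.

```r
# Toy continuous variable, invented for illustration only
mrp <- c(31.3, 48.3, 53.9, 107.8, 141.6, 182.1, 249.8, 266.9)

# Automatic binning: cut() chooses three equal-width intervals itself
auto_bins <- cut(mrp, breaks = 3)

# Automatic binning, equal-frequency variant: breaks taken from quartiles
quant_bins <- cut(
  mrp,
  breaks = quantile(mrp, probs = seq(0, 1, by = 0.25)),
  include.lowest = TRUE  # keep the minimum value inside the first bin
)

# Manual binning: we decide where the boundaries go (hypothetical cut points)
manual_bins <- cut(
  mrp,
  breaks = c(0, 70, 140, 210, Inf),
  labels = c("low", "medium", "high", "very_high")
)

print(table(auto_bins))
print(table(quant_bins))
print(table(manual_bins))
```

In each case the result is a factor, so the continuous variable has been converted to a categorical one.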
Encoding is the process in which numerical variables or features are created from categorical variables. It is widely used in industry and is a common step in model building, because many algorithms require numeric input. It is of two types: Label Encoding and One-hot Encoding.
- Label Encoding involves assigning each label a unique integer or value based on alphabetical ordering. It is a popular and widely used encoding.
- One-hot Encoding involves creating additional features or variables on the basis of the unique values in a categorical variable, i.e. every unique value in the category is added as a new feature.
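Both encodings can be sketched in base R. The example below uses a small invented vector of store sizes; `factor()` orders its levels alphabetically by default, which matches the label encoding described above, and `model.matrix()` is one of several ways to build one-hot columns.

```r
# Toy categorical variable, invented for illustration only
outlet_size <- c("Medium", "Small", "High", "Small", "Medium")

# Label encoding: factor() sorts the levels alphabetically
# (High, Medium, Small) and as.integer() returns each level's index
size_factor <- factor(outlet_size)
size_label  <- as.integer(size_factor)

# One-hot encoding: "- 1" drops the intercept so that every level
# gets its own 0/1 indicator column
size_onehot <- model.matrix(~ size_factor - 1)
colnames(size_onehot) <- levels(size_factor)

print(size_label)
print(size_onehot)
```

Label encoding keeps a single column but implies an ordering between the categories, whereas one-hot encoding avoids that at the cost of one extra column per unique value.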
Implementation in R
The BigMart dataset consists of 1559 products across 10 stores in different cities. Certain attributes of each product and store have been defined. It consists of 12 features:
- Item_Identifier: a unique product ID assigned to every distinct item.
- Item_Weight: the weight of the product.
- Item_Fat_Content: whether the product is low fat or not.
- Item_Visibility: the percentage of the total display area of all products in a store allocated to the particular product.
- Item_Type: the food category to which the item belongs.
- Item_MRP: the Maximum Retail Price (list price) of the product.
- Outlet_Identifier: the unique store ID, an alphanumeric string of length 6.
- Outlet_Establishment_Year: the year in which the store was established.
- Outlet_Size: the size of the store in terms of ground area covered.
- Outlet_Location_Type: the size of the city in which the store is located.
- Outlet_Type: whether the outlet is just a grocery store or some sort of supermarket.
- Item_Outlet_Sales: the sales of the product in the particular store.
Performing Feature Engineering on the dataset
The Feature Construction method is applied to the dataset, which includes 12 features covering 1559 products across 10 stores in different cities.
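The article does not show how the new features were built, so the following is only a sketch of how features with the names used later (price_per_unit_wt, Outlet_Years, Item_category) might be constructed. The formulas, the reference year 2013, the two-letter identifier prefix, and the toy rows are all assumptions made for illustration, not the original analysis.

```r
# Toy rows that mimic the BigMart column names; values are invented
bigmart <- data.frame(
  Item_Identifier           = c("FDX01", "DRY02", "NCZ03"),
  Item_Weight               = c(10, 5, 8),
  Item_MRP                  = c(250, 50, 40),
  Outlet_Establishment_Year = c(1999, 2009, 1987),
  stringsAsFactors = FALSE
)

# Assumed definition: list price divided by product weight
bigmart$price_per_unit_wt <- bigmart$Item_MRP / bigmart$Item_Weight

# Assumed definition: age of the outlet relative to a reference year
reference_year <- 2013  # assumption, not stated in the article
bigmart$Outlet_Years <- reference_year - bigmart$Outlet_Establishment_Year

# Assumed definition: broad category taken from the first two
# characters of the item identifier
bigmart$Item_category <- factor(substr(bigmart$Item_Identifier, 1, 2))

print(bigmart)
```

Each new column is derived only from existing columns, which is what distinguishes feature construction from collecting additional data.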
- Model model_xgb:
The XGBoost model uses 21 features with a linear regression objective, eta = 0.01, gamma = 1, max_depth = 6, colsample_bytree = 0.5, and silent = 1.
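The settings above can be collected in a plain R list, which is the form the `params` argument of `xgb.train()` in the xgboost package accepts. The name `xgb_params` and the objective string "reg:linear" are assumptions based on the description; the training call itself is left out because the article does not show it.

```r
# Hyperparameters as described in the text; the list name is hypothetical
xgb_params <- list(
  objective        = "reg:linear",  # assumed spelling; newer xgboost releases
                                    # name this objective "reg:squarederror"
  eta              = 0.01,          # learning rate
  gamma            = 1,             # minimum loss reduction needed to split
  max_depth        = 6,             # maximum depth of each tree
  colsample_bytree = 0.5,           # fraction of features sampled per tree
  silent           = 1              # suppress messages; newer releases use
                                    # the verbosity parameter instead
)

str(xgb_params)
```

With a low eta such as 0.01, each tree contributes only a small correction, so a model like this generally needs many boosting rounds.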
- Variable Importance plot:
price_per_unit_wt is the second most important feature for the predictive model, and Outlet_Years is the sixth most important. The Item_category and Item_Type_new features also played a major role in improving the accuracy of the predictive model. Feature engineering is therefore an important step in building an efficient, scalable, and accurate predictive model.