Model Building for Data Analytics
Last Updated: 29 May, 2023
Prerequisite – Life Cycle Phases of Data Analytics
After formulating the problem and preprocessing the data accordingly, we select the type of model to build. If the problem requires highly explainable results, we use models such as linear regression or a decision tree; if it requires higher accuracy, we build models such as XGBoost or a deep neural network.
Model Building In Data Analytics
Model building is an essential part of data analytics: it is used to extract insights and knowledge from the data to inform business decisions and strategies. In this phase of the project, the data science team needs to develop data sets for training, testing, and production purposes. These data sets enable data scientists to develop an analytical method and train it while holding aside some of the data for testing the model. The aim is not only high accuracy on the training data but also the ability to generalize and perform well on new, unseen data. Therefore, the focus is on creating a model that captures the underlying patterns and relationships in the data rather than simply memorizing the training data.
To do this, we divide our dataset into two parts:
- Training dataset
- Test dataset
Note: Depending on the quality and quantity of the data, one may choose to divide the dataset into three parts: training, validation, and test data.
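As a minimal sketch of the three-way split mentioned in the note, train_test_split can be applied twice. The synthetic data, the 60/20/20 proportions, and all variable names below are assumptions made for illustration. The validation set is then used to compare an unconstrained tree with a size-limited one, which shows the difference between memorizing the training data and generalizing.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: a noisy nonlinear target built from two features
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.5, size=1000)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Step 2: take 25% of the remaining rows (20% of the total) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 600 200 200

# Compare training error and validation error for two tree sizes
results = {}
for max_leaf_nodes in (None, 20):  # None lets the tree grow without limit
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    val_mse = mean_squared_error(y_val, tree.predict(X_val))
    results[max_leaf_nodes] = (train_mse, val_mse)
    print(f"max_leaf_nodes={max_leaf_nodes}: "
          f"train MSE={train_mse:.3f}, validation MSE={val_mse:.3f}")
```

The unconstrained tree reaches a training error of zero because it memorizes every training row, so only its validation error says anything about how it generalizes. The test set stays untouched until a final model has been chosen.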
Dividing The Dataset For Model Building
To divide the dataset we will use the Python sklearn library, whose train_test_split function splits a dataset into training and testing sets. We can choose the ratio by which to divide the dataset; by default it is 3:1 (75% for training and 25% for testing).
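The default ratio can be checked with a small sketch. The toy arrays and the random_state value are assumptions for illustration; random_state is only there to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 2 feature columns
features = np.arange(200).reshape(100, 2)
target = np.arange(100)

# No test_size given, so the default of 0.25 applies
train_x, test_x, train_y, test_y = train_test_split(
    features, target, random_state=42)
print(train_x.shape, test_x.shape)  # (75, 2) (25, 2)

# An explicit ratio, here 80% training and 20% testing
train_x80, test_x20, _, _ = train_test_split(
    features, target, test_size=0.2, random_state=42)
print(train_x80.shape, test_x20.shape)  # (80, 2) (20, 2)

# Repeating the call with the same seed reproduces the same split
_, test_x_again, _, _ = train_test_split(features, target, random_state=42)
same_split = np.array_equal(test_x, test_x_again)
```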
Python code for creating and dividing the dataset
We will first create a random array with 2 columns and 1000 rows and convert it into a dataframe using pandas. We will then separate the dataset into independent variables (X) and a dependent variable (y), and use the sklearn package to divide them into train and test datasets.
Python3
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# 2000 random integers in [10, 100), arranged as 1000 rows x 2 columns
data = np.random.randint(low=10, high=100, size=2000).reshape(1000, 2)
data = pd.DataFrame(data, columns=('x', 'y'))

# Independent variables (features) and a random continuous target
X = data[['x', 'y']]
y = np.random.rand(1000)

# Hold out 25% of the rows for testing
train_data_x, test_data_x, train_data_y, test_data_y = \
    train_test_split(X, y, test_size=0.25)
Scaling The Dataset
Scaling the dataset is an important preprocessing step before feeding the data to the model. Scaling the data has several benefits:
- It prevents features with larger scales from dominating the model. For example, suppose column A has data ranging from 1 to 1000 and column B has data ranging from 0 to 1. In that case, column A can influence the model's decisions even if it is not an important feature. After scaling, all columns fall in a similar range.
- It speeds up model convergence. Many optimization algorithms, such as gradient descent, are very sensitive to the scale of the data. When the features are scaled to a common range such as 0 to 1, these algorithms converge faster.
Effect of scaling on Gradient Descent
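- Scaling can reduce the influence of extreme values on some models. Note, however, that min-max scaling is itself sensitive to outliers, because a single extreme value stretches the range used for scaling.
- Some algorithms, such as K-nearest neighbors (KNN), use the distance between data points to make predictions. If the columns have different scales, the distance is dominated by the columns with the largest range.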
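The effect on distance-based methods can be illustrated with a small sketch. The five points below are hypothetical and were chosen to mirror the column A (1 to 1000) and column B (0 to 1) example; the helper name is also an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: A (range 1..1000) and B (range 0..1)
points = np.array([
    [100.0, 0.10],   # query point
    [110.0, 0.90],   # close to the query in A, far away in B
    [400.0, 0.12],   # far from the query in A, close in B
    [1.0, 0.00],     # these two rows only pin down each column's min and max
    [1000.0, 1.00],
])

def distances_to_query(arr):
    """Euclidean distance from the query (row 0) to rows 1 and 2."""
    return np.linalg.norm(arr[1:3] - arr[0], axis=1)

raw = distances_to_query(points)

# Min-max scaling maps each column to [0, 1]: (x - min) / (max - min)
scaled_points = MinMaxScaler().fit_transform(points)
manual = (points - points.min(axis=0)) / (points.max(axis=0) - points.min(axis=0))
scaled = distances_to_query(scaled_points)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

On the raw data, column A decides almost alone which point is nearest to the query. After scaling, the large difference in column B counts as well, and the nearest neighbour changes.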
Python code for scaling the columns
We will use the MinMaxScaler object from the sklearn library to scale the independent features of the dataset.
Python3
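from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Learn each column's min and max from the training data only
train_data_x_scaled = scaler.fit_transform(train_data_x.to_numpy())
# Reuse the training min and max to scale the test data
test_data_x = scaler.transform(test_data_x.to_numpy())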
Modeling The Data
After scaling and splitting, the data is ready to be fitted to a model. The choice of model depends entirely on our problem formulation. There are a variety of models to choose from. However, before choosing a model, we should first settle these points:
- Whether our problem is a regression problem or a classification problem
- Whether we want a model which is more explainable or we want a model which has a higher accuracy
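One hedged way to weigh explainability against accuracy is to cross-validate a simple model and a more flexible one on the same data. In this sketch the synthetic dataset (make_friedman1) is an assumption, and sklearn's GradientBoostingRegressor stands in for a boosted model such as XGBoost so that no extra package is needed.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data with nonlinear structure (assumed for this sketch)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

candidates = {
    "linear regression (more explainable)": LinearRegression(),
    "gradient boosting (more flexible)": GradientBoostingRegressor(random_state=0),
}

# 5-fold cross-validated R^2 for each candidate
mean_r2 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {mean_r2[name]:.3f}")
```

If the simpler model scores nearly as well, its coefficients are usually easier to explain to stakeholders. A clear gap in favour of the flexible model suggests the data has structure that the simple model cannot capture.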
Python code for modeling the data
Since our target value is continuous, we will treat this as a regression problem. To keep it simple and explainable, we will use the decision tree model.
Python3
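from sklearn.tree import DecisionTreeRegressor, plot_tree

# Limit the tree's size so that it stays small enough to read
reg = DecisionTreeRegressor(min_samples_split=4, max_leaf_nodes=10)
reg.fit(train_data_x_scaled, train_data_y)
# test_data_x was already scaled in the previous step
y_pred = reg.predict(test_data_x)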
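The scaling and fitting steps can also be chained in a single pipeline, so the scaler is always fitted on the training data only and reapplied automatically at prediction time. This is a sketch of an alternative arrangement, not the article's own code; the seeded data and variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Seeded stand-in for the random dataset used in this article
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(10, 100, size=(1000, 2)), columns=["x", "y"])
y = rng.random(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit() scales the training data and fits the tree;
# predict() applies the same scaling before predicting
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeRegressor(min_samples_split=4, max_leaf_nodes=10, random_state=0),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_pred.shape)  # (250,)
```

Tree-based models are generally insensitive to feature scaling, so the scaler matters more when the final estimator relies on distances or gradient descent. The pipeline pattern stays the same in either case.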
After building the model, we evaluate it with an evaluation metric. In our case, we will use the mean squared error to measure the error of our model's predictions.
Python code for evaluation
Python3
from sklearn.metrics import mean_squared_error

# Arguments are (y_true, y_pred)
print(mean_squared_error(test_data_y, y_pred))
Output:
0.109
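Since the target was randomly generated and is unrelated to the features, a mean squared error of about 0.109 is in line with what we should expect: no model can do much better than the variance of a uniform variable on [0, 1), which is 1/12 ≈ 0.083. The exact value changes from run to run because no random seed is set. One good thing about the decision tree is that we can also see the decisions that were made to model the data.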
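A mean squared error is easier to interpret next to a trivial baseline. The sketch below, which uses a seeded stand-in for the random data (an assumption), compares the tree with sklearn's DummyRegressor, which always predicts the mean of the training target.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Random features and an unrelated random target, as in the example above
rng = np.random.default_rng(0)
X = rng.integers(10, 100, size=(1000, 2))
y = rng.random(1000)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
tree = DecisionTreeRegressor(
    min_samples_split=4, max_leaf_nodes=10, random_state=0).fit(X_train, y_train)

baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
tree_mse = mean_squared_error(y_test, tree.predict(X_test))
print(f"baseline MSE: {baseline_mse:.3f}")
print(f"tree MSE:     {tree_mse:.3f}")
```

When the tree's error is no better than the baseline's, as is likely with a purely random target, the model has not learned anything useful from the features.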
Plotting The Decision Graph
We can use the plot_tree function from the sklearn library to visualize the basis on which each decision is made.
Python3
import matplotlib.pyplot as plt

# High dpi with a small font keeps the node labels legible when zoomed in
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
plot_tree(reg, filled=True, ax=axes, fontsize=2)
plt.show()
Output:
Decision tree diagram for the model
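When a plot is inconvenient, for example in a terminal or a log file, the same decisions can be printed as text with sklearn's export_text function. This is a sketch on a small seeded dataset; the data and the limit of 4 leaves are assumptions chosen to keep the output short.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Small random dataset with two features named x and y
rng = np.random.default_rng(0)
X = rng.integers(10, 100, size=(200, 2))
y = rng.random(200)

tree = DecisionTreeRegressor(
    min_samples_split=4, max_leaf_nodes=4, random_state=0)
tree.fit(X, y)

# Each indented line is one split threshold or one leaf value
rules = export_text(tree, feature_names=["x", "y"])
print(rules)
```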