
Learning Model Building in Scikit-learn

Scikit-learn is a powerful and user-friendly Python library. Its simplicity and versatility make it a good choice for both beginners and seasoned data scientists who want to build and apply machine learning models. In this article, we will explore how to build a model with scikit-learn (sklearn).

Pre-requisite: Getting started with machine learning 

What is Scikit-learn?

Scikit-learn is an open-source Python library that implements a range of machine learning, pre-processing, cross-validation, and visualization algorithms behind a unified interface. It provides tools for many machine learning tasks, such as classification, regression, and clustering.

Installation of Scikit-learn

At the time of writing, the latest version of Scikit-learn is 1.1, which requires Python 3.8 or newer.

Scikit-learn requires NumPy and SciPy. Before installing scikit-learn, ensure that both are installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

!pip install -U scikit-learn
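To confirm that the installation worked, one quick check (a minimal sketch, not an official verification step) is to import the package and print its version. Note that the import name is `sklearn`, not `scikit-learn`.

```python
# Verify that scikit-learn is importable and report the installed version.
import sklearn

installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```

If the import fails with a ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the code.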

Let us get started with the modeling process now.

Step 1: Load a Dataset

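A dataset is a collection of data. It generally has two main components: the features (the input variables, stored as a feature matrix X) and the response (the output variable we want to predict, stored as a response vector y).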

Loading exemplar dataset: scikit-learn comes loaded with a few example datasets, such as the iris and digits datasets for classification and the diabetes dataset for regression. (The Boston house prices dataset mentioned in older tutorials was removed in scikit-learn 1.2.)

Given below is an example of how one can load an exemplar dataset: 

# load the iris dataset as an example 
from sklearn.datasets import load_iris 
iris = load_iris() 
  
# store the feature matrix (X) and response vector (y) 
X = iris.data 
y = iris.target 
  
# store the feature and target names 
feature_names = iris.feature_names 
target_names = iris.target_names 
  
# printing features and target names of our dataset 
print("Feature names:", feature_names) 
print("Target names:", target_names) 
  
# X and y are numpy arrays 
print("\nType of X is:", type(X)) 
  
# printing first 5 input rows 
print("\nFirst 5 rows of X:\n", X[:5])

Output: 

Feature names: ['sepal length (cm)','sepal width (cm)',
'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <class 'numpy.ndarray'>
First 5 rows of X:
[[ 5.1 3.5 1.4 0.2]
[ 4.9 3. 1.4 0.2]
[ 4.7 3.2 1.3 0.2]
[ 4.6 3.1 1.5 0.2]
[ 5. 3.6 1.4 0.2]]
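Before modeling, it often helps to check the dimensions of the data. The sketch below is an illustrative addition: it uses the `return_X_y=True` option of `load_iris` to get the feature matrix and response vector directly, then reports their shapes and the number of samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris

# return_X_y=True returns the feature matrix and response vector directly
X, y = load_iris(return_X_y=True)

# 150 samples (rows) and 4 features (columns)
print("X shape:", X.shape)

# one label per sample
print("y shape:", y.shape)

# number of samples in each of the three classes
class_counts = np.bincount(y)
print("Samples per class:", class_counts)
```

The first dimension of X must match the length of y, since each row of the feature matrix needs exactly one response value.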

Loading external dataset: Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library for easily loading and manipulating datasets.

To install pandas, use the following pip command:  

!pip install pandas

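In pandas, the important data types are Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled table whose columns can have different types).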

Note: The CSV file used in the example below can be downloaded from here: weather.csv

import pandas as pd 
  
# reading csv file 
data = pd.read_csv('weather.csv') 
  
# shape of dataset 
print("Shape:", data.shape) 
  
# column names 
print("\nFeatures:", data.columns) 
  
# storing the feature matrix (X) and response vector (y) 
X = data[data.columns[:-1]] 
y = data[data.columns[-1]] 
  
# printing first 5 rows of feature matrix 
print("\nFeature matrix:\n", X.head()) 
  
# printing first 5 values of response vector 
print("\nResponse vector:\n", y.head())

Output: 

Shape: (366, 22)
Features: Index(['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RISK_MM', 'RainTomorrow'],
dtype='object')
Feature matrix:
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir \
0 8.0 24.3 0.0 3.4 6.3 NW
1 14.0 26.9 3.6 4.4 9.7 ENE
2 13.7 23.4 3.6 5.8 3.3 NW
3 13.3 15.5 39.8 7.2 9.1 NW
4 7.6 16.1 2.8 5.6 10.6 SSE
WindGustSpeed WindDir9am WindDir3pm WindSpeed9am ... Humidity9am \
0 30.0 SW NW 6.0 ... 68
1 39.0 E W 4.0 ... 80
2 85.0 N NNE 6.0 ... 82
3 54.0 WNW W 30.0 ... 62
4 50.0 SSE ESE 20.0 ... 68
Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am \
0 29 1019.7 1015.0 7 7 14.4
1 36 1012.4 1008.4 5 3 17.5
2 69 1009.5 1007.2 8 7 15.4
3 56 1005.5 1007.0 2 7 13.5
4 49 1018.3 1018.5 7 7 11.1
Temp3pm RainToday RISK_MM
0 23.6 No 3.6
1 25.7 Yes 3.6
2 20.2 Yes 39.8
3 14.1 Yes 2.8
4 15.4 Yes 0.0
[5 rows x 21 columns]
Response vector:
0 Yes
1 Yes
2 Yes
3 Yes
4 No
Name: RainTomorrow, dtype: object
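Selecting the response column by name is usually less fragile than relying on its position. The sketch below illustrates this with a tiny in-memory CSV; the three rows are made-up stand-ins (an assumption for illustration, not the real weather.csv), and only a few of the column names are reused. It also separates the numeric feature columns, since text columns such as WindGustDir generally need to be encoded before most estimators can use them.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the values are made up.
csv_text = """MinTemp,MaxTemp,WindGustDir,RainTomorrow
8.0,24.3,NW,Yes
14.0,26.9,ENE,Yes
7.6,16.1,SSE,No
"""
data = pd.read_csv(io.StringIO(csv_text))

# select the response by name rather than by position
y = data["RainTomorrow"]
X = data.drop(columns=["RainTomorrow"])

# keep only the numeric feature columns; text columns would
# need to be encoded before being passed to most estimators
X_numeric = X.select_dtypes(include="number")

print("All feature columns:", list(X.columns))
print("Numeric feature columns:", list(X_numeric.columns))
```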

Step 2: Splitting the Dataset

An important part of building any machine learning model is measuring its accuracy. One way to do this is to train the model on the given dataset, predict the response values for that same dataset, and compare the predictions with the known values.
But this method is flawed: the goal is to estimate how the model will perform on data it has not seen, and accuracy measured on the training data rewards overly complex models that memorize (over-fit) the training data without generalizing.

A better option is to split our data into two parts: the first one for training our machine learning model, and the second one for testing our model. 
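The difference between the two approaches can be sketched as follows. This is an illustrative comparison, not part of the original walkthrough: it uses a 1-nearest-neighbor classifier (chosen here because it can simply memorize its training data) and scikit-learn's `train_test_split` helper, and it compares accuracy on the data used for training with accuracy on held-out data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Option 1: train and evaluate on the same data
knn_all = KNeighborsClassifier(n_neighbors=1)
knn_all.fit(X, y)
same_data_accuracy = knn_all.score(X, y)

# Option 2: train on one part and evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)
knn_split = KNeighborsClassifier(n_neighbors=1)
knn_split.fit(X_train, y_train)
held_out_accuracy = knn_split.score(X_test, y_test)

print("Accuracy on the training data itself:", same_data_accuracy)
print("Accuracy on held-out data:", held_out_accuracy)
```

The first number tends to be optimistic because each sample's nearest neighbor is the sample itself; the second is the more honest estimate.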

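To summarize: split the dataset into a training set and a testing set, train the model on the training set, and then evaluate it on the testing set.

Advantages of train/test split: the model is evaluated on data it did not see during training, the response values of the testing set are known so the predictions can be scored, and the resulting testing accuracy is a better estimate of out-of-sample performance than training accuracy.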

Consider the example below: 

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# printing the shapes of the new X objects
print("X_train Shape:", X_train.shape)
print("X_test Shape:", X_test.shape)

# printing the shapes of the new y objects
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)

Output: 

X_train Shape: (90, 4)
X_test Shape: (60, 4)
Y_train Shape: (90,)
Y_test Shape: (60,)

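The train_test_split function takes several arguments. X and y are the feature matrix and response vector to be split. test_size is the fraction of the data held out for testing; with test_size=0.4, 40% of the 150 rows (60 rows) go to the testing set and the remaining 90 to the training set. random_state seeds the shuffling, so passing the same value yields the same split on every run.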
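The effect of these arguments can be checked directly. The sketch below is an illustrative addition: it confirms that a fixed `random_state` reproduces the same split, and it also shows the optional `stratify` argument (not used in the article's example), which keeps the class proportions equal in the training and testing sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# the same random_state produces an identical split on every call
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.4, random_state=1)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.4, random_state=1)
same_split = np.array_equal(X_test_a, X_test_b)
print("Same split with the same random_state:", same_split)

# stratify=y keeps the class proportions the same in both parts
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y
)
strat_counts = np.bincount(y_test_strat)
print("Test samples per class with stratify:", strat_counts)
```

Stratifying matters most for small or imbalanced datasets, where a purely random split can leave a class under-represented in one of the parts.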

Step 3: Training the Model

Now, it's time to train a prediction model on our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a consistent interface for fitting a model, making predictions, measuring accuracy, and so on.
The example given below uses the kNN (k-nearest neighbors) classifier.

Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only. 

Now, consider the example below: 

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# making predictions on the testing set
y_pred = knn.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))

# making prediction for out of sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)

Output: 

kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
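Accuracy condenses the result into a single number. To see which classes are being confused with each other, one option (an illustrative addition, not part of the original walkthrough) is `confusion_matrix` from `sklearn.metrics`, whose rows correspond to the actual classes and columns to the predicted classes. The sketch also checks that the accuracy equals the sum of the diagonal divided by the total number of test samples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

# correct predictions lie on the diagonal
accuracy = accuracy_score(y_test, y_pred)
accuracy_from_cm = np.trace(cm) / cm.sum()
print("Accuracy:", accuracy)
print("Accuracy from the confusion matrix:", accuracy_from_cm)
```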

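Important points to note from the above code:

We create a kNN classifier object that considers the 3 nearest neighbors:

knn = KNeighborsClassifier(n_neighbors=3)

The classifier is trained on the training data by passing the feature matrix and the corresponding response vector. This process is called fitting:

knn.fit(X_train, y_train)

To test the classifier, we call the predict method on the testing feature matrix, which returns the predicted response vector:

y_pred = knn.predict(X_test)

The accuracy is found by comparing the actual response values (y_test) with the predicted ones (y_pred):

print(metrics.accuracy_score(y_test, y_pred))

To make predictions for out-of-sample data, the new samples are passed in the same way as any other feature matrix, with one row per sample and one column per feature:

sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)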
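Because scikit-learn estimators share the same fit and predict methods, trying a different algorithm usually means changing only the line that creates the model. The sketch below illustrates this; `LogisticRegression` and `DecisionTreeClassifier` are not used elsewhere in this article and are chosen here purely as examples, with `max_iter=1000` and `random_state=1` as illustrative settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# three different algorithms, one shared interface
models = {
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)      # same call for every estimator
    y_pred = model.predict(X_test)   # same call for every estimator
    scores[name] = accuracy_score(y_test, y_pred)
    print(name, "accuracy:", scores[name])
```

The loop body never refers to a specific algorithm, which is what makes it easy to compare several models on the same split.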

Features of Scikit-learn

Benefits of using Scikit-learn Libraries

Conclusion

Scikit-learn stands as a cornerstone in the field of machine learning, providing a straightforward yet powerful toolset for building and deploying models. It is useful whether you are a beginner exploring the basics or an experienced data scientist tackling complex problems.
