
How K-Fold Prevents Overfitting in a Model

Last Updated : 02 Apr, 2024

In machine learning, accurately assessing how well a model performs and whether it can handle new data is crucial. Yet, with limited data or concerns about generalization, traditional evaluation methods may not cut it. That’s where cross-validation steps in. It is a method that rigorously tests predictive models by splitting the data, training on one part, and testing on another. Among these methods, K-Fold cross-validation stands out as a reliable and popular choice.

In this article, we’ll look at the K-Fold cross-validation approach and how it helps to reduce overfitting in models.

What is Cross-validation?

Cross-validation is a method for evaluating a predictive model’s effectiveness and capacity to generalize. The dataset is divided into subsets, the model is fitted to one subset (the training set), and the model is assessed on the complementary subset (the validation set). This operation is repeated over several rounds, each with a distinct split, and the performance numbers are averaged.

There are various approaches to cross-validation; K-Fold Cross-validation is one of the more well-known techniques.

What is K-Fold Cross-validation?

K-Fold Cross-validation is a technique used in machine learning to assess the performance and generalizability of a model. The basic idea is to partition the dataset into “K” subsets (folds) of approximately equal size. The model is then trained K times; in each iteration, K-1 folds are used for training and the remaining fold for validation, with a different fold serving as the validation set each time.

K-Fold Cross-validation helps in obtaining a more reliable estimate of a model’s performance by reducing the impact of the specific data split on the evaluation. It is particularly useful when the dataset is limited or when there is a concern about the randomness of the data partitioning.

Common choices for K include 5, 10, or sometimes even higher values, depending on the size of the dataset and the computational resources available. In the extreme case where K equals the total number of samples in the dataset, it is called “Leave-One-Out Cross-validation” (LOOCV). However, LOOCV can be computationally expensive and might not be practical for large datasets.

For k-fold cross-validation, the dataset D is divided into k equal-sized partitions at random. For greater randomization, D may be shuffled before cross-validation. Common choices are k = 2, 5, or 10 (10 being the most common). For example, with D = 250 samples and k = 5, each fold contains 50 samples.

Steps for K-Fold Cross-validation are as follows:

  1. Shuffle the data for randomization.
  2. Divide the dataset into K subsets or folds.
  3. Train-Validation Loop: For each iteration:
    • Use K-1 folds for training the model.
    • Use the remaining fold for validation.
  4. Evaluate the model’s performance on each validation set using a predefined metric (e.g., accuracy, precision, recall, F1 score).
  5. Calculate the average performance across all K iterations.
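
These steps can be run in a few lines with scikit-learn’s KFold and cross_val_score helpers. The sketch below is a minimal illustration on a small synthetic regression dataset (the data and the LinearRegression estimator are assumptions made purely for illustration):

Python3
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real dataset (illustration only)
X, y = make_regression(n_samples=250, n_features=5, noise=10.0, random_state=42)

# Steps 1-2: shuffle and partition the data into K = 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Steps 3-5: train on K-1 folds, validate on the remaining fold, and average the metric
scores = cross_val_score(LinearRegression(), X, y, cv=kf, scoring="r2")
print(scores)         # per-fold R2 scores
print(scores.mean())  # average R2 across the 5 folds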

What is Overfitting?

Overfitting happens when a machine learning model learns the training data so well that it treats noise or random fluctuations in the data as meaningful patterns. This results in poor performance when the model is applied to new, previously unseen data, since it does not generalize properly.

Overfitting can be reduced by using:

  • Regularization
  • Cross validation
  • Early stopping
  • Dropout

K-Fold Implementation to the Model

Let’s see how the model’s predictions differ with and without K-Fold cross-validation. For this, we will use california_housing_test.csv.

Step 1: Import Necessary Libraries

First, we need to import the relevant libraries.

Python3
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Step 2: Load the dataset

Python3
df = pd.read_csv("/content/sample_data/california_housing_test.csv")
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

median_house_value is our target, and the rest of the features are input columns.

Step 3: Preprocessing the dataset

Python3
label_encoder = LabelEncoder()

# Fit and transform the "ocean_proximity" column
df['ocean_proximity_encoded'] = label_encoder.fit_transform(df['ocean_proximity'])
# Drop the original categorical column now that it has been encoded
df.drop('ocean_proximity', axis=1, inplace=True)

# Forward-fill the missing values in total_bedrooms
df['total_bedrooms'] = df['total_bedrooms'].ffill()
df.head()

Output:

    longitude    latitude    housing_median_age    total_rooms    total_bedrooms    population    households    median_income    median_house_value    ocean_proximity_encoded
0    -122.23    37.88    41.0    880.0    129.0    322.0    126.0    8.3252    452600.0    3
1    -122.22    37.86    21.0    7099.0    1106.0    2401.0    1138.0    8.3014    358500.0    3
2    -122.24    37.85    52.0    1467.0    190.0    496.0    177.0    7.2574    352100.0    3
3    -122.25    37.85    52.0    1274.0    235.0    558.0    219.0    5.6431    341300.0    3
4    -122.25    37.85    52.0    1627.0    280.0    565.0    259.0    3.8462    342200.0    3


Step 4: Separating the features and target

Python3
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

Defining Model: Without K-Fold cross validation

Python3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = r2_score(y_test, y_pred)

print(f"R2 Score: {score}")

Output:

R2 Score: 0.6114554518898516

Defining Model: With K-Fold cross validation

This code implements K-Fold Cross-validation for a linear regression model where the target variable is median_house_value.

  • The number of folds k is set to 5, and a KFold object ‘kf’ is initialized with 5 splits, shuffling the data and fixing the random state for reproducibility.
  • Next, the code iterates over each fold using a for loop. For each fold, it splits the data into training and testing sets using the indices provided by kf.split(X).
  • Finally, the code calculates the average R2 score across all folds by summing up the scores and dividing by the number of folds.
Python3
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
scores = []

# Iterate over the splits
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Initialize and train the model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate the model
    y_pred = model.predict(X_test)
    score = r2_score(y_test, y_pred)
    scores.append(score)
    
    print(f"Fold {fold+1} R2 Score: {score}")

# Calculate the average score
average_score = sum(scores) / len(scores)
print(f"Average R2 Score: {average_score}")

Output:

Fold 1 R2 Score: 0.6114554518898566
Fold 2 R2 Score: 0.6425719794066727
Fold 3 R2 Score: 0.6382892378835952
Fold 4 R2 Score: 0.6654790505178491
Fold 5 R2 Score: 0.6057229383411187
Average R2 Score: 0.6327037316078185

With k-fold cross-validation, we evaluate the model multiple times on distinct subsets of the data, yielding a more trustworthy estimate of performance and aiding in the detection of overfitting or model instability. Without cross-validation, we assess the model’s performance on only one split of the data.

In the example above, the R2 score is 0.61 without cross-validation and 0.63 (averaged across folds) with K-Fold cross-validation.

  • Without cross-validation, an R2 score of 0.61 indicates that 61% of the variance is explained by the model.
  • With cross-validation, the average R2 score of 0.63 reflects slightly better and, more importantly, more representative performance, since it captures the model’s generalizability across different data splits.
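
The spread of the per-fold scores is also informative: a large gap between folds suggests the model is unstable or overfitting to particular splits. As a minimal check, continuing with the scores list from the K-Fold loop above:

Python3
import statistics

# A wide spread across folds hints at instability or overfitting to particular splits
print(f"Std of fold R2 scores: {statistics.stdev(scores):.4f}")
print(f"Min / Max fold R2: {min(scores):.4f} / {max(scores):.4f}")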

How Does K-Fold Reduce Overfitting in the Model?

K-fold cross-validation reduces model overfitting through a variety of mechanisms:

  1. More robust evaluation: By splitting the dataset into numerous folds and averaging the performance measures across these folds, k-fold cross-validation delivers a more reliable assessment of the model’s performance. This reduces the impact of variability in training and testing data splits, allowing for more accurate assessments of the model’s generalization capabilities.
  2. Reduced dependency on a single train-test split: In traditional train-test splitting, the model’s performance can be heavily influenced by the specific random split of the data. K-fold cross-validation addresses this issue by repeatedly partitioning the data into separate train-test sets, allowing for a more comprehensive evaluation of the model’s performance across various subsets of the data.
  3. Use of all data for training and testing: In k-fold cross-validation, every data point is used for validation exactly once and for training in the remaining K-1 iterations. This guarantees that the model is tested on a wide range of data points, allowing for a more thorough evaluation of its generalization capacity. By using all available data for both training and testing, k-fold cross-validation helps to decrease the bias that can result from a single train-test split.
  4. Regularization parameter tuning: K-fold cross-validation can also be used to tune hyperparameters, such as the regularization parameter in logistic regression or support vector machines. By iteratively training the model on multiple subsets of the data and evaluating its performance, k-fold cross-validation aids in identifying the appropriate hyperparameters that balance model complexity with generalization ability, reducing overfitting.
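
To illustrate the last point, here is a minimal sketch that tunes the regularization strength C of a logistic regression with 5-fold cross-validation using GridSearchCV. The synthetic data and the candidate grid are assumptions made purely for illustration:

Python3
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic classification data (illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical grid of regularization strengths to try
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Each candidate C is scored with 5-fold cross-validation; the best one is kept
kf = KFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=kf, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)  # regularization strength chosen by cross-validation
print(search.best_score_)   # mean cross-validated accuracy for that choice

Because every candidate is judged on held-out folds rather than on the data it was trained on, hyperparameters that merely memorize the training set score poorly and are filtered out.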

K-Fold Cross validation: FAQs

What are the benefits of using K-Fold cross-validation?

K-Fold cross-validation provides a more rigorous assessment of the model’s performance, reduces reliance on a single train-test split (giving more trustworthy estimates of the model’s generalization capabilities), uses all of the data for both training and testing, and can be used to tune hyperparameters and improve model performance.

What value of k should I use?

The number of folds (k) used in K-Fold cross-validation depends on several factors, including dataset size and computational capabilities. Common choices for k are 5 and 10; however, you can experiment with different values to determine what works best for your particular dataset and model.

When to use K-Fold Cross validation?

K-Fold cross-validation is typically used during the model construction and evaluation phases to examine a machine learning model’s performance and generalization capabilities. It is especially beneficial when working with small datasets or when you need to ensure that your model generalizes adequately to new, previously unseen data.

What are limitations of K-Fold cross validation?

While K-Fold cross-validation is an effective technique, it can be computationally expensive, particularly for large datasets or sophisticated models. Furthermore, it may not be suited for time-series data or datasets with dependencies between data points. It is important to consider these factors when deciding whether to use K-Fold cross-validation.

What are alternatives to K-Fold cross-validation?

Alternatives include Leave-One-Out (LOO) cross-validation, Stratified K-Fold cross-validation, and others.
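
For example, here is a minimal sketch of Stratified K-Fold, which preserves the class proportions in every fold (the imbalanced synthetic classification data below is an assumption for illustration):

Python3
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Imbalanced synthetic classification data (illustration only)
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=42)

# StratifiedKFold keeps the class ratio roughly constant across folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold + 1}: {y[test_idx].sum()} positive samples in the validation fold")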


