Y Scrambling for Model Validation

Last Updated : 16 Apr, 2021

Y-Scrambling is a method for testing whether a model’s apparent predictive performance could have arisen by chance. It is used in the validation of multiple linear regression QSPR (quantitative structure-property relationship) models. It goes by several names: Y-Scrambling, Y-Randomization, Y-Permutation, etc. The process is simple to execute, and we’ll go through it in detail.

Steps for Y-Scrambling:

The intuition behind Y-Scrambling is simple. First, you train your model on the original data and note its performance metric. Next, you shuffle the target column so that the correct feature-target pairs are replaced with incorrect ones. You then train the model on this scrambled data and note its performance metric. You re-shuffle the target column and repeat these steps. We expect the model to perform well on the original data and poorly on the shuffled data. If that’s not the case and the metric barely changes, the model is probably fitting chance correlations rather than a real relationship between features and target, so its predictions aren’t robust. The step-wise process is as follows:

1. Train the model on the original feature-target pairs.
2. Note the performance metric.
3. Repeat for a fixed number of iterations:
   - Shuffle the target column.
   - Train the model on the new feature-target pairs.
   - Note the performance metric.
4. Compare the metric for the original pairs with the metrics for the shuffled ones.
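The steps above can be sketched as one reusable function. This is a minimal illustration rather than a definitive implementation: the function name y_scramble, its parameters, and the synthetic data from make_regression are assumptions introduced here, and the metric is R^2 measured on the training data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def y_scramble(model, X, y, n_iter=100, seed=0):
    """Return (original_score, shuffled_scores) using training-set R^2.

    Hypothetical helper written for this example.
    """
    rng = np.random.default_rng(seed)

    # Fit on the true feature-target pairs and record the metric.
    fitted = clone(model).fit(X, y)
    original_score = r2_score(y, fitted.predict(X))

    # Repeatedly shuffle the target, refit, and record the metric.
    shuffled_scores = []
    for _ in range(n_iter):
        y_perm = rng.permutation(y)  # shuffled copy; y itself is untouched
        fitted = clone(model).fit(X, y_perm)
        shuffled_scores.append(r2_score(y_perm, fitted.predict(X)))

    return original_score, np.array(shuffled_scores)


# Synthetic data (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
orig_score, perm_scores = y_scramble(LinearRegression(), X_demo, y_demo, n_iter=50)

# Compare the original metric with the shuffled ones.
print(f"original R^2: {orig_score:.3f}")
print(f"shuffled R^2: mean={perm_scores.mean():.3f}, max={perm_scores.max():.3f}")
```

Cloning the estimator on every iteration keeps each fit independent of the previous one, and drawing a shuffled copy leaves the caller’s target array unchanged.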

Implementing Y-Scrambling:

For this tutorial, I’ll be using the Boston house pricing dataset from sklearn’s datasets module. load_boston returns a dictionary-like object in which the features are stored under the data key and the targets under the target key. Note that load_boston was removed in scikit-learn 1.2, so the code below requires an older version. Let’s start by importing the data:

import numpy as np
from sklearn.datasets import load_boston

data = load_boston()
X = data.data
Y = data.target

Now that we have the features and the target, let’s execute the first two steps of Y-Scrambling, i.e. training the model and noting the performance metric (here, R² on the training data).

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

reg = LinearRegression()
reg.fit(X,Y)

ypred = reg.predict(X)
original_r2 = r2_score(Y,ypred)
print(original_r2)

The original_r2 came out to be 0.74064. With this, we’ve completed the first two steps. Next, we shuffle the target array, train the model, and store the performance metric in a loop. These steps are repeated for a fixed number of iterations, which I took as 100 for this tutorial. One thing to note is that np.random.shuffle shuffles Y in place, so the original order of the targets is overwritten once the loop starts.

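from tqdm.notebook import trange  # progress bar for Jupyter notebooks

shuffled_r2 = []
for i in trange(100):
    # Shuffle the targets in place, breaking the feature-target pairing
    np.random.shuffle(Y)

    # Retrain on the scrambled pairs
    reg = LinearRegression()
    reg.fit(X, Y)

    # Score the fit against the scrambled targets
    ypred = reg.predict(X)
    shuffled_r2.append(r2_score(Y, ypred))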
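If the original target order is still needed after the loop, one option is to draw a shuffled copy on each iteration instead of shuffling Y itself. The sketch below uses NumPy’s Generator.permutation on a made-up toy array; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)            # seeded generator for reproducibility
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # toy target values (made up)
y_before = y.copy()

y_perm = rng.permutation(y)                # returns a shuffled copy of y

# The original array is unchanged ...
unchanged = np.array_equal(y, y_before)
# ... and the shuffled copy contains exactly the same values.
same_values = np.array_equal(np.sort(y_perm), np.sort(y))

print(unchanged, same_values)
```

Inside the loop, the model would then be fitted on X and y_perm, leaving Y available for later comparisons.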

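If you print shuffled_r2, you’ll see that the model performed very poorly on the shuffled targets: the R² values stay close to zero, far below the 0.74064 obtained on the original data. The first 20 values of shuffled_r2 are as follows: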

>>> shuffled_r2[:20]
[0.015336761335013271,
 0.0176654793204013,
 0.01740534118134418,
 0.02319807700450416,
 0.018487786525668626,
 0.02251746334707183,
 0.03766952947632973,
 0.01854475963361435,
 0.03570134149232318,
 0.022607830815118635,
 0.016603896471999002,
 0.0386838401376941,
 0.024355424374905343,
 0.04058673452547956,
 0.014581835385169217,
 0.03193842111822809,
 0.03366492627548756,
 0.02274120932669821,
 0.04335824299249236,
 0.02665799106621214]
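To turn this comparison into a single number, one common choice is an empirical p-value: the fraction of shuffled runs that score at least as well as the original, with a +1 correction in numerator and denominator. This is a sketch of the analysis step rather than part of the original tutorial; it uses the 20 values printed above, rounded to four decimals.

```python
import numpy as np

original_r2 = 0.74064

# First 20 shuffled scores from the output above, rounded to four decimals.
shuffled_r2 = np.array([
    0.0153, 0.0177, 0.0174, 0.0232, 0.0185, 0.0225, 0.0377, 0.0185, 0.0357, 0.0226,
    0.0166, 0.0387, 0.0244, 0.0406, 0.0146, 0.0319, 0.0337, 0.0227, 0.0434, 0.0267,
])

# Count the shuffled runs that match or beat the original score.
n_as_good = int(np.sum(shuffled_r2 >= original_r2))

# Empirical p-value; the +1 terms keep it from ever being exactly zero.
p_value = (n_as_good + 1) / (len(shuffled_r2) + 1)

print(f"max shuffled R^2: {shuffled_r2.max():.4f}")
print(f"runs as good as original: {n_as_good}")
print(f"empirical p-value: {p_value:.4f}")
```

With only 20 shuffles the smallest attainable p-value is 1/21, so more iterations are needed if a finer resolution matters.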

Code:

Python3

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# LOADING THE DATA
data = load_boston()
X = data.data
Y = data.target

# TRAINING OVER ORIGINAL TARGET
reg = LinearRegression()
reg.fit(X, Y)

ypred = reg.predict(X)
original_r2 = r2_score(Y, ypred)
print(original_r2)

# TRAINING OVER SHUFFLED TARGET
shuffled_r2 = []

for i in range(100):
    np.random.shuffle(Y)

    reg = LinearRegression()
    reg.fit(X, Y)

    ypred = reg.predict(X)
    shuffled_r2.append(r2_score(Y, ypred))

print(shuffled_r2[:20])
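scikit-learn also provides permutation_test_score in its model_selection module, which runs the same shuffle-and-refit loop and returns the original score, the shuffled scores, and an empirical p-value. It differs from the code above in one respect: scores are cross-validated rather than measured on the training data. The sketch below uses synthetic data from make_regression as a stand-in, which is an assumption made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import permutation_test_score

# Synthetic data (an assumption for illustration only).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Cross-validated R^2 on the true targets and on 100 shuffled copies of them.
score, perm_scores, p_value = permutation_test_score(
    LinearRegression(),
    X_demo,
    y_demo,
    scoring="r2",
    cv=5,
    n_permutations=100,
    random_state=0,
)

print(f"cross-validated R^2 on true targets: {score:.3f}")
print(f"best R^2 on shuffled targets: {perm_scores.max():.3f}")
print(f"p-value: {p_value:.4f}")
```

Because the shuffled models are scored on held-out folds, their R^2 values are typically at or below zero rather than slightly positive as in the training-set version.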
