Y-Scrambling is a method for testing whether a model’s predictions are better than what chance alone would produce. It is commonly used to validate multiple linear regression QSPR models. It goes by several names: Y-Scrambling, Y-Randomization, Y-Permutation, and so on. The procedure is simple to execute, and we’ll go through it in detail.
Steps for Y-Scrambling:
The intuition behind Y-Scrambling is simple. First, you train your model on the original data and note its performance metric. Next, you shuffle the target column so that the correct feature-target pairs are replaced with incorrect ones. You then train the model on this shuffled data, note its performance metric, re-shuffle the target column, and repeat. We expect the model to perform well on the original data and poorly on the shuffled data. If that is not the case and the metric barely changes, the model’s apparent performance on the original data may be due to chance rather than a real feature-target relationship. The step-wise process is as follows:
- Train the model on the original feature-target pairs.
- Note the performance metric.
- Repeat for a fixed number of iterations:
  - Shuffle the target column.
  - Train the model on the new feature-target pairs.
  - Note the performance metric.
- Compare the metric from the original pairs with the metrics from the shuffled ones.
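The steps above can be sketched as a small reusable function. This is a minimal illustration, not the tutorial's exact code: it uses NumPy's least-squares solver as a stand-in for a linear regression model, the helper names `r2_linear_fit` and `y_scramble` are hypothetical, and the demo data is synthetic.

```python
import numpy as np


def r2_linear_fit(X, y):
    """Fit ordinary least squares with an intercept and return the training R^2."""
    A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
    residual = y - A @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def y_scramble(X, y, n_iter=100, seed=0):
    """Return (score on original targets, array of scores on shuffled targets)."""
    rng = np.random.default_rng(seed)
    original = r2_linear_fit(X, y)
    shuffled = np.array(
        [r2_linear_fit(X, rng.permutation(y)) for _ in range(n_iter)]
    )
    return original, shuffled


# Synthetic demo data (an assumption, not the tutorial's dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

original_score, shuffled_scores = y_scramble(X_demo, y_demo)
print(original_score, shuffled_scores.max())
```

Because the demo targets really do depend on the features, the original score should sit far above every shuffled score, which is the pattern Y-Scrambling looks for.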
For this tutorial, I’ll use the Boston house pricing dataset from sklearn’s datasets module. The loader returns a dictionary-like object with the features under the data key and the targets under the target key. Let’s start by importing the data:
import numpy as np
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version

data = load_boston()
X = data.data    # feature matrix
Y = data.target  # target vector
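`load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the import fails on recent versions. A minimal sketch of a stand-in, assuming you are willing to swap the dataset: `load_diabetes` ships with scikit-learn and needs no download. The numbers you get will differ from those reported in this tutorial, which were obtained on the Boston data.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Stand-in dataset (an assumption; the tutorial itself uses Boston housing).
X, Y = load_diabetes(return_X_y=True)
print(X.shape, Y.shape)
```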
Now that we have the features and target, let’s execute the first two steps of Y-Scrambling, i.e. training the model and noting the performance metric.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

reg = LinearRegression()
reg.fit(X, Y)
ypred = reg.predict(X)  # predictions on the training data
original_r2 = r2_score(Y, ypred)
print(original_r2)
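For reference, the coefficient of determination that `r2_score` reports is defined as follows, where $y_i$ are the observed targets, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the observed targets:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value near 1 means the model explains most of the variance in the target; a value near 0 means it does no better than always predicting the mean.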
The original_r2 came out to be 0.74064. With this, we’ve completed the first two steps. Next, we shuffle the target array, train the model, and store the performance metric, all inside a loop. These steps are repeated for a fixed number of iterations, which I set to 100 for this tutorial. One thing to note is that
from tqdm.notebook import trange

shuffled_r2 = []  # R^2 scores obtained on shuffled targets
for i in trange(100):
    np.random.shuffle(Y)  # shuffles Y in place
    reg = LinearRegression()
    reg.fit(X, Y)
    ypred = reg.predict(X)
    shuffled_r2.append(r2_score(Y, ypred))
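`np.random.shuffle` reorders Y in place, so once the loop has run the original target order is gone, and without a seed the results change from run to run. A minimal sketch of a non-destructive, reproducible alternative using NumPy's `Generator.permutation`, which returns a shuffled copy; the array `y_demo` and the seed value are illustrative assumptions:

```python
import numpy as np

y_demo = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical targets
y_backup = y_demo.copy()                            # kept only to verify nothing changed

rng = np.random.default_rng(0)        # fixed seed makes runs repeatable
y_shuffled = rng.permutation(y_demo)  # shuffled copy; y_demo is untouched

print(y_demo)      # original order preserved
print(y_shuffled)  # same values in a random order
```

In the loop, this would mean fitting on `rng.permutation(Y)` instead of calling `np.random.shuffle(Y)`, so the original Y stays available for later analysis.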
If you print shuffled_r2, you’ll see that the model performed poorly on the shuffled targets. The first few values of shuffled_r2 are as follows:
>>> shuffled_r2[:20]
[0.015336761335013271, 0.0176654793204013, 0.01740534118134418, 0.02319807700450416, 0.018487786525668626, 0.02251746334707183, 0.03766952947632973, 0.01854475963361435, 0.03570134149232318, 0.022607830815118635, 0.016603896471999002, 0.0386838401376941, 0.024355424374905343, 0.04058673452547956, 0.014581835385169217, 0.03193842111822809, 0.03366492627548756, 0.02274120932669821, 0.04335824299249236, 0.02665799106621214]
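To finish the comparison step, two quick checks can help. The sketch below assumes an ordinary least-squares fit with an intercept, scored on its own training data: in that setting the training R² averaged over permutations of the target is p / (n - 1) for p features and n samples. It also computes an empirical p-value, which is one common way (not the only one) to summarise a permutation test. The helper names are hypothetical, and only the first five shuffled scores from the output above are reused for illustration.

```python
def expected_shuffled_r2(n_samples, n_features):
    """Mean training R^2 of an OLS fit with intercept under target permutation."""
    return n_features / (n_samples - 1)


def empirical_p_value(original, shuffled):
    """Share of shuffled scores at least as good as the original (+1 correction)."""
    n_better = sum(score >= original for score in shuffled)
    return (n_better + 1) / (len(shuffled) + 1)


# The Boston housing data has 506 samples and 13 features.
baseline = expected_shuffled_r2(506, 13)

# Original score and the first five shuffled scores reported in this tutorial.
original_r2 = 0.74064
first_shuffled = [0.015336761335013271, 0.0176654793204013, 0.01740534118134418,
                  0.02319807700450416, 0.018487786525668626]

print(round(baseline, 4))
print(empirical_p_value(original_r2, first_shuffled))
```

The shuffled scores listed above sit close to the 13/505 baseline of roughly 0.026, which is what a linear model with no real signal would produce on this data, while the original score of 0.74 is far above it. If none of the 100 shuffled scores reaches the original, the empirical p-value over the full run would be 1/101, about 0.0099.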