A **Validation Curve** is an important diagnostic tool that shows how sensitive a Machine Learning model’s accuracy is to changes in one of the model’s parameters.

A validation curve is typically drawn between some parameter of the model and the model’s score. The plot contains two curves: one for the training set score and one for the cross-validation score. By default, the `validation_curve` function in the scikit-learn library performs 5-fold cross-validation in recent releases (version 0.22 and later); older releases defaulted to 3-fold.
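As a minimal sketch of what the function returns, the snippet below calls `validation_curve` with an explicit `cv=5` so the result does not depend on the installed version's default. The three values of `n_neighbors` and the variable names are chosen here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Three illustrative parameter values (an assumption for this sketch)
k_values = np.array([1, 5, 9])

train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=k_values,
    cv=5, scoring="accuracy",
)

# Each array has one row per parameter value and one column per fold
print(train_scores.shape)  # (3, 5)
print(val_scores.shape)    # (3, 5)
```

Averaging each row (`axis=1`) gives the one point per parameter value that is drawn on each curve.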

A validation curve is used to evaluate an existing model across values of a hyper-parameter; it is not meant for tuning the model. If we tune the model according to the validation score, the model may become biased towards the specific data against which it is tuned, so that score is no longer a good estimate of how well the model generalizes.
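If a parameter value is read off the curve anyway, one common safeguard is to keep a separate test set that plays no part in that choice. The sketch below is only an illustration of that idea: the 80/20 split, the candidate values of `n_neighbors`, and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Hold out 20% of the data; it is never used while choosing the parameter
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Candidate values picked for illustration only
k_values = np.array([1, 3, 5])
_, val_scores = validation_curve(
    KNeighborsClassifier(), X_dev, y_dev,
    param_name="n_neighbors", param_range=k_values,
    cv=5, scoring="accuracy",
)

# Choose the value with the highest mean cross-validation accuracy
chosen_k = int(k_values[np.argmax(val_scores.mean(axis=1))])

# Refit on the development data and score once on the untouched test set
final_model = KNeighborsClassifier(n_neighbors=chosen_k).fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(chosen_k, round(test_accuracy, 3))
```

The accuracy on `X_test` is then a less biased estimate of generalization than the cross-validation score that guided the choice.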

**Interpreting a Validation Curve**

Interpreting the results of a validation curve can sometimes be tricky. Keep the following points in mind while looking at a validation curve:

- Ideally, we would want both the validation curve and the training curve to look as similar as possible.
- If both scores are low, the model is likely to be **underfitting**. This means either the model is too simple or it is informed by too few features. It could also be the case that the model is regularized too much.
- If the training curve reaches a high score relatively quickly and the validation curve is lagging behind, the model is **overfitting**. This means the model is very complex and there is too little data; or it could simply mean there is too little data.
- We would want the value of the parameter where the training and validation curves are closest to each other.
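The rules of thumb above can be sketched as a small helper. The function name `diagnose`, the thresholds `low_score` and `max_gap`, and the example scores are all made up for illustration; they are not standard values and would need adjusting for a real problem.

```python
import numpy as np

def diagnose(mean_train, mean_val, low_score=0.7, max_gap=0.1):
    """Label each parameter value as 'underfitting', 'overfitting' or 'ok'.

    low_score and max_gap are illustrative thresholds, not standard values.
    """
    mean_train = np.asarray(mean_train, dtype=float)
    mean_val = np.asarray(mean_val, dtype=float)
    labels = []
    for train, val in zip(mean_train, mean_val):
        if train < low_score and val < low_score:
            # Both curves are low: the model is too simple or over-regularized
            labels.append("underfitting")
        elif train - val > max_gap:
            # Training score is well above the validation score
            labels.append("overfitting")
        else:
            labels.append("ok")
    return labels

# Made-up mean scores for three parameter values, for illustration only
example_train = [0.60, 0.99, 0.93]
example_val = [0.58, 0.80, 0.91]
labels = diagnose(example_train, example_val)
print(labels)  # ['underfitting', 'overfitting', 'ok']
```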

**Implementation of Validation Curves in Python:**

For the sake of simplicity, in this example, we will use the very popular ‘*digits*’ dataset. More information about this dataset is available at the link below:

https://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image

For this example, we will use a k-Nearest Neighbours classifier and plot the model’s training accuracy and cross-validation accuracy against the value of ‘k’, i.e., the number of neighbours to consider.

**Code: Python code to implement 5-fold cross-validation and to test the value of ‘k’ from 1 to 10.**

```python
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import validation_curve

# Loading dataset
dataset = load_digits()

# X contains the data and y contains the labels
X, y = dataset.data, dataset.target

# Setting the range for the parameter (from 1 to 10 inclusive;
# np.arange excludes the stop value, so the stop is 11)
parameter_range = np.arange(1, 11, 1)

# Calculate accuracy on the training and validation folds for each
# value of n_neighbors, using 5-fold cross-validation
train_score, test_score = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors",
    param_range=parameter_range,
    cv=5, scoring="accuracy")

# Calculating mean and standard deviation of training score
mean_train_score = np.mean(train_score, axis=1)
std_train_score = np.std(train_score, axis=1)

# Calculating mean and standard deviation of cross-validation score
mean_test_score = np.mean(test_score, axis=1)
std_test_score = np.std(test_score, axis=1)

# Plot mean accuracy scores for training and cross-validation
plt.plot(parameter_range, mean_train_score,
         label="Training Score", color='b')
plt.plot(parameter_range, mean_test_score,
         label="Cross Validation Score", color='g')

# Creating the plot
plt.title("Validation Curve with KNN Classifier")
plt.xlabel("Number of Neighbours")
plt.ylabel("Accuracy")
plt.tight_layout()
plt.legend(loc='best')
plt.show()
```

**Output:**

From this graph, we can observe that *‘k’ = 2* would be the ideal value of k. As the number of neighbours (k) increases, both the training accuracy and the cross-validation accuracy decrease.
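Reading a value off the plot can also be done numerically. The sketch below assumes score arrays shaped like those returned by `validation_curve` (one row per parameter value, one column per fold) and reports two candidates: the value with the highest mean cross-validation score and the value where the two curves are closest. The helper name `summarize_curve` and the scores are invented for illustration and are not the output of the code above.

```python
import numpy as np

def summarize_curve(param_range, train_scores, val_scores):
    """Return (value with best mean validation score, value with smallest gap)."""
    param_range = np.asarray(param_range)
    mean_train = np.mean(train_scores, axis=1)
    mean_val = np.mean(val_scores, axis=1)
    best_by_score = param_range[np.argmax(mean_val)]
    best_by_gap = param_range[np.argmin(np.abs(mean_train - mean_val))]
    return best_by_score, best_by_gap

# Made-up scores for k = 1, 2, 3 with two folds each, for illustration only
k_values = [1, 2, 3]
train_scores = np.array([[1.00, 1.00], [0.98, 0.98], [0.90, 0.90]])
val_scores = np.array([[0.90, 0.92], [0.94, 0.96], [0.88, 0.90]])

by_score, by_gap = summarize_curve(k_values, train_scores, val_scores)
print(by_score, by_gap)  # 2 3
```

The two criteria need not agree, as in this toy data, so it helps to look at the plot alongside the numbers.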