Evaluation is always good in any field, right? In machine learning, it is a best practice. In this post, I will cover almost all of the popular and commonly used metrics for machine learning.
- Confusion Matrix
- Classification Accuracy
- Logarithmic Loss
- Area Under Curve
- F1 Score
- Mean Absolute Error
- Mean Squared Error
Confusion Matrix:
A confusion matrix is an N × N matrix, where N is the number of classes or categories to be predicted. Suppose we have a binary classification problem in which each sample belongs to either Yes or No. Here N = 2, so we get a 2 × 2 matrix. We build a classifier that predicts the class of a new input sample. We then test the model on 165 samples and get the following result.
There are 4 terms you should keep in mind:
- True Positives: We predicted Yes and the real output was also Yes.
- True Negatives: We predicted No and the real output was also No.
- False Positives: We predicted Yes but the real output was No.
- False Negatives: We predicted No but the real output was Yes.
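The accuracy for the matrix is calculated by summing the values on the main diagonal and dividing by the total number of samples, i.e.

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}}$$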
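As a minimal sketch of the four terms and the main-diagonal accuracy, the snippet below counts them by hand. The function names and the six labels are made up for illustration. They are not the 165-sample result from this post.

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Return (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted Yes, actually Yes
            else:
                fp += 1  # predicted Yes, actually No
        else:
            if actual == positive:
                fn += 1  # predicted No, actually Yes
            else:
                tn += 1  # predicted No, actually No
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    """Main-diagonal entries (TP + TN) divided by the total sample count."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels, for illustration only.
y_true = ["Yes", "Yes", "No", "No", "Yes", "No"]
y_pred = ["Yes", "No", "No", "Yes", "Yes", "No"]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)
print(accuracy_from_counts(tp, tn, fp, fn))
```

With these toy labels, 4 of the 6 predictions sit on the main diagonal, so the accuracy is 4/6.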
Classification Accuracy:
Classification accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples.
It works well only if there are an equal number of samples in each class. For example, suppose 90% of the samples in our training set belong to class A and 10% belong to class B. Then a model can reach 90% training accuracy simply by predicting that every training sample belongs to class A. If we test the same model on a test set with 60% of samples from class A and 40% from class B, the accuracy falls to 60%.
Classification accuracy is useful, but it can give a false sense of achieving high accuracy. The problem arises because the chance of misclassifying minority-class samples is very high.
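The 90%/10% example above can be reproduced with a few lines. This is only a sketch: `always_a` is a hypothetical "model" that ignores its input and always answers class A, and the label lists are synthetic.

```python
def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of samples."""
    correct = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
    return correct / len(y_true)


def always_a(samples):
    """Hypothetical degenerate model: predicts class A for every sample."""
    return ["A"] * len(samples)


# Synthetic label sets matching the proportions in the text.
train_labels = ["A"] * 90 + ["B"] * 10  # 90% A, 10% B
test_labels = ["A"] * 60 + ["B"] * 40   # 60% A, 40% B

train_acc = accuracy(train_labels, always_a(train_labels))
test_acc = accuracy(test_labels, always_a(test_labels))
print(train_acc, test_acc)  # 0.9 0.6
```

The model never learned anything about class B, yet the training accuracy alone would not reveal that.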
Logarithmic Loss:
Logarithmic loss is also known as log loss. Its basic working principle is to penalize false classifications. It usually works well with multi-class classification. To use log loss, the classifier must assign a probability to every class for each sample. If there are N samples belonging to M classes, then we calculate the log loss in this way:

$$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$
The terms are:
- y_ij indicates whether sample i belongs to class j.
- p_ij is the probability that sample i belongs to class j.
- The range of log loss is [0, ∞). A log loss near 0 indicates high accuracy, and a log loss far from 0 indicates lower accuracy.
- As a bonus point: in general, minimizing log loss gives you higher accuracy for the classifier.
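A minimal sketch of the log loss calculation, using the y_ij and p_ij terms defined above. The clipping constant `eps` is an assumption added here to avoid taking log(0). The two probability tables are invented for illustration.

```python
import math


def log_loss(y, p, eps=1e-15):
    """y[i][j] is 1 if sample i belongs to class j, else 0.
    p[i][j] is the predicted probability that sample i belongs to class j."""
    n = len(y)
    total = 0.0
    for y_row, p_row in zip(y, p):
        for y_ij, p_ij in zip(y_row, p_row):
            # Assumption: clip probabilities so log() never sees exactly 0 or 1.
            p_ij = min(max(p_ij, eps), 1 - eps)
            total += y_ij * math.log(p_ij)
    return -total / n


# Two samples, three classes (made-up values).
y = [[1, 0, 0], [0, 1, 0]]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]  # high probability on the true class
poor = [[0.2, 0.4, 0.4], [0.5, 0.3, 0.2]]         # low probability on the true class

confident_loss = log_loss(y, confident)
poor_loss = log_loss(y, poor)
print(confident_loss, poor_loss)
```

The classifier that puts more probability on the true class gets the smaller loss, which is the penalizing behaviour described above.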
Area Under Curve (AUC):
AUC is one of the most widely used metrics and is mainly used for binary classification. The AUC of a classifier is defined as the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before going further into AUC, let me introduce a few basic terms.
True Positive Rate: Also called sensitivity. It is the proportion of positive data points that are correctly classified as positive, out of all positive data points: True Positive Rate = True Positives / (True Positives + False Negatives).
True Negative Rate: Also called specificity. It is the proportion of negative data points that are correctly classified as negative, out of all negative data points: True Negative Rate = True Negatives / (True Negatives + False Positives).
False Positive Rate: It is the proportion of negative data points that are mistakenly classified as positive, out of all negative data points: False Positive Rate = False Positives / (False Positives + True Negatives).
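The three rates can be sketched directly from confusion-matrix counts. The function name and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def rates(tp, tn, fp, fn):
    """Return (true positive rate, true negative rate, false positive rate)."""
    tpr = tp / (tp + fn)  # sensitivity: share of actual positives found
    tnr = tn / (tn + fp)  # specificity: share of actual negatives found
    fpr = fp / (fp + tn)  # share of actual negatives flagged as positive
    return tpr, tnr, fpr


# Made-up counts for illustration.
tpr, tnr, fpr = rates(tp=8, tn=5, fp=5, fn=2)
print(tpr, tnr, fpr)  # 0.8 0.5 0.5
```

Note that the True Negative Rate and the False Positive Rate share the same denominator, so they always sum to 1.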
False Positive Rate and True Positive Rate both have values in the range [0, 1]. So what is AUC? AUC is the area under the curve obtained by plotting True Positive Rate against False Positive Rate at different classification thresholds in [0, 1]. The greater the AUC, the better the performance of the model.
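As a rough sketch, AUC can be computed straight from its ranking definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score. Counting a tie as half a win is a common convention and is an assumption here. The labels and scores are made up.

```python
def auc_by_ranking(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p_score in pos:
        for n_score in neg:
            if p_score > n_score:
                wins += 1.0
            elif p_score == n_score:
                wins += 0.5  # assumption: ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive, 0 = negative) and classifier scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(auc_by_ranking(labels, scores))  # 0.75
```

Three of the four positive/negative pairs are ranked correctly, which gives 0.75. This pairwise loop is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.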
F1 Score:
The F1 score is the harmonic mean of precision and recall. Its range is [0, 1]. This metric tells us how precise our classifier is (how many of the instances it classifies are classified correctly) and how robust it is (it does not miss a significant number of instances).
It is used to measure the test's accuracy.
High precision with low recall gives you a very accurate classifier, but one that misses a large number of instances. The higher the F1 score, the better the performance. It can be expressed mathematically in this way:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
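A small sketch of the high-precision, low-recall case described above. The function name and the counts are hypothetical, and the code assumes at least one true positive so that no division by zero occurs.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts.
    Assumes tp > 0 so that none of the denominators is zero."""
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up classifier: every positive prediction is right, but it finds
# only 10 of the 100 actual positives.
precision, recall, f1 = precision_recall_f1(tp=10, fp=0, fn=90)
print(precision, recall, f1)
```

Precision is perfect here, but the harmonic mean pulls the F1 score down toward the much lower recall.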
Mean Absolute Error:
It is the average distance between the predicted and original values. Basically, it tells us how far our predictions are from the actual output. However, there is one limitation: it gives no idea about the direction of the error, that is, whether we are under-predicting or over-predicting. It can be represented mathematically in this way:

$$\text{MAE} = \frac{1}{N} \sum_{j=1}^{N} \left| y_j - \hat{y}_j \right|$$

where y_j is the original value and ŷ_j is the predicted value for sample j.
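The direction limitation is easy to see in a short sketch. The values below are invented: one set of predictions is always 2 too high and the other is always 2 too low.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between original and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration.
actual = [10, 20, 30]
over_predicted = [12, 22, 32]   # every prediction is 2 too high
under_predicted = [8, 18, 28]   # every prediction is 2 too low

mae_over = mean_absolute_error(actual, over_predicted)
mae_under = mean_absolute_error(actual, under_predicted)
print(mae_over, mae_under)  # 2.0 2.0
```

Both sets of predictions get the same score, so the metric alone cannot tell over-prediction from under-prediction.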
Mean Squared Error:
It is similar to mean absolute error, but the difference is that it takes the average of the square of the difference between the predicted and original values. The main advantage of this metric is that it is easier to compute the gradient, whereas for mean absolute error it takes more complicated tools to compute the gradient. Because the errors are squared, larger errors are pronounced more than smaller ones, so we can focus more on the larger errors. It can be expressed mathematically in this way:

$$\text{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( y_j - \hat{y}_j \right)^2$$

where y_j is the original value and ŷ_j is the predicted value for sample j.
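A sketch of mean squared error and its gradient with respect to the predictions. The helper names and the toy values are assumptions for illustration. Both prediction sets have the same total absolute error (4), but one spreads it evenly and the other concentrates it in a single sample.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between original and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mse_gradient(y_true, y_pred):
    """Partial derivative of MSE with respect to each prediction: 2 * (pred - actual) / N."""
    n = len(y_true)
    return [2 * (p - a) / n for a, p in zip(y_true, y_pred)]


# Made-up values for illustration.
actual = [0, 0, 0, 0]
small_errors = [1, 1, 1, 1]     # four errors of size 1
one_large_error = [0, 0, 0, 4]  # a single error of size 4

print(mean_squared_error(actual, small_errors))     # 1.0
print(mean_squared_error(actual, one_large_error))  # 4.0
print(mse_gradient(actual, one_large_error))        # [0.0, 0.0, 0.0, 2.0]
```

The single large error produces four times the loss, and the gradient is a simple closed-form expression that points only at the badly predicted sample.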