What is the default threshold in Sklearn logistic regression?

Last Updated : 08 Jun, 2023

Logistic regression is a popular machine learning algorithm used for classification problems. It is commonly used in applications such as marketing, finance, and healthcare. A critical part of using it for classification is the threshold, which determines the class label of the prediction. In this article, we will discuss the default threshold in Sklearn logistic regression and provide some code examples to illustrate its usage.

Understanding the Threshold in Logistic Regression:

The probability of the positive class is calculated using the sigmoid (logistic) function:

P(y = 1 | x) = 1 / (1 + e^(-z))

Here z is a linear combination of the input features: each feature is multiplied by its learned coefficient, the products are summed, and the intercept term is added.

z = (intercept term) + (coefficient 1 * input feature 1) + (coefficient 2 * input feature 2)
    + ... + (coefficient n * input feature n)

The intercept term and the coefficients are learned during training of the logistic regression model.
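
As a quick check, this computation can be reproduced by hand from a fitted model's coef_ and intercept_ attributes. The snippet below is a minimal sketch on a toy dataset (the dataset and variable names are illustrative, not part of the original example):

Python3

# reproducing predict_proba by hand from the learned parameters
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary classification data
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(random_state=0).fit(X, y)

# z = intercept + sum of (coefficient_i * feature_i)
z = model.intercept_ + X @ model.coef_.ravel()

# sigmoid turns z into the probability of the positive class
p_manual = 1 / (1 + np.exp(-z))

# matches the second column of predict_proba
print(np.allclose(p_manual, model.predict_proba(X)[:, 1]))  # True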

The threshold value converts this probability into a binary classification decision: if the probability of the positive class is greater than or equal to the threshold, the model predicts the positive class; otherwise, it predicts the negative class. For example, with a threshold of 0.5, a probability of 0.7 is classified as class 1 and a probability of 0.3 as class 0. However, the default threshold is not always optimal for the problem at hand, so it is essential to understand how to change it to improve the performance of the model.

The Default Threshold in Sklearn Logistic Regression:
The default threshold value in Sklearn logistic regression is 0.5: the predict method labels a sample as the positive class when its predicted probability exceeds 0.5. Note that predict does not accept a threshold parameter; to classify with a different threshold, obtain the probabilities with the predict_proba method and apply your own cutoff, as the example below demonstrates.
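
Recent scikit-learn releases (1.5 and later, if available in your environment) also provide a FixedThresholdClassifier wrapper in sklearn.model_selection that bakes a custom cutoff into predict. The snippet below is a hedged sketch assuming that version; on older versions, threshold predict_proba manually as in the example that follows.

Python3

# sketch: requires scikit-learn >= 1.5 for FixedThresholdClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import FixedThresholdClassifier, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0)

# wrap the estimator so that predict uses a 0.9 cutoff instead of 0.5
clf = FixedThresholdClassifier(LogisticRegression(random_state=0),
                               threshold=0.9).fit(X_train, y_train)
print(clf.predict(X_test))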

Syntax of the predict_proba() method:

predict_proba(X) returns probability estimates; the returned estimates for all classes are ordered by the label of classes (model.classes_).

Parameters:
X : array-like of shape (n_samples, n_features). Vector to be scored, where n_samples is the number of samples and n_features is the number of features.

Returns:
T : array-like of shape (n_samples, n_classes). The probability of the sample for each class in the model.

EXAMPLE: changing the default threshold 

STEP 1: Importing the necessary packages: 

The required packages are imported from the scikit-learn library. These include the breast cancer dataset loader, the logistic regression model, train_test_split for splitting the dataset into training and testing sets, several evaluation metrics, and Matplotlib for plotting.

Python3
# importing the necessary packages
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve,auc
import matplotlib.pyplot as plt


STEP 2: Loading data: 

The breast cancer dataset is loaded using the load_breast_cancer function.

Python3
# loading data
data = load_breast_cancer()


STEP 3: Splitting data into train and test: 

The data is split into training and testing sets using the train_test_split function. The data is split into 75% training data and 25% testing data.

Python3
# splitting data into train and test
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)


STEP 4: Logistic regression model:

A logistic regression model is initialized using the LogisticRegression class with random_state set to 0. (On this dataset the default solver may emit a convergence warning; raising max_iter would silence it without changing the point of the example.)

Python3
# logistic regression model
lr = LogisticRegression(random_state=0)


STEP 5: Fitting the model:

The model is trained on the training data using the fit method.

Python3
# fitting the model
lr.fit(X_train, y_train)


STEP 6: Prediction:

The model is used to predict the target variable for the testing data using the predict method, which applies the default 0.5 threshold internally.

Python3
# predicting 
print('prediction with threshold 0.5 :')
y_pred = lr.predict(X_test)
print(y_pred)


Output:

prediction with threshold 0.5 :
[0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 1 0 1 0 1
 0 1 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 1
 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 1
 0 0 1 1 1 1 1 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0]
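
As a sanity check, the same predictions can be produced by thresholding predict_proba at 0.5 yourself. This is a minimal sketch reusing lr, X_test and y_pred from the steps above (at a probability of exactly 0.5 the two rules can disagree, though that is unlikely in practice):

Python3

# reproducing predict() by thresholding predict_proba at 0.5
import numpy as np

y_pred_manual = (lr.predict_proba(X_test)[:, 1] >= 0.5).astype(int)
print(np.array_equal(y_pred_manual, y_pred))  # expected: True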

STEP 7: Changing threshold and predicting:

The threshold is changed to 0.9 and the model is used to predict the target variable for the testing data using the predict_proba function. predict_proba returns the probability estimates for each class; [:, 1] selects the column for class 1 (the second entry of lr.classes_). The probabilities are then converted into binary class labels using the new threshold of 0.9.

Python3
# changing threshold and predicting
print('prediction with threshold 0.9 :')
y_pred_new_threshold = (lr.predict_proba(X_test)[:, 1] >= 0.9).astype(int)
print(y_pred_new_threshold)


Output:

prediction with threshold 0.9 :
[0 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0
 0 1 0 0 1 0 1 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 1 0
 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 0 1 0 0 0 0 1
 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 0]
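
Raising the threshold makes a positive prediction harder to earn, so some samples flip from class 1 to class 0. A quick illustrative check of the effect, reusing the two prediction arrays from above:

Python3

# how many predictions changed when the threshold moved from 0.5 to 0.9
import numpy as np

flipped = np.sum(y_pred != y_pred_new_threshold)
print(flipped, 'of', len(y_pred), 'predictions changed')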

STEP 8: Evaluating:
We evaluate the model's performance by computing several evaluation metrics at both threshold values. The default threshold is 0.5, and we use a new threshold of 0.9 to show the effect of changing the threshold value.

The evaluation metrics used here are:
– Accuracy: The percentage of correct predictions out of all predictions.
– Precision: The proportion of true positives among all positive predictions.
– Recall: The proportion of true positives among all actual positive cases.
– F1 score: The harmonic mean of precision and recall.

All four can be recomputed from the confusion matrix, as the sketch below shows.
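
The following minimal sketch (reusing y_test and y_pred from above) recomputes these metrics from the confusion-matrix counts to ground the definitions:

Python3

# recomputing the metrics from the confusion-matrix counts
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)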

With the default Threshold value of 0.5

Python3
# Evaluation metrics for default threshold
print("Evaluation metrics with threshold 0.5:")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))


Output:

Evaluation metrics with threshold 0.5:
Accuracy: 0.951048951048951
Precision: 0.9770114942528736
Recall: 0.9444444444444444
F1 score: 0.96045197740113

With the Threshold value of 0.9

Python3
# Evaluation metrics for new threshold
print("Evaluation metrics with threshold 0.9:")
print("Accuracy:", accuracy_score(y_test, y_pred_new_threshold))
print("Precision:", precision_score(y_test, y_pred_new_threshold))
print("Recall:", recall_score(y_test, y_pred_new_threshold))
print("F1 score:", f1_score(y_test, y_pred_new_threshold))


Output:

Evaluation metrics with threshold 0.9:
Accuracy: 0.8811188811188811
Precision: 1.0
Recall: 0.8111111111111111
F1 score: 0.8957055214723927

Finally, we plot the ROC curve and the Precision-Recall curve to visualize the model's performance across threshold values.

ROC curve

The ROC curve is a graphical representation of the trade-off between the true positive rate (TPR) and the false positive rate (FPR) as the threshold varies. It shows how well the model can distinguish between the two classes; a good model has a ROC curve close to the top-left corner of the plot. Note that a curve built from the predicted probabilities already sweeps over all thresholds, so the 'Threshold = 0.5' label below refers to the full probability curve, while passing the hard 0/1 predictions for the 0.9 case yields a simplified curve through that single operating point.

Python3
# ROC curve from the predicted probabilities (sweeps all thresholds)
y_scores = lr.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label='Threshold = 0.5')
print('Area Under the ROC Curve for threshold 0.5:', roc_auc_score(y_test, y_scores))

# ROC point from the hard 0/1 predictions at threshold 0.9
fpr_, tpr_, thresholds_ = roc_curve(y_test, y_pred_new_threshold)
print('Area Under the ROC Curve for threshold 0.9:', roc_auc_score(y_test, y_pred_new_threshold))
plt.plot(fpr_, tpr_, label='Threshold = 0.9')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()


Output:

Area Under the ROC Curve for threshold 0.5: 0.9939203354297694
Area Under the ROC Curve for threshold 0.9: 0.9055555555555556
[Figure: ROC curves for the two thresholds]
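
The thresholds array returned by roc_curve can also be used to choose an operating point. As one common heuristic (an illustrative sketch, not part of the original example), the threshold maximizing Youden's J statistic (TPR - FPR) can be read off directly, reusing fpr, tpr and thresholds from the block above:

Python3

# picking the threshold that maximizes Youden's J statistic (TPR - FPR)
import numpy as np

best = np.argmax(tpr - fpr)
print('Best threshold by Youden J:', thresholds[best])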

Precision-Recall curve

The Precision-Recall curve shows the trade-off between precision and recall for different threshold values. It is useful when the classes are imbalanced. A good model will have a curve that is close to the top right corner of the plot.

By changing the threshold value, we can increase the precision and reduce the recall or vice versa. The ROC and Precision-Recall curves provide a visual representation of the model’s performance with different threshold values.

Python3
# Precision-Recall curve from the predicted probabilities
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
plt.plot(recall, precision, label='Threshold = 0.5')
print('Area Under the Curve (AUC) for threshold 0.5:', auc(recall, precision))

# Precision-Recall points from the hard 0/1 predictions at threshold 0.9
# (note: recall goes on the x-axis and precision on the y-axis)
precision_, recall_, thresholds_ = precision_recall_curve(y_test, y_pred_new_threshold)
plt.plot(recall_, precision_, label='Threshold = 0.9')
print('Area Under the Curve (AUC) for threshold 0.9:', auc(recall_, precision_))

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()


Output:

Area Under the Curve (AUC) for threshold 0.5: 0.9964651626297715
Area Under the Curve (AUC) for threshold 0.9: 0.964996114996115
[Figure: Precision-Recall curves for the two thresholds]
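
The thresholds returned by precision_recall_curve can likewise be used to pick a cutoff that meets a target, for instance the lowest threshold that achieves a desired precision. This is an illustrative sketch (the 0.95 target is arbitrary), reusing precision and thresholds from the block above:

Python3

# smallest threshold whose precision reaches a target (illustrative target: 0.95)
import numpy as np

target_precision = 0.95
# precision has one more entry than thresholds, so drop the final point
idx = np.argmax(precision[:-1] >= target_precision)
print('Threshold for precision >=', target_precision, ':', thresholds[idx])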


