ML | Kaggle Breast Cancer Wisconsin Diagnosis using KNN

Dataset :
The dataset is provided by Kaggle, sourced from the UCI Machine Learning Repository, as part of one of its challenges.
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. It is a dataset of breast cancer patients with malignant and benign tumors.
The k-nearest neighbor (KNN) algorithm is used to predict whether a patient has cancer (malignant tumor) or not (benign tumor).

Implementation of the KNN algorithm for classification.
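
For intuition, here is a minimal NumPy sketch of the decision rule KNN applies: classify a query point by a majority vote among its k nearest training points under Euclidean distance. The helper knn_predict is illustrative only and is not part of the pipeline below, which uses sklearn.

# a minimal, illustrative KNN decision rule (not used later)
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k = 5):
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis = 1)
    # labels of the k nearest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # majority vote decides the predicted class
    return Counter(nearest_labels).most_common(1)[0][0]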

Code : Loading Libraries



# performing linear algebra
import numpy as np

# data processing
import pandas as pd

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns

Code : Loading dataset

df = pd.read_csv("..\\breast-cancer-wisconsin-data\\data.csv")

print(df.head())

Output : the first five rows of the dataframe

Code : Dataset information


df.info()

Output :


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

Code : Dropping the 'id' and 'Unnamed: 32' columns, as they have no role in prediction


df = df.drop(['Unnamed: 32', 'id'], axis = 1)
print(df.shape)

Output:

(569, 31)

Code : Converting the diagnosis values M and B to numerical values
M (Malignant) = 1
B (Benign) = 0


# map the diagnosis label to a numeric class: M -> 1, B -> 0
def diagnosis_value(diagnosis):
    if diagnosis == 'M':
        return 1
    else:
        return 0
  
df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)

Code : Visualising the relation between radius_mean and texture_mean


sns.lmplot(x = 'radius_mean', y = 'texture_mean', hue = 'diagnosis', data = df)

Output : a scatter plot of radius_mean versus texture_mean with linear fits, coloured by diagnosis


Code : Visualising the relation between smoothness_mean and compactness_mean



sns.lmplot(x = 'smoothness_mean', y = 'compactness_mean',
           data = df, hue = 'diagnosis')

Output : a scatter plot of smoothness_mean versus compactness_mean with linear fits, coloured by diagnosis


Code : Input and Output data


# features: every column except the diagnosis label
X = np.array(df.iloc[:, 1:])
# target: the (now numeric) diagnosis column
y = np.array(df['diagnosis'])

Code : Splitting data to training and testing


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.33, random_state = 42)
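
KNN is distance-based, so features measured on larger scales (such as area_mean) can dominate the Euclidean distance. The article's pipeline proceeds with the raw features, but a common refinement, sketched below with sklearn's StandardScaler, is to standardise each feature using statistics from the training split only. The names X_train_scaled and X_test_scaled are illustrative and are not used in the rest of the article.

from sklearn.preprocessing import StandardScaler

# fit the scaler on the training split only, then apply the same
# transformation to the test split, to avoid data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)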

Code : Fitting a KNN classifier using sklearn


from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 13)
knn.fit(X_train, y_train)

Output:

KNeighborsClassifier(algorithm='auto', leaf_size=30, 
             metric='minkowski', metric_params=None, 
             n_jobs=None, n_neighbors=13, p=2, 
             weights='uniform')

Code : Prediction Score


knn.score(X_test, y_test)

Output:

0.9627659574468085
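
Accuracy alone does not show the balance between false negatives and false positives, which matters in a cancer-screening setting. As a supplementary check that is not part of the original article, the fitted model's test-set predictions can be inspected with sklearn's confusion_matrix:

from sklearn.metrics import confusion_matrix

y_pred = knn.predict(X_test)
# rows are true classes (0 = benign, 1 = malignant),
# columns are predicted classes
print(confusion_matrix(y_test, y_pred))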

Code : Performing Cross Validation


from sklearn.model_selection import cross_val_score

neighbors = []
cv_scores = []

# perform 10-fold cross-validation for each odd k from 1 to 49
for k in range(1, 51, 2):
    neighbors.append(k)
    knn = KNeighborsClassifier(n_neighbors = k)
    scores = cross_val_score(
        knn, X_train, y_train, cv = 10, scoring = 'accuracy')
    cv_scores.append(scores.mean())

Code : Misclassification error versus k


# convert cross-validated accuracy to misclassification error
MSE = [1 - x for x in cv_scores]

# determining the best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The optimal number of neighbors is %d' % optimal_k)
  
# plot misclassification error versus k
plt.figure(figsize = (10, 6))
plt.plot(neighbors, MSE)
plt.xlabel('Number of neighbors')
plt.ylabel('Misclassification Error')
plt.show()

Output:

The optimal number of neighbors is 13
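
Having chosen k by cross-validation on the training split, a natural final step, not shown in the original article, is to refit the classifier with optimal_k and evaluate it once on the held-out test set:

# retrain with the cross-validated k and score on the test split
knn = KNeighborsClassifier(n_neighbors = optimal_k)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))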


