ML | Kaggle Breast Cancer Wisconsin Diagnosis using KNN
Dataset :
The dataset is provided on Kaggle, as one of its challenges, and originates from the UCI Machine Learning Repository:
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. It describes breast cancer patients whose tumors are labeled Malignant or Benign.
The K-nearest neighbors (KNN) algorithm is used to predict whether a patient has cancer (Malignant tumor) or not (Benign tumor).
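To make the idea concrete, here is a minimal sketch of KNN classification in plain Python. It is not how scikit-learn implements it; the function name `knn_predict` and the toy points are made up for illustration. The label of a query point is decided by a majority vote among its k closest training points.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points, made up for illustration: 1 = malignant, 0 = benign.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (1.1, 1.0), k=3))  # 0
print(knn_predict(points, labels, (5.1, 5.0), k=3))  # 1
```

With two classes, an odd k avoids tied votes, which is presumably why the cross-validation loop later in the article only tries odd values.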
Implementation of KNN algorithm for classification.
# performing linear algebra
import numpy as np

# data processing
import pandas as pd

# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
Code : Loading dataset
df = pd.read_csv("..\\breast-cancer-wisconsin-data\\data.csv")
print(df.head())
Output :
Code : Dataset information
df.info()
Output :
RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
id                         569 non-null int64
diagnosis                  569 non-null object
radius_mean                569 non-null float64
texture_mean               569 non-null float64
perimeter_mean             569 non-null float64
area_mean                  569 non-null float64
smoothness_mean            569 non-null float64
compactness_mean           569 non-null float64
concavity_mean             569 non-null float64
concave points_mean        569 non-null float64
symmetry_mean              569 non-null float64
fractal_dimension_mean     569 non-null float64
radius_se                  569 non-null float64
texture_se                 569 non-null float64
perimeter_se               569 non-null float64
area_se                    569 non-null float64
smoothness_se              569 non-null float64
compactness_se             569 non-null float64
concavity_se               569 non-null float64
concave points_se          569 non-null float64
symmetry_se                569 non-null float64
fractal_dimension_se       569 non-null float64
radius_worst               569 non-null float64
texture_worst              569 non-null float64
perimeter_worst            569 non-null float64
area_worst                 569 non-null float64
smoothness_worst           569 non-null float64
compactness_worst          569 non-null float64
concavity_worst            569 non-null float64
concave points_worst       569 non-null float64
symmetry_worst             569 non-null float64
fractal_dimension_worst    569 non-null float64
Unnamed: 32                0 non-null float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB
Code: We drop the columns 'id' and 'Unnamed: 32', as they play no role in prediction
# drop() returns a new DataFrame, so the result must be assigned back
df = df.drop(['Unnamed: 32', 'id'], axis=1)
print(df.shape)
Output:
(569, 31)
Converting the diagnosis value of M and B to a numerical value
M (Malignant) = 1
B (Benign) = 0
def diagnosis_value(diagnosis):
    if diagnosis == 'M':
        return 1
    else:
        return 0

df['diagnosis'] = df['diagnosis'].apply(diagnosis_value)
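As an alternative sketch, the same encoding can be written as a dictionary lookup. The names `LABEL_MAP` and `encode_diagnosis` are hypothetical, not part of the original code. Unlike the `else` branch above, which turns anything other than 'M' into 0, a lookup fails loudly on an unexpected label.

```python
# Mapping used in this article: M (Malignant) = 1, B (Benign) = 0.
LABEL_MAP = {'M': 1, 'B': 0}

def encode_diagnosis(labels):
    """Encode 'M'/'B' strings as 1/0; raises KeyError on any other value."""
    return [LABEL_MAP[label] for label in labels]

print(encode_diagnosis(['M', 'B', 'B', 'M']))  # [1, 0, 0, 1]
```

With pandas, `df['diagnosis'].map(LABEL_MAP)` produces the same encoding and leaves NaN where a label is not in the dictionary.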
Code :
sns.lmplot(x='radius_mean', y='texture_mean', hue='diagnosis', data=df)
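Output: [Figure: scatter plot of radius_mean against texture_mean, colored by diagnosis, with a fitted regression line per class]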
Code :
sns.lmplot(x='smoothness_mean', y='compactness_mean', data=df, hue='diagnosis')
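Output: [Figure: scatter plot of smoothness_mean against compactness_mean, colored by diagnosis, with a fitted regression line per class]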
Code : Input and Output data
# features: every column after 'diagnosis'
X = np.array(df.iloc[:, 1:])
# target: the encoded diagnosis column
y = np.array(df['diagnosis'])
Code : Splitting data to training and testing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
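The following is a rough sketch of what a shuffled split with `test_size = 0.33` does, written in plain Python. `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, and it will not reproduce the same shuffling as `random_state = 42` above. It assumes the test-set size is rounded up, which matches scikit-learn's behavior as far as I know.

```python
import math
import random

def simple_train_test_split(rows, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    # Assumption: the test-set size is rounded up to a whole number of rows.
    n_test = math.ceil(test_size * len(rows))
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

# 569 placeholder rows, matching the row count reported by df.info().
train, test = simple_train_test_split(list(range(569)))
print(len(train), len(test))  # 381 188
```

Fixing the seed makes the split repeatable, which is the role `random_state` plays in the call above.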
Code : Using Sklearn
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(X_train, y_train)
Output:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=13, p=2, weights='uniform')
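The output above shows `metric='minkowski'` with `p=2`. As a small sketch (the function `minkowski_distance` is written here for illustration and is not scikit-learn's code), the Minkowski distance reduces to the Euclidean distance for p = 2 and to the Manhattan distance for p = 1:

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

print(minkowski_distance((0, 0), (3, 4), p=2))  # 5.0 (Euclidean)
print(minkowski_distance((0, 0), (3, 4), p=1))  # 7.0 (Manhattan)
```

So with the default settings shown, the 13 neighbors are the 13 training points with the smallest Euclidean distance to the query.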
Code : Prediction Score
knn.score(X_test, y_test)
Output:
0.9627659574468085
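For a classifier, `score` returns the accuracy, that is, the fraction of test samples predicted correctly. The sketch below computes it by hand on made-up labels. Assuming the test set holds 188 rows (33% of 569, rounded up), the reported score is consistent with 181 correct predictions, since 181 / 188 is about 0.9628.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration only: 188 samples, 7 of them mispredicted.
y_true = [1] * 70 + [0] * 118
y_pred = list(y_true)
for i in range(7):
    y_pred[i] = 1 - y_pred[i]  # flip the prediction to simulate an error

print(accuracy(y_true, y_pred))  # about 0.9628, i.e. 181 / 188
```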
Code : Performing Cross Validation
neighbors = []
cv_scores = []

from sklearn.model_selection import cross_val_score

# perform 10 fold cross validation for each odd k from 1 to 49
for k in range(1, 51, 2):
    neighbors.append(k)
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(
        knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
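To illustrate what `cv = 10` means, here is a simplified sketch of how k-fold cross validation partitions the training rows. `k_fold_indices` is a hypothetical helper that produces plain, unshuffled folds; for classifiers, scikit-learn uses a stratified variant that also keeps the class proportions similar in every fold. The 381 rows assume the training-set size implied by the split above.

```python
def k_fold_indices(n_samples, n_folds=10):
    """Yield (train_indices, validation_indices) for each fold, unshuffled."""
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Assumed training-set size: 569 rows minus 188 test rows.
fold_sizes = [len(val) for _, val in k_fold_indices(381, 10)]
print(fold_sizes)  # [39, 38, 38, 38, 38, 38, 38, 38, 38, 38]
```

Each row is used for validation exactly once, and the model is scored on every fold after being fitted on the other nine. The mean of those ten scores is what the loop above stores in `cv_scores`.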
Code : Misclassification error versus k
# misclassification error = 1 - mean cross-validated accuracy
# (despite its name, MSE here is not a mean squared error)
MSE = [1 - x for x in cv_scores]

# determining the best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('The optimal number of neighbors is %d' % optimal_k)

# plot misclassification error versus k
plt.figure(figsize=(10, 6))
plt.plot(neighbors, MSE)
plt.xlabel('Number of neighbors')
plt.ylabel('Misclassification Error')
plt.show()
Output:
The optimal number of neighbors is 13
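One detail of the selection step: `list.index(min(...))` returns the position of the first minimum, so if several values of k tie for the lowest error, the smallest k is chosen. The error values below are made up purely to show this and are not results from the dataset.

```python
# Hypothetical values for illustration only, not measured results.
toy_neighbors = [1, 3, 5, 7]
toy_error = [0.08, 0.05, 0.05, 0.06]

# index() finds the first occurrence of the minimum error.
best_k = toy_neighbors[toy_error.index(min(toy_error))]
print(best_k)  # 3
```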