In this article, we will build a project in Python that uses several machine learning algorithms. By the end, you should understand the basics of applying AI and ML with a scripting language. The following libraries/packages will be used in this project:

- **numpy:** A Python library for scientific computing. Among other things, it provides a powerful array object, mathematical and statistical tools, and tools for integrating code written in other languages such as C/C++ and Fortran.
- **pandas:** A Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
- **matplotlib:** A plotting library for the Python programming language that produces 2D plots for visualization and helps in exploring data sets. matplotlib.pyplot is a collection of command-style functions that make matplotlib work like MATLAB.
- **seaborn:** An open-source Python library built on top of matplotlib. It is used for data visualization and exploratory data analysis, and it works easily with pandas data frames.

```python
# Suppress warning messages so they do not clutter the output
import warnings
warnings.filterwarnings('ignore')
```

Next we will install some **dependencies**. Dependencies are the software components a project requires in order to work as intended and avoid runtime errors. We will need the **numpy, pandas, matplotlib and seaborn** libraries. The operations also need a CSV file; for this project we will use a CSV file that contains **tumor** diagnosis data. By the end of the project, we will be able to predict whether a subject (candidate) is likely to have a malignant tumor or not.

**Step 1: Pre-processing the Data:**

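```python
# Importing dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Including & reading the CSV file.
# The file name is not given in the article; "data.csv" is a placeholder
# for the path to your own copy of the data set.
df = pd.read_csv("data.csv")
```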

Now we will check whether the CSV file has been read successfully, using the **head() method**. head() returns the top n rows (5 by default) of a data frame or series.

```python
df.head()
```
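As a minimal sketch of reading a CSV and previewing it, the snippet below parses a small in-memory sample instead of a file on disk. The three rows and their values are made up for illustration and are not taken from the project's data set.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file
csv_text = """id,diagnosis,radius_mean,texture_mean
1,M,17.99,10.38
2,B,13.54,14.36
3,M,20.57,17.77
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.head(2))  # only the first two rows
print(sample_df.shape)    # (number of rows, number of columns)
```

Passing a number to head() limits the preview to that many rows, which is handy for wide tables.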

```python
# Check the names of all columns
df.columns
```

This command returns the header names of the columns.

To get a quick overview of the data set, we will use the **info() method.** It prints a concise summary of the data frame (column names, non-null counts, and data types), which makes it a useful first step in exploratory analysis.

```python
df.info()
```


The CSV file may contain blank fields that can harm the project by hampering the prediction, so we inspect the suspicious column:

```python
df['Unnamed: 32']
```


Now that we have found the empty values in the data set, we will remove them.

```python
df = df.drop("Unnamed: 32", axis=1)

# to check whether those values are
# deleted or not:
df.head()

# also check the columns after this
# process:
df.columns

df.drop('id', axis=1, inplace=True)
# we can do this also: df = df.drop('id', axis=1)

# To see the change, again go through
# the columns
df.columns
```
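A more general way to locate blank fields, rather than inspecting one column by name, can be sketched as follows. The toy frame and its `empty_col` column are hypothetical stand-ins for the real data, chosen only to show the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "empty_col" mimics a column that is entirely blank
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.9, 13.5, np.nan],
    "empty_col": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop only the columns in which every value is missing
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

Using how="all" keeps columns that are only partly empty, so a single missing value does not cost a whole feature.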

Now we will check the class type of the columns object with the help of the built-in **type() function.** It returns the class type of the object passed as an argument.

```python
type(df.columns)
```

**Output:**

pandas.core.indexes.base.Index

We will need to traverse and group the data by column, so we will save the column names in a list.

```python
l = list(df.columns)
print(l)
```

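Now we will slice this list at different start points. The columns at indices 1 to 10 (the slice l[1:11], whose end index is excluded) go into a variable named **features_mean**, the next ten into **features_se**, and the remainder into **features_worst**.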

```python
features_mean = l[1:11]

features_se = l[11:21]

features_worst = l[21:]
```
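The slicing used above follows Python's rule that the start index is included and the end index is excluded. A small sketch on a hypothetical, shortened column list (the names are invented for illustration) makes the boundaries visible:

```python
# Hypothetical column list: one label column followed by six feature columns
cols = ["label", "a_mean", "b_mean", "a_se", "b_se", "a_worst", "b_worst"]

group_mean = cols[1:3]   # indices 1 and 2; index 3 is excluded
group_se = cols[3:5]     # indices 3 and 4
group_worst = cols[5:]   # index 5 through the end of the list

print(group_mean)
print(group_se)
print(group_worst)
```

Because each slice starts where the previous one ended, no column is skipped or counted twice.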

```python
df.head(2)
```

**In the ‘diagnosis’ column of the CSV file,** there are two possible values, **M = Malignant and B = Benign**, which indicate whether the tumor is cancerous or not. We will verify this from the code.

```python
# To check what values the diagnosis field has
df['diagnosis'].unique()
# M stands for Malignant, B stands for Benign
```

**Output:**

array(['M', 'B'], dtype=object)

So it verifies that there are only two values in the Diagnosis field.

To get a fair idea of how many cases are **malignant and how many are benign,** we will use the countplot() method.

```python
# Bar chart of the number of cases per diagnosis value
sns.countplot(x='diagnosis', data=df);
```

If we don’t need to see the graph, we can instead use a function that returns the number of occurrences of each value.
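The article does not name the function; one option that fits the description is pandas' value_counts(), sketched here on a made-up series of labels rather than the real column.

```python
import pandas as pd

# Hypothetical labels standing in for the 'diagnosis' column
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Absolute number of occurrences of each value
counts = labels.value_counts()
print(counts)

# The same information as a share of all rows
shares = labels.value_counts(normalize=True)
print(shares)
```

On the project's data frame the equivalent call would be df['diagnosis'].value_counts().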

Next we will use the shape attribute. shape returns the dimensions of an array or data frame as a tuple of integers, where each number gives the length of the corresponding dimension. In other words, the “shape” is a tuple with the number of elements per axis (dimension). For instance, a shape equal to (6, 3) means we have 6 rows and 3 columns.

```python
df.shape
```

**Output:**

(569, 31)

which means that the data set has 569 rows and 31 columns.

The data set is now ready to be processed, so we will use the describe() method, which shows basic statistical details such as percentiles, mean, and standard deviation of a data frame or a series of numeric values.

```python
# Summary of all numeric values
df.describe()
```

After all this, we will use the corr() method to find the correlation between different fields. **corr()** computes the pairwise correlation of all columns in the data frame. Any NaN values are automatically excluded. Non-numeric columns were silently ignored by older pandas versions; from pandas 2.0 onward, pass numeric_only=True to exclude them, since the ‘diagnosis’ column still holds text at this point.

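```python
# Correlation table of the numeric columns
# (numeric_only=True skips the text 'diagnosis' column; needs pandas 1.5+)
corr = df.corr(numeric_only=True)
corr
```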

This command produces a table of 30 rows × 30 columns, with rows such as **radius_mean, texture_se** and so on.
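How corr() treats a text column and what the resulting table looks like can be sketched on a toy frame. The column names and values below are invented, and the numeric_only argument assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical frame: one text column and three numeric columns
toy = pd.DataFrame({
    "diagnosis": ["M", "B", "M", "B"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # rises exactly with x
    "z": [4.0, 3.0, 2.0, 1.0],   # falls exactly as x rises
})

# Only the numeric columns take part in the correlation table
toy_corr = toy.corr(numeric_only=True)
print(toy_corr)
print(toy_corr.shape)
```

The result is always square, with one row and one column per numeric feature, which is why 30 numeric columns give a 30 × 30 table.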

The attribute corr.shape will return (30, 30). The next step is plotting the statistics as a heatmap. A heatmap is a two-dimensional graphical representation of data in which the individual values contained in a matrix are represented as colors. The seaborn package allows the creation of annotated heatmaps, which can be adjusted with Matplotlib tools as the creator requires.

```python
# making a heatmap
plt.figure(figsize=(14, 14))
sns.heatmap(corr)
```

Again we will check the data set to make sure that the columns are intact and have not been affected by the operations.

```python
df.head()
```

This returns a table from which we can confirm that the data set is still in good order. In the next few commands, we will separate the data into features and target.

```python
# encode the target: Malignant -> 1, Benign -> 0
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df.head()

df['diagnosis'].unique()

# X holds the features, y holds the target
X = df.drop('diagnosis', axis=1)
X.head()

y = df['diagnosis']
y.head()
```
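One detail of map() worth knowing is that any value missing from the dictionary becomes NaN instead of raising an error. A short sketch on a made-up series (the lowercase "m" is a deliberately wrong entry) shows how to check for that:

```python
import pandas as pd

# Hypothetical labels; the last one uses an unexpected lowercase letter
raw = pd.Series(["M", "B", "B", "m"])

# Values not found in the dictionary are mapped to NaN
encoded = raw.map({"M": 1, "B": 0})
print(encoded)

# A non-zero count here means some labels were not recognised
unmapped = int(encoded.isna().sum())
print(unmapped)
```

Running the same isna().sum() check on the encoded target before training is a cheap safeguard against silently losing rows.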

**Note:** The prepared data can now be used with any machine learning model, so we will go through several algorithms one by one and show the prediction output of each.

**Step 2: Training and Testing on the Data Set**

**Using Logistic Regression Model:**

```python
# divide the dataset into train and test set
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

df.shape
# o/p: (569, 31)

X_train.shape
# o/p: (398, 30)

X_test.shape
# o/p: (171, 30)

y_train.shape
# o/p: (398,)

y_test.shape
# o/p: (171,)

X_train.head(1)
# will return only the first row

# scale the features: fit on the training set,
# then reuse the same scaling on the test set
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

X_train
```
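The split above draws different rows on every run because no seed is set. The sketch below shows one way to make it repeatable and to keep the class balance equal in both parts. It is not the article's code: it assumes scikit-learn's bundled breast cancer data set as a stand-in for the CSV (in that copy the target is coded 0 = malignant, 1 = benign), and the seed value 42 is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in data set (assumption): 569 samples with 30 numeric features
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Fitting the scaler on the training rows alone keeps information from the test rows out of the model, which is the same reason the article calls transform() rather than fit_transform() on the test set.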


After splitting and scaling the data, we can train and test **machine learning models** on it. We will use **Logistic Regression, Decision Tree Classifier, Random Forest Classifier and SVM.**

```python
# apply Logistic Regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

# predictions of the logistic regression model on the test set
y_pred = lr.predict(X_test)
y_pred

# array containing the actual output
y_test
```


To measure numerically how often the model predicted the correct value:

```python
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
```

**Output:**

0.9883040935672515
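Accuracy is a single number, so it does not show which kind of mistake the model makes. As a hedged illustration, the sketch below computes accuracy next to a confusion matrix on two short, made-up label lists (not the project's predictions), using 1 for malignant and 0 for benign as in the article.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = malignant, 0 = benign)
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Share of predictions that match the true labels
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In a diagnosis setting the false negatives (malignant cases predicted as benign) are usually the most costly errors, so it is worth looking at them separately.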

Now let’s frame the results in the form of a table.

```python
# store the accuracy of the logistic regression model
lr_acc = accuracy_score(y_test, y_pred)

# this is the first model, so it starts the results table
results = pd.DataFrame({'Algorithm': ['Logistic Regression Method'],
                        'Accuracy': [lr_acc]})
results
```


**Using Decision Tree Model:**

```python
# apply Decision Tree Classifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

y_pred = dtc.predict(X_test)
y_pred

# store the accuracy so it can be added to the results table
dtc_acc = accuracy_score(y_test, y_pred)
print(dtc_acc)

# Tabulating the results
tempResults = pd.DataFrame({'Algorithm': ['Decision tree Classifier Method'],
                            'Accuracy': [dtc_acc]})
results = pd.concat([results, tempResults])
results = results[['Algorithm', 'Accuracy']]
results
```


**Using Random Forest Model:**

```python
# apply Random Forest Classifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
y_pred

# store the accuracy so it can be added to the results table
rfc_acc = accuracy_score(y_test, y_pred)
print(rfc_acc)

# tabulating the results
tempResults = pd.DataFrame({'Algorithm': ['Random Forest Classifier Method'],
                            'Accuracy': [rfc_acc]})

results = pd.concat([results, tempResults])
results = results[['Algorithm', 'Accuracy']]
results
```


**Using SVM:**

```python
# apply Support Vector Machine
from sklearn import svm
svc = svm.SVC()
svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)
y_pred

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
```


**Now we can check which model produced the highest number of correct predictions using this table:**

```python
# store the accuracy of the support vector classifier
svc_acc = accuracy_score(y_test, y_pred)

# Tabulating the results
tempResults = pd.DataFrame({'Algorithm': ['Support Vector Classifier Method'],
                            'Accuracy': [svc_acc]})
results = pd.concat([results, tempResults])
results = results[['Algorithm', 'Accuracy']]
results
```


Having gone through the accuracy of the machine learning algorithms used above, I can conclude that they give the same output every time the same data set is fed in, provided the train/test split and any random seeds are fixed (for example with the random_state parameter); otherwise the scores vary slightly between runs. I can also say that these algorithms mostly give similar prediction accuracy even if the data set is changed.
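Whether a run repeats exactly depends on the random seeds. As a minimal sketch of this, the helper below (a hypothetical function, not part of the article) trains a random forest twice with the same seed on scikit-learn's bundled breast cancer data, used here as an assumed stand-in for the CSV, and compares the predictions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data set (assumption), since the article's CSV is not available here
X_demo, y_demo = load_breast_cancer(return_X_y=True)


def run_once(seed):
    """Split, train and predict with every source of randomness tied to `seed`."""
    X_tr, X_te, y_tr, _ = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_tr, y_tr)
    return model.predict(X_te)


first = run_once(seed=7)
second = run_once(seed=7)

# Identical seeds give identical predictions
print(np.array_equal(first, second))
```

Leaving random_state unset in either train_test_split or the classifier removes this guarantee, which is why accuracy figures can differ from one run of the article's code to the next.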

From the above table, we can conclude that the SVM model and the Logistic Regression model were the best-suited models for my project.
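The four models can also be compared in a single loop, which avoids repeating the fit, predict and tabulate steps. This is a sketch under stated assumptions rather than the article's procedure: it uses scikit-learn's bundled breast cancer data as a stand-in for the CSV, fixes the seeds at 0, and raises max_iter for logistic regression; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data set (assumption) with a fixed split for repeatable numbers
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Scale with statistics from the training part only
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Classifier": SVC(),
}

# Train every model on the same split and collect its test accuracy
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, model.predict(X_te))
    rows.append({"Algorithm": name, "Accuracy": acc})

comparison = pd.DataFrame(rows).sort_values(
    "Accuracy", ascending=False, ignore_index=True
)
print(comparison)
```

Because every model sees exactly the same training and test rows, differences in the table reflect the algorithms rather than the luck of the split.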