With this article, be ready to get your hands dirty with ML algorithms, concepts, maths and coding.
Libraries play a very important role when working with ML code in Python. We will study them in detail later, but first let's see a very brief description of the most important ones :
- NumPy (Numerical Python) : It is one of the most widely used scientific and mathematical computing libraries for Python. Platforms like Keras and TensorFlow offer NumPy-style operations on tensors. The feature we are concerned with is how powerful and easy it makes handling arrays and performing operations on them.
- Pandas : This package is very useful when it comes to handling data. It makes it much easier to manipulate, aggregate and visualize data.
- Matplotlib : This library makes it easy to create simple yet powerful visualizations.
There are many more libraries, but we do not need them right now. So, let’s begin.
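As a quick taste of the first two libraries, here is a minimal sketch. The variable names are made up for illustration, and the few rows of data are borrowed from the dataset used later in this article.

```python
import numpy as np
import pandas as pd

# NumPy: arithmetic applies element-wise to the whole array, no explicit loop.
ages = np.array([44.0, 27.0, 30.0, 38.0])
ages_in_months = ages * 12
mean_age = ages.mean()

# pandas: labeled tabular data with built-in aggregation.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})
mean_salary_by_country = df.groupby("Country")["Salary"].mean()

print(ages_in_months)
print(mean_age)
print(mean_salary_by_country)
```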
Download the dataset :
Go to the link and download Data_for_Missing_Values.csv.
I would suggest installing Anaconda on your system, because it comes with all the basic Python libraries pre-installed. Once it is installed, launch Spyder or Jupyter.
Below is the Python code :
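What follows is a minimal sketch of such a script, not a verbatim listing. It makes two assumptions: the CSV rows are embedded inline (copied from the printed output further down) so that the example runs without the downloaded file, and sklearn.impute.SimpleImputer is used as a stand-in for the Imputer class discussed in the explanation, because Imputer is no longer shipped with recent scikit-learn releases.

```python
import io

import numpy as np   # imported as in the walkthrough; not used directly in this sketch
import pandas as pd
# import matplotlib.pyplot as plt  # also imported in the walkthrough; nothing is plotted here
from sklearn.impute import SimpleImputer  # stand-in for the older Imputer class

# PART 2 - importing data.
# Assumption: with the real file you would pass its path instead,
# e.g. pd.read_csv("Data_for_Missing_Values.csv").
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
data_sets = pd.read_csv(io.StringIO(csv_text))

print("Data Head : \n", data_sets.head())
print("\nData Describe : \n", data_sets.describe())

# PART 3 - input and output data.
X = data_sets.iloc[:, :-1].to_numpy()   # Country, Age, Salary
Y = data_sets.iloc[:, 3].to_numpy()     # Purchased
print("\nInput : \n", X)
print("\nOutput: \n", Y)

# PART 4 - handling the missing values.
# Only the numeric columns (Age, Salary) are imputed; each NaN is replaced
# by the mean of its column.
imputer = SimpleImputer(strategy="mean")
X[:, 1:3] = imputer.fit_transform(X[:, 1:3].astype(float))
print("\nNew Input with Mean Value for NaN : \n", X)
```

With the downloaded file, only the read_csv call changes; the rest of the script stays the same.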
Data Head :
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes

Data Describe :
             Age        Salary
count   9.000000      9.000000
mean   38.777778  63777.777778
std     7.693793  12265.579662
min    27.000000  48000.000000
25%    35.000000  54000.000000
50%    38.000000  61000.000000
75%    44.000000  72000.000000
max    50.000000  83000.000000

Input :
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

Output:
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']

New Input with Mean Value for NaN :
[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
CODE EXPLANATION :
- PART 1 – Importing Libraries : The above code imports numpy, pandas and matplotlib, but only pandas is actually used.
- PART 2 – Importing Data :
Import Data_for_Missing_Values.csv by giving its path to the pandas read_csv function. Now “data_sets” is a DataFrame (a two-dimensional tabular data structure with labeled rows and columns).
- Then print the first 5 entries of the DataFrame using the head() function. The number of entries can be changed; e.g. for the first 3 rows we can use dataframe.head(3). Similarly, the last rows can be obtained using the tail() function.
- Then use the describe() function. It gives a statistical summary of the data, which includes the count, min, max, percentiles (25%, 50%, 75%), mean and standard deviation of each numeric column.
- PART 3 – Input and Output Data : We split our DataFrame into input and output. The input holds the Country, Age and Salary columns, and the output holds the Purchased column.
- PART 4 – Handling the missing values : We use the Imputer class from the sklearn.preprocessing package.
Imputer(missing_values='NaN', strategy='mean', axis=0, verbose=0, copy=True) is a class from the sklearn.preprocessing package. Its role is to transform missing values (NaN) into a value computed by the chosen strategy. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; newer versions provide sklearn.impute.SimpleImputer instead.
Syntax : sklearn.preprocessing.Imputer()
Parameters :
-> missing_values : integer or “NaN”
-> strategy : what to impute - mean, median or most_frequent along the axis
-> axis (default=0) : 0 means along columns and 1 means along rows
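To see how the strategy parameter changes the result, here is a minimal sketch that fills the same Age and Salary columns with each strategy. It assumes a recent scikit-learn, where Imputer is no longer available (it was removed in 0.22), so sklearn.impute.SimpleImputer is swapped in. SimpleImputer takes no axis argument and always imputes column by column, which corresponds to axis=0 above. The names numeric and filled are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from the example data; np.nan marks the missing entries.
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    # Each column's missing entries are replaced by that column's statistic.
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled[strategy] = imputer.fit_transform(numeric)
    print(strategy,
          "-> Age:", filled[strategy][6, 0],
          "Salary:", filled[strategy][4, 1])
```

The mean is sensitive to outliers, so median is often the safer choice for skewed columns such as salaries, while most_frequent is mainly useful for categorical or repeated values.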