
Gaussian Naive Bayes

Last Updated : 13 Nov, 2023

In the vast field of machine learning, classification algorithms play a pivotal role in making sense of data. One such algorithm, Gaussian Naive Bayes, stands out for its simplicity, efficiency, and effectiveness. In this article, we will delve into the principles behind Gaussian Naive Bayes, explore its applications, and understand why it is a popular choice for various tasks.

Gaussian Naive Bayes

Gaussian Naive Bayes is a variant of the Naive Bayes method for continuous attributes, built on the assumption that each feature follows a Gaussian (normal) distribution within each class. In scikit-learn terminology, Gaussian Naive Bayes is the classification algorithm (GaussianNB) that applies the Naive Bayes framework to continuous, normally distributed features. Before diving deep into this topic we must gain a basic understanding of the principles on which Gaussian Naive Bayes works. Here are some terms that will ease our further study:

Naive Bayes Classifier:

The Naive Bayes Classifier is based on a simple result from probability theory called Bayes' theorem. Despite its simplicity, this classification algorithm does remarkably well at predicting the class that a given set of features belongs to. The 'naive' in its name refers to the assumption the algorithm makes when predicting a label: it treats all the features as independent of each other. This is rarely true in real-world data, yet the algorithm still works well in practice. During training, a Naive Bayes classifier estimates two quantities from the dataset it fits:

  • Prior Probability (P(y)): The probability of class y occurring before any features are observed. The algorithm estimates it by dividing the number of training instances of class y by the total number of instances. For example, in the 150-sample Iris dataset each of the three species occurs 50 times, so every prior is 50/150 ≈ 0.33.
  • Class Conditional Probability (P(x_i|y)): The probability of observing the feature value x_i given that the class is y. This probability is estimated for each feature and each class.

Now, as to how this algorithm actually makes a prediction, i.e., what happens behind the scenes: Naive Bayes computes the probability of each class given the observed features and assigns the label of the class with the highest probability.

  • Posterior Probability (P(y|x)): The probability the algorithm uses to score each class. Here P(y|x) is the probability of class y occurring given the feature vector x. It is calculated with the help of Bayes' theorem: P(y|x) = (P(x|y) · P(y)) / P(x). We will discuss Bayes' theorem in a moment.
  • P(x|y): Under the independence assumption, this is the product of the conditional probabilities of the individual features x_i given class y: P(x_1|y) · P(x_2|y) · P(x_3|y) · … and so on.
  • P(x): The probability of observing the feature vector x. It acts as a normalizing constant that makes the posterior probabilities over all classes sum to 1; since it is the same for every class, it can be ignored when comparing classes.

The algorithm then compares the posterior probabilities of all the classes present in the data and assigns the feature set to the class with the highest probability.
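
Since P(x) is the same for every class, it can be dropped from the comparison, and with the independence assumption the prediction reduces to the maximum a posteriori (MAP) rule:

\hat{y} = \arg\max_{y} P(y)\prod_{i=1}^{n} P(x_i|y)

In practice, implementations sum log-probabilities rather than multiplying raw probabilities, because the product of many small numbers quickly underflows.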

Bayes Theorem:

Bayes' theorem is a way to update probabilities based on new information. The theorem is given as:

\begin{aligned} P(A|B)& = \frac{P(A \cap B)}{P(B)} \\ &= \frac{P(B|A)\cdot P(A) }{P(B)} \end{aligned}

Where P(A|B) is the posterior probability and states the probability of A occurring given that B has happened, P(A) is the prior probability, P(B) is the probability of event B occurring, and P(B|A) is the probability of B occurring given that A has already happened.
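
As a quick worked example (with made-up numbers): suppose a disease affects 1% of a population, a test detects it 90% of the time, and it falsely flags 5% of healthy people. Then

\begin{aligned} P(\text{disease}|\text{positive}) &= \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.05 \times 0.99} \\ &= \frac{0.009}{0.0585} \approx 0.15 \end{aligned}

so even after a positive test, the posterior probability of having the disease is only about 15%, because the prior was so low.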

Gaussian Naive Bayes:

Gaussian Naive Bayes is the application of Naive Bayes to normally distributed data. Gaussian Naive Bayes assumes that the likelihood P(x_i|y) of each feature x_i follows a Gaussian distribution within each class y. Therefore,

P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} e^{-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}}

where \mu_y and \sigma_y^2 are the mean and variance of feature x_i estimated from the training samples of class y.

To classify a new data point x, the algorithm computes the posterior probability of each class and assigns the data point to the class with the maximum value.
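
As a minimal from-scratch sketch of this procedure (the function names below are illustrative, not from any library), training just estimates a prior plus per-feature means and variances for each class, and prediction scores each class in log space:

Python3

import numpy as np

def fit_gnb(X, y):
    # Estimate P(y) and the per-class, per-feature Gaussian parameters
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    variances = {c: X[y == c].var(axis=0) for c in classes}
    return priors, means, variances

def predict_gnb(x, priors, means, variances):
    # Score each class by log P(y) + sum_i log P(x_i|y); log space avoids
    # the numerical underflow of multiplying many tiny probabilities
    def log_gaussian(x, mu, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], variances[c]).sum()
              for c in priors}
    return max(scores, key=scores.get)

Scikit-learn's GaussianNB implements the same idea, additionally adding a small var_smoothing term to the variances for numerical stability.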

Real-life example with Gaussian Naive Bayes:

Here we will be applying Gaussian Naive Bayes to the Iris dataset. This dataset consists of four features, namely sepal length, sepal width, petal length, and petal width (all in cm), and from these features we have to identify which species each sample belongs to. The Iris flower dataset is widely available, for example from the UCI Machine Learning Repository or Kaggle.

Now we will use Gaussian Naive Bayes to predict the correct species of an Iris flower.

Let's build the code step by step:

  • First we will be importing the required libraries: pandas for data manipulation, train_test_split to split the data into training and testing sets, GaussianNB for the Gaussian Naive Bayes classifier, accuracy_score to evaluate the model, and LabelEncoder to encode the categorical target variable.

Python3

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

  • After that we will load the Iris dataset from a CSV file named “Iris.csv” into a pandas DataFrame.
  • Then we will separate the features (X) and the target variable (y) from the dataset. Features are obtained by dropping the “Species” column, and the target variable is set to the “Species” column which we will be predicting.

Python3

# Load the Iris dataset
data = pd.read_csv("Iris.csv")

# Select features and target
# Note: some copies of Iris.csv include a sequential "Id" column; since the rows
# are ordered by species, leaving it in X lets the model "cheat", so consider
# dropping it too, e.g. data.drop(columns=["Id", "Species"], errors="ignore")
X = data.drop("Species", axis=1)
y = data['Species']

  • Since the target variable "Species" is categorical, we will be using LabelEncoder to convert it into numerical form. (Scikit-learn's GaussianNB can actually handle string class labels directly, but encoding keeps the target numeric and makes later metric computations straightforward.)
  • We will be splitting the dataset into training and testing sets using the train_test_split function. 70% of the data is used for training, and 30% is used for testing. The random_state parameter makes the split reproducible across runs.

Python3

# Encoding the Species column to get numerical class
le = LabelEncoder()
y = le.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

  • We will be creating a Gaussian Naive Bayes classifier (gnb) and then training it on the training data using the fit method.

Python3

# Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the classifier on the training data
gnb.fit(X_train, y_train)

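After fitting, the learned parameters can be inspected directly: scikit-learn exposes the per-class priors, feature means, and feature variances as attributes. (Note: the variances are stored in var_ from scikit-learn 1.0 onwards; older releases call it sigma_.)

Python3

# Inspect the one Gaussian fitted per (class, feature) pair
print("Class priors:", gnb.class_prior_)
print("Per-class feature means:\n", gnb.theta_)
print("Per-class feature variances:\n", gnb.var_)  # gnb.sigma_ on scikit-learn < 1.0
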
  • At last we will be using the trained model to make predictions on the testing data and computing its accuracy with accuracy_score.

Python3

# Make predictions on the testing data
y_pred = gnb.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"The Accuracy of Prediction on Iris Flower is: {accuracy}")


Output:

The Accuracy of Prediction on Iris Flower is: 1.0

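Since the target was label-encoded, predict returns integer class codes. To map the predictions back to the original species names, invert the encoding with the same LabelEncoder:

Python3

# Map the encoded predictions back to species names
predicted_species = le.inverse_transform(y_pred)
print(predicted_species[:5])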
