Classification of Text Documents using Naive Bayes

Last Updated : 02 Jan, 2024

In natural language processing and machine learning, Naive Bayes is a popular and effective method for classifying text documents. It assigns documents to predefined classes based on the probabilities of the words they contain, applying Bayes' theorem. This article implements document classification with Naive Bayes in Python.

Text Classification using Naive Bayes

Naive Bayes is a probabilistic classification technique whose probability models rest on strong, deliberately simplistic independence assumptions between features. These assumptions are the algorithm's foundation, and the fact that they rarely hold exactly in practice is what earns it the label “naive”.

The algorithm is built on Bayes' theorem, named after Thomas Bayes, which underlies the construction of its probability models. In supervised learning, these models are estimated from labeled training data.

Naive Bayes Algorithm

The Naive Bayes algorithm is a probabilistic classification method whose predictions rest on Bayes' theorem, which gives the probability of a hypothesis in light of observed evidence. In Naive Bayes, an instance's features serve as the evidence, while the class the instance belongs to serves as the hypothesis.

The algorithm's use of Bayes' theorem breaks down as follows:

Bayes Theorem

P(C|F) = \frac{P(F|C) \cdot P(C)}{P(F)}

  • P(C|F): Probability of the instance belonging to a specific class given its features.
  • P(F|C): Probability of observing the features given the class.
  • P(C): Prior probability of the class.
  • P(F): Probability of observing the features.

The assumption of feature independence is what gives Naive Bayes its “naive” quality. It also keeps the algorithm computationally efficient, because the joint probability of the features factorizes into a product of per-feature probabilities:

P(F|C) = P(F_1|C) \cdot P(F_2|C) \cdots P(F_n|C)

By combining observed evidence (the features) with prior information (the class priors) through Bayes' theorem, and assuming feature independence, Naive Bayes produces its predictions. Despite its simplicity, it is effective in a wide variety of classification tasks, particularly in text classification and natural language processing. A toy sketch of this factorized scoring is shown below.
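
The sketch below is a minimal, self-contained illustration of the factorized score for a single class; the prior and per-word probabilities are made-up placeholder values, not taken from the worked example later in this article.

Python3

# Minimal sketch of the factorized Naive Bayes score for one class.
# The prior and per-word probabilities below are illustrative placeholders.
def class_score(words, prior, word_probs):
    """Multiply the class prior by P(word | class) for every word."""
    score = prior
    for w in words:
        score *= word_probs.get(w, 1e-6)  # tiny fallback for unseen words
    return score

# Hypothetical conditional probabilities for a '+' class
word_probs_plus = {"good": 0.20, "movie": 0.15, "boring": 0.01}
print(class_score(["good", "movie"], prior=0.5, word_probs=word_probs_plus))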

When to use Naive Bayes

There are several scenarios in which Naive Bayes can be applied with great effectiveness, including the following:

  • Text Classification: Naive Bayes excels in text-based tasks such as spam filtering, sentiment analysis, and document categorization due to its simplicity and efficiency with high-dimensional data.
  • Limited Training Data: Naive Bayes can perform well with limited training data, making it valuable when dealing with small datasets or situations where collecting extensive labeled data is challenging.
  • Simple and Quick Prototyping: When a quick and simple solution is needed for prototyping or baseline performance, Naive Bayes is a suitable choice due to its ease of implementation (see the library-based sketch after this list).
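
For comparison with the from-scratch implementation below, the same idea is also available off the shelf. The following is a minimal sketch using scikit-learn's CountVectorizer and MultinomialNB (this assumes scikit-learn is installed; it is not used elsewhere in this article). Note that CountVectorizer strips punctuation during tokenization, so its probabilities will not match the manual whitespace-split computation exactly.

Python3

# Minimal sketch: multinomial Naive Bayes via scikit-learn (assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The same three example documents used later in this article
docs = ["I watched the movie.", "I hated the movie.", "poor acting."]
labels = ["+", "-", "+"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # bag-of-words count matrix
clf = MultinomialNB(alpha=1.0)       # alpha=1.0 corresponds to Laplace smoothing
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["I hated the acting."])))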

Implementation to classify text documents using Naive Bayes

Importing Libraries

Python3

#importing libraries
import prettytable

                    

The snippet imports the prettytable library, which is used later in this article to display the per-document word counts as a neatly formatted table. It is a common choice for presenting structured data as an ASCII table in Python.
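
If the library is missing, it can be installed with pip install prettytable. The snippet below is a minimal sketch of how it is used; the column names and row values are placeholders unrelated to the worked example.

Python3

# Minimal sketch of prettytable usage (placeholder data).
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Word", "Count"]   # column headers
table.add_row(["movie", 2])             # one data row
print(table)                            # renders an ASCII-formatted table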

Classification using Naive Bayes

Python3

print('\n *-----* Classification using Naïve bayes *-----* \n')
# Collect the training documents and their class labels
total_documents = int(input("Enter the Total Number of documents: "))
doc_class = []   # [tokenized text, class label] for each document
keywords = []    # every token seen across the training documents
for i in range(total_documents):
    text = input(f"\nEnter the text of Doc-{i+1} : ").lower()
    clas = input(f"Enter the class of Doc-{i+1} : ")
    doc_class.append([text.split(), clas])
    keywords.extend(text.split())

# Sorted vocabulary of unique tokens
keywords = sorted(set(keywords))

# Text to classify
to_find = input(
    "\nEnter the Text to classify using Naive Bayes: ").lower().split()

# Count how many times each keyword occurs in each document
probability_table = [[0] * len(keywords) for _ in range(total_documents)]
for i in range(total_documents):
    for k in range(len(keywords)):
        if keywords[k] in doc_class[i][0]:
            probability_table[i][k] += doc_class[i][0].count(keywords[k])
print('\n')

                    

Output:

 *-----* Classification using Naïve bayes *-----* 
Enter the Total Number of documents: 3
Enter the text of Doc-1 : I watched the movie.
Enter the class of Doc-1 : +
Enter the text of Doc-2 : I hated the movie.
Enter the class of Doc-2 : -
Enter the text of Doc-3 : poor acting.
Enter the class of Doc-3 : +
Enter the Text to classify using Naive Bayes: I hated the acting.


This code sets up a basic Naive Bayes text classification. The user is asked for the total number of documents and then enters the text and class label of each one. The unique tokens (keywords) that appear across all documents are collected, and a count table is built recording how many times each keyword occurs in each document. Note that the text is only lowercased and split on whitespace, so punctuation stays attached to tokens; “movie.” and “acting.” are therefore treated as single words. Finally, the user enters the text to classify; the following steps use these counts to score that text against each class.

Probability of Documents

Python3

import prettytable

# Display the per-document word counts as a formatted table
keywords.insert(0, 'Document ID')
keywords.append("Class")
Prob_Table = prettytable.PrettyTable()
Prob_Table.field_names = keywords
Prob_Table.title = 'Probability of Documents'
x = 0
for i in probability_table:
    i.insert(0, x + 1)            # prepend the document ID
    i.append(doc_class[x][1])     # append the class label
    Prob_Table.add_row(i)
    x = x + 1
print(Prob_Table)
print('\n')

# Remove the document IDs again; the class label stays as the last element
for i in probability_table:
    i.pop(0)

# Per-class totals used later for the smoothed probabilities
totalpluswords = 0   # total word occurrences in '+' documents
totalnegwords = 0    # total word occurrences in '-' documents
totalplus = 0        # number of '+' documents
totalneg = 0         # number of '-' documents
vocabulary = len(keywords) - 2   # exclude 'Document ID' and 'Class'
for i in probability_table:
    if i[len(i) - 1] == "+":
        totalplus += 1
        totalpluswords += sum(i[0:len(i) - 1])
    else:
        totalneg += 1
        totalnegwords += sum(i[0:len(i) - 1])

# Restore keywords to the plain vocabulary (drop 'Document ID' and 'Class')
keywords.pop(0)
keywords.pop(len(keywords) - 1)

                    

Output:

+---------------------------------------------------------------------------+
|                          Probability of Documents                          |
+-------------+---------+-------+---+--------+------+-----+---------+-------+
| Document ID | acting. | hated | i | movie. | poor | the | watched | Class |
+-------------+---------+-------+---+--------+------+-----+---------+-------+
|      1      |    0    |   0   | 1 |   1    |  0   |  1  |    1    |   +   |
|      2      |    0    |   1   | 1 |   1    |  0   |  1  |    0    |   -   |
|      3      |    1    |   0   | 0 |   0    |  1   |  0  |    0    |   +   |
+-------------+---------+-------+---+--------+------+-----+---------+-------+


This code uses the prettytable package to build and display the count table. “Document ID” is inserted at the front of the keyword list and “Class” is appended at the end; these become the table's field names, and a title is set. Each row of probability_table is then extended with its document ID and class label and added to the table. After printing the table, the code removes the document IDs again and tallies, for each class (“+” and “-”), the number of documents and the total number of word occurrences; it also records the vocabulary size, i.e. the number of distinct tokens excluding the two helper columns. For this example that gives totalplus = 2, totalneg = 1, totalpluswords = 6, totalnegwords = 4 and vocabulary = 7. Finally, “Document ID” and “Class” are removed so the keyword list again contains only the vocabulary.

Positive Class

Python3

# For positive class
temp = []
for i in to_find:
    # Total occurrences of this word across all '+' documents
    count = 0
    if i in keywords:          # a word absent from the vocabulary keeps count 0
        x = keywords.index(i)
        for j in probability_table:
            if j[len(j)-1] == "+":
                count = count + j[x]
    temp.append(count)
# Laplace-smoothed conditional probability P(word | '+')
for i in range(len(temp)):
    temp[i] = format((temp[i]+1)/(vocabulary+totalpluswords), ".4f")
print()
temp = [float(f) for f in temp]
print("Probabilities of Each word to be in '+' class are: ")
h = 0
for i in to_find:
    print(f"P({i}/+) = {temp[h]}")
    h = h+1
print()
# Multiply the class prior P('+') by each word's conditional probability
pplus = float(format((totalplus)/(totalplus+totalneg), ".8f"))
for i in temp:
    pplus = pplus*i
pplus = format(pplus, ".8f")
print("probability of Given text to be in '+' class is :", pplus)
print()

                    

Output:

Probabilities of Each word to be in '+' class are: 
P(i/+) = 0.1538
P(hated/+) = 0.0769
P(the/+) = 0.1538
P(acting./+) = 0.1538
probability of Given text to be in '+' class is : 0.00018651

For the input text, this code computes the probability of each word under the positive class (“+”). Iterating over every word in to_find, it counts the word's occurrences across the “+” documents using the count table and converts the counts to conditional probabilities with Laplace (add-one) smoothing. The per-word probabilities are printed, and the class prior is then multiplied by all of them to give the overall score of the text for the positive class, which is printed as well. Laplace smoothing guarantees a non-zero probability even for words that never appear in the “+” documents. The smoothed formula is written out below.
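
Written out, the smoothed estimate used in the code above is

P(w|+) = \frac{count(w, +) + 1}{totalpluswords + vocabulary}

so for this example, with 6 word occurrences in the two “+” documents and a vocabulary of 7 distinct tokens, P(hated|+) = \frac{0 + 1}{6 + 7} \approx 0.0769, matching the printed output.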

Negative class

Python3

# For Negative class
temp = []
for i in to_find:
    # Total occurrences of this word across all '-' documents
    count = 0
    if i in keywords:          # a word absent from the vocabulary keeps count 0
        x = keywords.index(i)
        for j in probability_table:
            if j[len(j)-1] == "-":
                count = count + j[x]
    temp.append(count)
# Laplace-smoothed conditional probability P(word | '-')
for i in range(len(temp)):
    temp[i] = format((temp[i]+1)/(vocabulary+totalnegwords), ".4f")
print()
temp = [float(f) for f in temp]
print("Probabilities of Each word to be in '-' class are: ")
h = 0
for i in to_find:
    print(f"P({i}/-) = {temp[h]}")
    h = h+1
print()
# Multiply the class prior P('-') by each word's conditional probability
pneg = float(format((totalneg)/(totalplus+totalneg), ".8f"))
for i in temp:
    pneg = pneg*i
pneg = format(pneg, ".8f")
print("probability of Given text to be in '-' class is :", pneg)
print('\n')

                    

Output:

Probabilities of Each word to be in '-' class are: 
P(i/-) = 0.1818
P(hated/-) = 0.1818
P(the/-) = 0.1818
P(acting./-) = 0.0909
probability of Given text to be in '-' class is : 0.00018206

The same computation is repeated for the negative class (“-”): each word's occurrences across the “-” documents are counted from the table, converted to Laplace-smoothed conditional probabilities, and printed. These per-word probabilities are then multiplied together with the class prior to give the overall score of the input text for the negative class, which is printed as well. In both the positive and negative computations, Laplace smoothing guarantees non-zero probabilities for unseen words.

Prediction

Python3

# Predict the class with the larger unnormalized score
if pplus > pneg:
    print(
        f"Using Naive Bayes Classification, We can clearly say that the given text belongs to '+' class with probability {pplus}")
else:
    print(
        f"Using Naive Bayes Classification, We can clearly say that the given text belongs to '-' class with probability {pneg}")
print('\n')

                    

Output:


Probabilities of Each word to be in '+' class are:
P(i/+) = 0.1538
P(hated/+) = 0.0769
P(the/+) = 0.1538
P(acting./+) = 0.1538
probability of Given text to be in '+' class is : 0.00018651
Probabilities of Each word to be in '-' class are:
P(i/-) = 0.1818
P(hated/-) = 0.1818
P(the/-) = 0.1818
P(acting./-) = 0.0909
probability of Given text to be in '-' class is : 0.00018206
Using Naive Bayes Classification, We can clearly say that the given text belongs to '+' class with probability 0.00018651


This code makes the final decision from the scores computed for the positive and negative classes. If the score of the text under the positive class (pplus) exceeds its score under the negative class (pneg), it prints a positive-class prediction along with that score; otherwise it prints a negative-class prediction with the corresponding score.
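
Note that pplus and pneg are not true posterior probabilities: the shared denominator P(F) from Bayes' theorem is dropped because it is the same for both classes and does not affect the comparison. If an actual posterior is wanted, the two scores can be normalized; for this example,

P(+|text) = \frac{pplus}{pplus + pneg} = \frac{0.00018651}{0.00018651 + 0.00018206} \approx 0.506

so the classifier only slightly favors the “+” class here.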

Frequently Asked Questions (FAQs)

1. What is Naive Bayes classification in the context of text documents?

Text documents are categorized into predetermined classes using the probabilistic Naive Bayes classification method. It is especially useful for text-based applications like sentiment analysis, spam detection, and document classification since it applies Bayes’ theorem under the naive assumption of feature independence.

2. How does Naive Bayes handle the issue of feature independence in text classification?

Because Naive Bayes assumes feature independence, every feature (word) is treated as independent of all other features given the class label. Even with this simplification, Naive Bayes is frequently effective in text classification, particularly when the features (words) are approximately conditionally independent given the class.

3. Can Naive Bayes be used for real-time text classification?

Yes, because of its computational efficiency, Naive Bayes is a good choice for real-time text classification. It processes and classifies new text instances quickly, making it appropriate for applications that require fast decisions.

4. Is Naive Bayes suitable for large datasets of text documents?

Yes, Naive Bayes is effective and scales well to large datasets of text documents. Its speed and simplicity make it a good option for handling large volumes of textual data.

5. How does Naive Bayes handle the presence of irrelevant words in text documents?

Naive Bayes is sensitive to irrelevant words. Even though it often works well in noisy settings, too many superfluous features can reduce its accuracy. Feature-selection or preprocessing techniques, such as removing punctuation and stop words, can lessen this sensitivity; a sketch follows below.
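
The following is a minimal sketch of such preprocessing; the stop-word list is a small illustrative placeholder rather than a standard one.

Python3

import string

# Illustrative stop-word list; real pipelines use a larger, standard list.
stop_words = {"i", "the", "a", "an", "is", "of"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words before counting."""
    words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return [w for w in words if w not in stop_words]

print(preprocess("I hated the acting."))   # ['hated', 'acting']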


