
Rule Based Approach in NLP

Last Updated : 17 Apr, 2023

Natural Language Processing (NLP) serves as a bridge between human language and computers. It is a subfield of Artificial Intelligence that helps machines process, understand, and generate natural language. Common NLP tasks include text and speech processing, language translation, and sentiment analysis; use cases include spam detection, chatbots, and text summarization.

There are three types of NLP approaches:

  1. Rule-based Approach – Based on linguistic rules and patterns
  2. Machine Learning Approach – Based on statistical analysis
  3. Neural Network Approach – Based on neural network architectures such as feed-forward, recurrent, and convolutional networks

Rule-based approach in NLP

The rule-based approach is one of the oldest NLP methods: predefined linguistic rules are used to analyze and process textual data. A particular set of rules or patterns is applied to capture specific structures, extract information, or perform tasks such as text classification. Common rule-based techniques include regular expressions and pattern matching.
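
As a minimal illustration of such rules, a toy spam filter can be built from regular expressions alone (the keyword patterns below are illustrative, not a real spam lexicon):

```python
import re

# Illustrative keyword rules; a real filter would use a much larger set.
SPAM_RULES = [r"\bfree\b", r"\bwinner\b", r"\bclick here\b"]

def is_spam(message: str) -> bool:
    # A message is flagged if any rule pattern matches, case-insensitively.
    return any(re.search(p, message, re.IGNORECASE) for p in SPAM_RULES)

print(is_spam("Click here to claim your FREE prize!"))  # True
print(is_spam("Meeting moved to 3 pm."))                # False
```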

Steps in Rule-based approach in NLP:

  1. Rule Creation: Based on the desired task, domain-specific linguistic rules are created, such as grammar rules, syntax patterns, semantic rules, or regular expressions.
  2. Rule Application: The predefined rules are applied to the input text to capture matching patterns.
  3. Rule Processing: The text is processed according to the matched rules to extract information, make decisions, or perform other tasks.
  4. Rule Refinement: The rules are iteratively refined to improve accuracy and performance; based on feedback, they are modified and updated as needed.
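
The four steps above can be sketched with plain regular expressions; the sentiment cue words and labels here are hypothetical:

```python
import re

# Step 1 - Rule Creation: hypothetical sentiment cue patterns.
rules = {"positive": [r"\bgood\b", r"\bgreat\b"],
         "negative": [r"\bbad\b", r"\bawful\b"]}

def classify(text: str) -> str:
    # Step 2 - Rule Application: count which patterns match the input.
    hits = {label: sum(bool(re.search(p, text, re.IGNORECASE)) for p in pats)
            for label, pats in rules.items()}
    # Step 3 - Rule Processing: turn the matches into a decision.
    return max(hits, key=hits.get) if any(hits.values()) else "neutral"

# Step 4 - Rule Refinement: inspect errors and extend the rule set,
# e.g. rules["negative"].append(r"\bterrible\b")
print(classify("The movie was great"))  # positive
print(classify("The food was awful"))   # negative
```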


Libraries that support a rule-based approach include spaCy (well suited for production), fast.ai, and NLTK (less commonly used in production nowadays).
In this article, we'll work with the spaCy library to demonstrate the rule-based approach. spaCy is an open-source software library designed for advanced Natural Language Processing (NLP) tasks. It is built in Python and provides a wide range of functionalities for processing and analyzing large volumes of text data.

spaCy's rule-matching engine, the Matcher, operates over tokens, entities, and phrases in a manner similar to regular expressions.

Spacy Installation:

# spaCy installation
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm  # English model

Example 1: Matching Token with Rule-based Approach

Step 1: The necessary modules are imported.

Python3

#import modules
import spacy
#import the Matcher class
from spacy.matcher import Matcher

Step 2: The English-language spaCy model is loaded. It is stored in a variable named 'nlp' so that it does not shadow the spacy module itself.

Python3

#The English model 'en_core_web_sm' is loaded
nlp = spacy.load("en_core_web_sm")

Step 3: The input text is processed into a Doc object and separated into tokens.

Python3

#The input text as a Document object
txt = "Natural Language Processing serves as an interrelationship between human language and computers. Natural Language Processing is a subfield of Artificial Intelligence that helps machines process, understand and generate natural language intuitively."
doc = nlp(txt)
tokens = [token for token in doc]

print('Tokens:', tokens)
print('Number of tokens:', len(tokens))

Output:

Tokens: [Natural, Language, Processing, serves, as, an, interrelationship, between, human,
language, and, computers, ., Natural, Language, Processing, is, a, subfield, of, Artificial,
Intelligence, that, helps, machines, process, ,, understand, and, generate, natural,
language, intuitively, .]
Number of tokens: 34

Step 4: The rule-based matching engine 'Matcher' is instantiated with the pipeline's shared vocabulary.

Python3

#Matcher class object instantiation
matcher = Matcher(nlp.vocab)

Step 5: The rules or patterns to be searched for are defined. Here the words 'language' and 'human' form two single-token patterns; the 'LOWER' attribute makes the match case-insensitive.

Python3

#patterns to be searched for
pattern = [[{'LOWER': 'language'}], [{'LOWER': 'human'}]]

Step 6: The patterns are added to the matcher object using the 'add' method, with the first parameter as the match ID and the second as the list of patterns.

Python3

#adding the patterns/rules to the matcher object
matcher.add("TokenMatch", pattern)

Step 7: The matcher object is called on the 'doc' object to find the patterns. The result is stored in the 'matches' variable.

Python3

#Matcher object called
#returns (match_id, start, end) tuples for the matched spans
matches = matcher(doc)

Step 8: The matched results are extracted and printed.

Python3

#Extracting matched results
for m_id, start, end in matches:
    string_id = nlp.vocab.strings[m_id]
    span = doc[start:end]
    print('match_id:{}, string_id:{}, Start:{}, End:{}, Text:{}'.format(
        m_id, string_id, start, end, span.text))


Output:

match_id:9580390278045680890, string_id:TokenMatch, Start:1, End:2, Text:Language
match_id:9580390278045680890, string_id:TokenMatch, Start:8, End:9, Text:human
match_id:9580390278045680890, string_id:TokenMatch, Start:9, End:10, Text:language
match_id:9580390278045680890, string_id:TokenMatch, Start:14, End:15, Text:Language
match_id:9580390278045680890, string_id:TokenMatch, Start:31, End:32, Text:language
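
Single-token rules like the ones above extend naturally to multi-token patterns, where each dictionary in the list describes one token. The following sketch matches the two-word phrase "natural language" case-insensitively; a blank English pipeline is enough here, since token-level attributes such as LOWER need no trained statistical model:

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline provides the tokenizer; no model download is needed.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dictionary per token: "natural" immediately followed by "language".
pattern = [{"LOWER": "natural"}, {"LOWER": "language"}]
matcher.add("BigramMatch", [pattern])

doc = nlp("Natural Language Processing helps machines use natural language.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Natural Language / natural language
```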

Example 2: Matching Phrases with the Rule-based Approach

Step 1: The PhraseMatcher class is imported from spaCy.

Python3

# import necessary modules
import spacy
from spacy.matcher import PhraseMatcher

Step 2: The English-language spaCy model is loaded into a variable named 'nlp'.

Python3

#The English model 'en_core_web_sm' is loaded
nlp = spacy.load('en_core_web_sm')

Step 3: The input text is processed into a 'doc' object.

Python3

#The input text as a Document object
txt = "Natural Language Processing serves as an interrelationship between human language and computers. Natural Language Processing is a subfield of Artificial Intelligence that helps machines process, understand and generate natural language intuitively."
doc = nlp(txt)
print(doc)

Output:

Natural Language Processing serves as an interrelationship between human language and computers.
 Natural Language Processing is a subfield of Artificial Intelligence that helps machines process,
  understand and generate natural language intuitively.

Step 4: The PhraseMatcher object is instantiated. Setting attr='LOWER' makes the phrase matching case-insensitive.

Python3

# PhraseMatcher object creation
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Step 5: The phrases are listed in term_list, and each phrase is converted into a Doc object using the 'make_doc' method to speed up the process.

Python3

# list of phrases
term_list = ["Language Processing", "human language"]
# phrases into document objects
patterns = [nlp.make_doc(t) for t in term_list]

Step 6: The created patterns are added to the matcher object.

Python3

# patterns added to the matcher object
matcher.add("Phrase Match", patterns)

Step 7: The matcher object is called on the input 'doc' with the parameter 'as_spans=True', which returns Span objects directly. The extracted results are printed.

Python3

# Matcher object called. It returns Span objects directly
matches = matcher(doc, as_spans=True)
#Extracting matched results
for span in matches:
    print(span.text, ":-", span.label_)


Output:

Language Processing :- Phrase Match
human language :- Phrase Match
Language Processing :- Phrase Match

Example 3: Named Entity Recognition with spaCy

Step 1: Import spacy and load the English-language spaCy model.

Python3

# import spacy
import spacy
#Load the English-language spaCy model
nlp = spacy.load("en_core_web_sm")

Step 2: Named Entity Recognition with spaCy.

Python3

#The input text as a Document object
txt = """
My name is Pawan Kumar Gunjan. I live in India.
India, officially the Republic of India, is a country in South Asia.
It is the seventh-largest country by area and the second-most populous country.
Bounded by the Indian Ocean on the south, the Arabian Sea on the southwest,
and the Bay of Bengal on the southeast, it shares land borders with Pakistan to the west;
China, Nepal, and Bhutan to the north; and Bangladesh and Myanmar to the east.
"""
doc = nlp(txt)
for entity in doc.ents:
    print('Text:{}, Label:{}'.format(entity.text, entity.label_))


Output:

Text:Pawan Kumar Gunjan, Label:PERSON
Text:India, Label:GPE
Text:India, Label:GPE
Text:the Republic of India, Label:GPE
Text:South Asia, Label:LOC
Text:seventh, Label:ORDINAL
Text:second, Label:ORDINAL
Text:the Indian Ocean, Label:LOC
Text:the Arabian Sea, Label:LOC
Text:the Bay of Bengal, Label:LOC
Text:Pakistan, Label:GPE
Text:China, Label:GPE
Text:Nepal, Label:GPE
Text:Bhutan, Label:GPE
Text:Bangladesh, Label:GPE
Text:Myanmar, Label:GPE
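
The entities above come from the statistical model, but named entities can also be recognized purely by rules using spaCy's EntityRuler component. A minimal sketch on a blank pipeline (no model download needed; the labels and patterns here are illustrative):

```python
import spacy

# Blank pipeline: entities come only from the rules we add.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Token-based and phrase-based patterns with their entity labels.
ruler.add_patterns([
    {"label": "PERSON", "pattern": [{"LOWER": "pawan"}, {"LOWER": "kumar"}, {"LOWER": "gunjan"}]},
    {"label": "GPE", "pattern": "India"},
])

doc = nlp("Pawan Kumar Gunjan lives in India.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```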

Advantages of the Rule-based approach:

  • Easily interpretable, since the rules are explicitly defined
  • Rule-based techniques can help semi-automatically annotate data in domains where no annotated data exists (for example, NER (Named Entity Recognition) tasks in a particular domain)
  • Functions even with scant or poor training data
  • Computation is fast, and it offers high precision
  • Many problems, such as tokenization, sentence breaking, or morphology, can often be solved deterministically through rules (at least in some languages)

Disadvantages of the Rule-based approach:

  • Labor-intensive as more rules are needed to generalize
  • Generating rules for complex tasks is time-consuming
  • Needs regular maintenance
  • May not handle variations and exceptions in language usage well
  • Recall is often low, because hand-written rules miss phrasings they were not designed for

Why combine the Rule-based Approach with Machine Learning and Neural Network Approaches?

  1. Rule-based NLP handles edge cases well when combined with other approaches.
  2. It helps to speed up data annotation. For instance, a rule-based technique can handle URL formats, date formats, etc., while a machine learning approach determines the position of text in a PDF file (including numerical data).
  3. In languages other than English, annotated data is scarce even for common tasks, which rule-based NLP can still carry out.
  4. Using a rule-based approach also improves the computational performance of the pipeline.
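
For instance, the URL and date formats mentioned in point 2 can be pre-annotated with simple regular expressions (each pattern below covers only one illustrative format):

```python
import re

# Illustrative patterns: one date format (dd/mm/yyyy) and http(s) URLs.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
URL_RE = re.compile(r"https?://\S+")

text = "Released on 17/04/2023, see https://spacy.io for documentation."
print(DATE_RE.findall(text))  # ['17/04/2023']
print(URL_RE.findall(text))   # ['https://spacy.io']
```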
     

