
Rule-Based Tokenization in NLP


Natural Language Processing (NLP) is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, which is the process of splitting text into smaller meaningful units, known as tokens. Rule-based tokenization is a common method used in NLP that segments text into tokens by applying a pre-defined set of rules.

Tokenization is the process of splitting text into individual tokens, usually words or sentences, which are separated from one another by whitespace, punctuation, or other specific rules. In rule-based tokenization, a set of rules is defined to determine how text is split into tokens. These rules can be based on various factors such as whitespace, punctuation, and context.

Rule-Based Tokenization:

Rule-based tokenization is a technique in which a set of rules is applied to the input text to split it into tokens. These rules can be based on different criteria, such as whitespace, punctuation, regular expressions, or language-specific conventions. A minimal generic sketch of the idea is shown below, followed by the common approaches.
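The sketch applies an ordered list of regular-expression rules to the text and emits whatever the rules capture as the token stream; the rule list and the function name are illustrative, not taken from any particular library.

Python3

import re

# Illustrative rule set: a word (optionally containing an apostrophe),
# or any single punctuation character
RULES = [
    r"\w+(?:'\w+)?",   # words such as "Don't"
    r"[^\w\s]",        # a single punctuation mark
]
TOKEN_RE = re.compile("|".join(RULES))

def rule_based_tokenize(text):
    """Return the tokens captured by the rules, in order of appearance."""
    return TOKEN_RE.findall(text)

print(rule_based_tokenize("Don't split what the rules keep together!"))

Output:

["Don't", 'split', 'what', 'the', 'rules', 'keep', 'together', '!']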

Whitespace tokenization

This approach splits the input text based on whitespace characters such as space, tab, or newline. 

For example, the sentence "This is a sample text."
would be split into the following tokens: "This", "is", "a", "sample", and "text."

The steps below outline rule-based tokenization in general, followed by Python code that demonstrates whitespace tokenization.

Steps for Rule-Based Tokenization:

  • Load the input text: The input text can be loaded from a file or entered by the user.
  • Define the tokenization rules: Based on the type of tokenization required, define the rules to split the input text into tokens. These rules can be based on whitespace, punctuation, regular expressions, or language-specific rules.
  • Apply the rules to the input text: Use the defined rules to split the input text into tokens.
  • Output the tokens: Output the tokens generated by the tokenization process.

Python3




# Step 1: Load the input text
text = "The quick brown fox jumps over the lazy dog."

# Step 2: Define the tokenization rule (split on whitespace)
# Step 3: Apply the rule to the input text
tokens = text.split()

# Step 4: Output the tokens
print(tokens)


Output:

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']
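str.split() with no argument already treats any run of spaces, tabs, and newlines as a single separator. The same whitespace rule can be written explicitly with a regular expression; the short sketch below uses only the standard library.

Python3

import re

# Input text containing spaces, a tab, and a newline
text = "The quick\tbrown fox\njumps over the lazy dog."

# Rule: split on one or more whitespace characters
tokens = re.split(r"\s+", text)
print(tokens)

Output:

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']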

Regular expression tokenization

This approach uses regular expressions to split the input text based on a pattern. It is mainly used to extract specific kinds of patterns from text, such as email addresses, phone numbers, order IDs, or currency amounts.

For example, in the sentence
"Hello, I am working at Geeks-for-Geeks and my email is pawan.gunjan123@geeksforgeeks.com."
the regular expression "[\w]+-[\w]+-[\w]+" will match "Geeks-for-Geeks", and "([\w\.-]+@[\w]+\.[\w]+)" will match the email address.

The following Python code demonstrates regular expression tokenization:

Python3




import re

# Load the input text
text = "Hello, I am working at Geeks-for-Geeks and my email is pawan.gunjan123@geeksforgeeks.com."

# Define the regular expression pattern:
# group 1 matches a hyphenated name, group 2 matches an email address
p = r'([\w]+-[\w]+-[\w]+)|([\w\.-]+@[\w]+\.[\w]+)'

# Find all matches
matches = re.findall(p, text)

# Print the output
for match in matches:
    if match[0]:
        print(f"Company Name: {match[0]}")
    else:
        print(f"Email address: {match[1]}")


Output:

Company Name: Geeks-for-Geeks
Email address: pawan.gunjan123@geeksforgeeks.com
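If NLTK is available, the same idea can be expressed with its RegexpTokenizer, which builds a tokenizer from a single regular-expression rule; the pattern below is an assumption chosen to mirror the example above, not NLTK's default.

Python3

from nltk.tokenize import RegexpTokenizer

text = "Hello, I am working at Geeks-for-Geeks and my email is pawan.gunjan123@geeksforgeeks.com."

# Rule: an email address, a hyphenated name, or a plain word
tokenizer = RegexpTokenizer(r"[\w.-]+@\w+\.\w+|\w+(?:-\w+)+|\w+")
print(tokenizer.tokenize(text))

Output:

['Hello', 'I', 'am', 'working', 'at', 'Geeks-for-Geeks', 'and', 'my', 'email', 'is', 'pawan.gunjan123@geeksforgeeks.com']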

Punctuation tokenization

This approach splits the input text based on punctuation characters such as periods, commas, or semicolons.

For example, the sentence "Hello Geeks! How can I help you?" 
would be split into the following tokens: 'Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you'

The following Python code demonstrates punctuation rule-based tokenization:

Python3




import re

# Load the input text
text = "Hello Geeks! How can I help you?"

# Define the rule: one or more non-word characters
# (punctuation or whitespace) act as the separator
pattern = r'\W+'

# Apply the rule: split on the pattern and drop any empty strings
# produced at the start or end of the text
tokens = [token for token in re.split(pattern, text) if token]

# Print the result
print(tokens)


Output:

['Hello', 'Geeks', 'How', 'can', 'I', 'help', 'you']
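A small variant of the same rule (standard library only) keeps the punctuation marks as separate tokens instead of discarding them, which is often what downstream NLP components expect:

Python3

import re

text = "Hello Geeks! How can I help you?"

# Rule: a run of word characters, or a single punctuation character
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)

Output:

['Hello', 'Geeks', '!', 'How', 'can', 'I', 'help', 'you', '?']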

Language-specific tokenization

This approach uses language-specific rules to split the input text into tokens. For example, in some languages words can be concatenated without spaces, as in German compound nouns, so language-specific rules are needed to split the text into meaningful tokens. The example below uses the iNLTK library to tokenize Sanskrit text.

Python3




# iNLTK provides language-specific tokenizers for several Indic languages
from inltk.inltk import setup
from inltk.inltk import tokenize

# Download the model files for Sanskrit ('sa'); this is only needed once
setup('sa')

text = "'ॐ भूर्भव: स्व: तत्सवितुर्वरेण्यं भर्गो देवस्य धीमहि धियो यो न: प्रचोदयात्।'"

# tokenize(input text, language code)
tokenize(text, "sa")


Output:

["▁'",
 'ॐ',
 '▁भू',
 'र्',
 'भव',
 ':',
 '▁स्व',
 ':',
 '▁तत्',
 'स',
 'वि',
 'तु',
 'र्',
 'वरेण्य',
 'ं',
 '▁भ',
 'र्ग',
 'ो',
 '▁देवस्य',
 '▁धीम',
 'हि',
 '▁',
 'धि',
 'यो',
 '▁यो',
 '▁न',
 ':',
 '▁प्र',
 'च',
 'ोदय',
 'ात्',
 "।'"]


Last Updated : 04 Jun, 2023