
Subword Tokenization in NLP

Last Updated : 31 Jul, 2023

Subword tokenization is a Natural Language Processing (NLP) technique in which a word is split into smaller units called subwords, and these subwords serve as the tokens. It is used in NLP tasks where a model would otherwise need to maintain a very large vocabulary and handle complex word structures. The idea is that frequently occurring words stay in the vocabulary as whole tokens, while rare words are split into frequent subwords. For example, the word "unwanted" might be split into "un", "want", and "ed", and the word "football" might be split into "foot" and "ball".
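For instance, a trained subword tokenizer typically splits a word by greedily matching the longest subwords it knows. The snippet below is a minimal sketch of that idea, using a hypothetical toy vocabulary chosen purely for illustration (a real tokenizer learns its vocabulary from a corpus, as shown later with Byte-Pair Encoding):

Python3

# a hypothetical toy subword vocabulary, chosen only to illustrate the idea
vocab = {"un", "want", "ed", "foot", "ball", "geeks", "for"}

def subword_tokenize(word, vocab):
    # greedy longest-match: repeatedly take the longest prefix
    # of the remaining text that is in the vocabulary
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # no subword matches the remaining text
            return ["<unk>"]
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokenize("unwanted", vocab))   # ['un', 'want', 'ed']
print(subword_tokenize("football", vocab))   # ['foot', 'ball']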


To implement subword tokenization we need some text data to work on, or we can start with a small string as a base test case. First, we tokenize the given text into words. Take the example string "Geeksforgeeks ! is-best, for. @geeks don't". If we simply split it on spaces, we get tokens such as "is-best", where a special character is glued to a word and needlessly inflates our vocabulary, so we need to fix that: every special character should become a token of its own. The expected vocabulary is then ["geeksforgeeks", "!", "is", "-", "best", ",", "for", ".", "@", "geeks", "don", "'", "t"]. We also convert the text to lowercase. The code below implements this word-level tokenization:

Python3




import re

test_str = """
GeeksforGeeks is a fantastic resource for geeks
who are looking to enhance their programming skills,
and if you're a geek who wants to become an expert programmer,
then GeeksforGeeks is definitely the go-to place for geeks like you.
"""
# printing the original string
print("The original string is : " + str(test_str))

# convert to lowercase so the vocabulary is case-insensitive
test_str = test_str.lower()

# \w+ matches runs of word characters, [^\s\w]+ matches runs of punctuation,
# so words and special characters become separate tokens
res = re.findall(r'\w+|[^\s\w]+', test_str)

# printing the result
print("The converted string :\n" + str(res))


Output:

The original string is : 
GeeksforGeeks is a fantastic resource for geeks 
who are looking to enhance their programming skills, 
and if you're a geek who wants to become an expert programmer, 
then GeeksforGeeks is definitely the go-to place for geeks like you.

The converted string :
['geeksforgeeks', 'is', 'a', 'fantastic', 'resource', 'for', 'geeks', 'who', 'are', 'looking', 'to',
'enhance','their', 'programming', 'skills', ',','and', 'if', 'you', "'", 're', 'a', 'geek', 'who', 'wants', 
'to', 'become', 'an', 'expert', 'programmer', ',', 'then', 'geeksforgeeks', 'is', 'definitely', 'the', 'go', 
 '-', 'to', 'place', 'for', 'geeks', 'like', 'you', '.']

Since every distinct word becomes its own token, this produces a very large dictionary, which is why word-level tokenization suffers from an exploding-vocabulary problem. To get around it, we tokenize at the character level instead; character tokens keep the vocabulary small. For that, we build a dictionary that maps each word from the word-tokenized text, with its characters separated by spaces, to its frequency. The code below implements this step:

Python3




from collections import OrderedDict

# map each word, with its characters separated by spaces, to its frequency
res_dict = OrderedDict()
for word in res:
    spaced_word = ' '.join(char for char in word)
    if spaced_word in res_dict:
        res_dict[spaced_word] += 1
    else:
        res_dict[spaced_word] = 1

res_dict


Output:

OrderedDict([('g e e k s f o r g e e k s', 2),
             ('i s', 2),
             ('a', 2),
             ('f a n t a s t i c', 1),
             ('r e s o u r c e', 1),
             ('f o r', 2),
             ('g e e k s', 2),
             ('w h o', 2),
             ('a r e', 1),
             ('l o o k i n g', 1),
             ('t o', 3),
             ('e n h a n c e', 1),
             ('t h e i r', 1),
             ('p r o g r a m m i n g', 1),
             ('s k i l l s', 1),
             (',', 2),
             ('a n d', 1),
             ('i f', 1),
             ('y o u', 2),
             ("'", 1),
             ('r e', 1),
             ('g e e k', 1),
             ('w a n t s', 1),
             ('b e c o m e', 1),
             ('a n', 1),
             ('e x p e r t', 1),
             ('p r o g r a m m e r', 1),
             ('t h e n', 1),
             ('d e f i n i t e l y', 1),
             ('t h e', 1),
             ('g o', 1),
             ('-', 1),
             ('p l a c e', 1),
             ('l i k e', 1),
             ('.', 1)])

Byte-Pair Encoding (BPE)

Byte-Pair Encoding (BPE) is a popular subword tokenization approach in natural language processing. Starting from a character-level representation of the corpus, the most frequent pair of adjacent symbols is iteratively merged into a new symbol until the required vocabulary size is reached. This process produces a collection of subword units that can represent any word as a list of subword tokens, which helps handle rare and out-of-vocabulary (OOV) words while keeping the vocabulary size small.

At each step, the algorithm counts the occurrences of every adjacent symbol pair, weighted by word frequency, and merges the pair with the highest count. In our dictionary, for example, the entry ('g e e k s f o r g e e k s', 2) contains the pair ('g', 'e') twice and therefore contributes 4 to its count, while ('g e e k s', 2) contributes 2 more. The code below implements Byte-Pair Encoding:

Python3




import re, collections

def get_stats(vocab):
    # count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    # replace every occurrence of the pair "a b" with the merged symbol "ab"
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out


def byte_pair_encoding(vocab):
    pairs = get_stats(vocab)

    while pairs:
        # find the highest pair frequency
        max_freq = max(pairs.values())

        # collect every pair that reaches that frequency
        best_pairs = [pair for pair, freq in pairs.items() if freq == max_freq]

        # merge the most frequent pairs
        for pair in best_pairs:
            vocab = merge_vocab(pair, vocab)
        pairs = get_stats(vocab)
    return vocab.keys()

byte_pair_encoding(res_dict)


Output:

dict_keys(['geeksforgeeks', 'is', 'a', 'fantastic', 'resource', 'for', 'geeks', 
'who', 'are', 'looking', 'to', 'enhance', 'their', 'programming', 'skills', ',', 'and', 
'if', 'you', "'", 're', 'geek', 'wants', 'become', 'an', 'expert', 'programmer', 
'then', 'definitely', 'the', 'go', '-', 'place', 'like', '.'])
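
The loop above keeps merging until no pairs remain, so every word collapses back into a single token. Standard BPE instead stops after a fixed number of merges (or once the vocabulary reaches a target size), which is what produces genuine subword units. The snippet below is a minimal sketch of that variant, reusing get_stats and merge_vocab from above; num_merges is an illustrative parameter, not part of the original code:

Python3

def byte_pair_encoding_limited(vocab, num_merges=10):
    # perform at most num_merges merges, each time merging
    # only the single most frequent pair
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
    return vocab

# frequent symbol pairs are merged first, so common words end up in
# fewer pieces while rare words stay split into subword pieces
print(list(byte_pair_encoding_limited(dict(res_dict), num_merges=10).keys()))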

