Natural Language Processing (NLP) is a subfield of artificial intelligence that gives machines the ability to understand and process human language. Tokenization is the process of dividing a string of text into a collection of tokens. Tokenization is often the first step in natural language processing tasks such as text classification, named entity recognition, and sentiment analysis. The resulting tokens are typically used as input to further processing steps, such as vectorization, where the tokens are converted into numerical representations for machine learning models to use.
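As a quick illustration (a minimal sketch using Python's standard library; production tokenizers are far more sophisticated), a string can be split into word and punctuation tokens with a regular expression:

```python
import re

def tokenize(text):
    # Match runs of word characters, or single non-space punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Tokenization is the first step!")
print(tokens)  # ['Tokenization', 'is', 'the', 'first', 'step', '!']
```
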
Byte-Pair Encoding (BPE) is a compression algorithm used in Natural Language Processing (NLP) to represent a large vocabulary with a small set of subword units. Originally a data-compression technique, it was adapted for NLP by Sennrich et al. in 2016 and has since been widely used in NLP tasks such as machine translation, text classification, and text generation. The basic idea of BPE is to iteratively merge the most frequent pair of consecutive bytes or characters in a text corpus until a predefined vocabulary size is reached. The resulting subword units can be used to represent the original text in a more compact and efficient way.
Concepts related to BPE:
- Vocabulary: A set of subword units that can be used to represent a text corpus.
- Byte: A unit of digital information that typically consists of eight bits.
- Character: A symbol that represents a written or printed letter or numeral.
- Frequency: The number of times a byte or character occurs in a text corpus.
- Merge: The process of combining two consecutive bytes or characters to create a new subword unit.
Steps involved in BPE:
- Initialize the vocabulary with all the bytes or characters in the text corpus
- Calculate the frequency of each byte or character in the text corpus.
- Repeat the following steps until the desired vocabulary size is reached:
- Find the most frequent pair of consecutive bytes or characters in the text corpus
- Merge the pair to create a new subword unit.
- Update the frequency counts of all the bytes or characters that contain the merged pair.
- Add the new subword unit to the vocabulary.
- Represent the text corpus using the subword units in the vocabulary.
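The counting and merging steps above can be sketched for a single merge iteration (a minimal illustration on the toy corpus used below; the full implementation appears later in this article):

```python
from collections import Counter

# Toy vocabulary: each word split into symbols, mapped to its corpus frequency
vocab = {("a", "b"): 1, ("b", "c"): 1, ("b", "c", "d"): 1, ("c", "d", "e"): 1}

# Count every pair of adjacent symbols, weighted by word frequency
pairs = Counter()
for symbols, freq in vocab.items():
    for left, right in zip(symbols, symbols[1:]):
        pairs[(left, right)] += freq

best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
print(best, pairs[best])          # ('b', 'c') 2
```
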
How Byte-Pair Encoding (BPE) works:
Suppose we have a text corpus with the following four words: “ab”, “bc”, “bcd”, and “cde”. The initial vocabulary consists of all the bytes or characters in the text corpus: {“a”, “b”, “c”, “d”, “e”}.
Step 1: Initialize the vocabulary
Vocabulary = {"a", "b", "c", "d", "e"}
Step 2: Calculate the frequency
Frequency = {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1}
Step 3a: Find the most frequent pair of two characters
The most frequent pair is "bc", occurring 2 times (tied with "cd"; ties can be broken arbitrarily).
Step 3b: Merge the pair
Merge "b" and "c" to create a new subword unit "bc".
Step 3c: Update frequency counts
Each of the 2 occurrences of "bc" consumes one "b" and one "c", so update the counts accordingly:
Frequency = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1, "bc": 2}
Step 3d: Add the new subword unit to the vocabulary
Add “bc” to the vocabulary:
Vocabulary = {"a", "b", "c", "d", "e", "bc"}
Repeat steps 3a-3d until the desired vocabulary size is reached.
Step 4: Represent the text corpus using subword units
Suppose the merges continue until the vocabulary contains the following subword units: {"a", "b", "c", "d", "e", "bc", "cd", "de", "ab", "bcd", "cde"}.
Using the merges learned in the first few iterations, each word of the original corpus can be segmented into subword units, for example:
"ab" -> "a" + "b"
"bc" -> "bc"
"bcd" -> "bc" + "d"
"cde" -> "c" + "de"
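The segmentation above can be reproduced by replaying the learned merges in order. This sketch assumes the merges ("b","c") and ("d","e") were learned first (one possible outcome, since frequency ties can be broken either way):

```python
def segment(word, merges):
    # Start from individual characters and apply each merge in learned order
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # fuse the adjacent pair
            else:
                i += 1
    return symbols

merges = [("b", "c"), ("d", "e")]  # assumed merge order for illustration
for w in ["ab", "bc", "bcd", "cde"]:
    print(w, "->", segment(w, merges))
```
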
Here’s an implementation of Byte-Pair Encoding (BPE) in Python:
Python3
import re
from collections import defaultdict

def get_stats(vocab):
    # Count the frequency of each adjacent symbol pair across the vocabulary
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every standalone occurrence of the pair with the merged symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_vocab(data):
    # Split each word into characters and append the end-of-word marker </w>
    vocab = defaultdict(int)
    for line in data:
        for word in line.split():
            vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def byte_pair_encoding(data, n):
    vocab = get_vocab(data)
    for i in range(n):
        pairs = get_stats(vocab)
        if not pairs:  # stop early once every word is fully merged
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
    return vocab
# Sample corpus: the text whose word counts appear in the output below
corpus = '''Tokenization is the process of breaking down a sequence of text
into smaller units called tokens, which can be words, phrases, or even
individual characters. Tokenization is often the first step in natural
languages processing tasks such as text classification, named entity
recognition, and sentiment analysis. The resulting tokens are typically
used as input to further processing steps, such as vectorization, where
the tokens are converted into numerical representations for machine
learning models to use.'''
data = corpus.split('.')

n = 230
bpe_pairs = byte_pair_encoding(data, n)
print(bpe_pairs)
Output:
{'Tokenization</w>': 2,
'is</w>': 2,
'the</w>': 3,
'process</w>': 1,
'of</w>': 2,
'breaking</w>': 1,
'down</w>': 1,
'a</w>': 1,
'sequence</w>': 1,
'text</w>': 2,
'into</w>': 2,
'smaller</w>': 1,
'units</w>': 1,
'called</w>': 1,
'tokens,</w>': 1,
'which</w>': 1,
'can</w>': 1,
'be</w>': 1,
'words,</w>': 1,
'phrases,</w>': 1,
'or</w>': 1,
'even</w>': 1,
'individual</w>': 1,
'characters</w>': 1,
'often</w>': 1,
'first</w>': 1,
'step</w>': 1,
'in</w>': 1,
'natural</w>': 1,
'languages</w>': 1,
'processing</w>': 2,
'tasks</w>': 1,
'such</w>': 2,
'as</w>': 3,
'classification,</w>': 1,
'named</w>': 1,
'entity</w>': 1,
'recognition,</w>': 1,
'and</w>': 1,
'sentiment</w>': 1,
'analysis</w>': 1,
'The</w>': 1,
'resulting</w>': 1,
'tokens</w>': 2,
'are</w>': 2,
'typically</w>': 1,
'used</w>': 1,
'input</w>': 1,
'to</w>': 2,
'further</w>': 1,
'steps,</w>': 1,
'vectorization,</w>': 1,
'where</w>': 1,
'converted</w>': 1,
'numerical</w>': 1,
'representations</w>': 1,
'for</w>': 1,
'machine</w>': 1,
'learning</w>': 1,
'models</w>': 1,
'use</w>': 1}
The output is the vocabulary learned by running byte pair encoding (BPE) for up to 230 merge operations on the given corpus. Each key is a word whose characters have been fully merged back into a single token (with the end-of-word marker </w> attached), and each value is that word's frequency in the corpus. The algorithm iteratively merges the most frequent pair of symbols until the merge budget is exhausted or no adjacent pairs remain.