Natural Language Processing (NLP) is a subfield of artificial intelligence that gives machines the ability to understand and process human language. Tokenization is the process of dividing a string of text into a collection of tokens. Tokenization is often the first step in natural language processing tasks such as text classification, named entity recognition, and sentiment analysis. The resulting tokens are typically used as input to further processing steps, such as vectorization, where the tokens are converted into numerical representations for machine learning models to use.
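As a quick illustration (a minimal sketch using Python's standard library; production tokenizers are far more sophisticated), a string can be split into word and punctuation tokens with a regular expression:

```python
import re

def tokenize(text):
    # Match runs of word characters, or single non-space punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Tokenization is the first step!")
print(tokens)  # ['Tokenization', 'is', 'the', 'first', 'step', '!']
```
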
Byte-Pair Encoding (BPE) is a compression algorithm used in Natural Language Processing (NLP) to represent a large vocabulary with a small set of subword units. Originally a data-compression technique, it was adapted for NLP by Sennrich et al. in 2016 and has since been widely used in NLP tasks such as machine translation, text classification, and text generation. The basic idea of BPE is to iteratively merge the most frequent pair of consecutive bytes or characters in a text corpus until a predefined vocabulary size is reached. The resulting subword units can be used to represent the original text in a more compact and efficient way.
Concepts related to BPE:
- Vocabulary: A set of subword units that can be used to represent a text corpus.
- Byte: A unit of digital information that typically consists of eight bits.
- Character: A symbol that represents a written or printed letter or numeral.
- Frequency: The number of times a byte or character occurs in a text corpus.
- Merge: The process of combining two consecutive bytes or characters to create a new subword unit.
Steps involved in BPE:
- Initialize the vocabulary with all the bytes or characters in the text corpus
- Calculate the frequency of each byte or character in the text corpus.
- Repeat the following steps until the desired vocabulary size is reached:
- Find the most frequent pair of consecutive bytes or characters in the text corpus
- Merge the pair to create a new subword unit.
- Update the frequency counts of all the bytes or characters that contain the merged pair.
- Add the new subword unit to the vocabulary.
- Represent the text corpus using the subword units in the vocabulary.
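The counting and merging steps above can be sketched for a single merge iteration (a minimal illustration on the toy corpus used below; the full implementation appears later in this article):

```python
from collections import Counter

# Toy vocabulary: each word split into symbols, mapped to its corpus frequency
vocab = {("a", "b"): 1, ("b", "c"): 1, ("b", "c", "d"): 1, ("c", "d", "e"): 1}

# Count every pair of adjacent symbols, weighted by word frequency
pairs = Counter()
for symbols, freq in vocab.items():
    for left, right in zip(symbols, symbols[1:]):
        pairs[(left, right)] += freq

best = max(pairs, key=pairs.get)  # most frequent pair (first one on ties)
print(best, pairs[best])          # ('b', 'c') 2
```
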
How Byte-Pair Encoding (BPE) works:
Suppose we have a text corpus with the following four words: “ab”, “bc”, “bcd”, and “cde”. The initial vocabulary consists of all the bytes or characters in the text corpus: {“a”, “b”, “c”, “d”, “e”}.
Step 1: Initialize the vocabulary
Vocabulary = {"a", "b", "c", "d", "e"}
Step 2: Calculate the frequency
Frequency = {"a": 1, "b": 3, "c": 3, "d": 2, "e": 1}
Step 3a: Find the most frequent pair of two characters
The most frequent pair is "bc", occurring 2 times (tied with "cd"; ties can be broken arbitrarily).
Step 3b: Merge the pair
Merge "b" and "c" to create a new subword unit "bc".
Step 3c: Update frequency counts
Each of the 2 occurrences of "bc" consumes one "b" and one "c", so update the counts accordingly:
Frequency = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 1, "bc": 2}
Step 3d: Add the new subword unit to the vocabulary
Add “bc” to the vocabulary:
Vocabulary = {"a", "b", "c", "d", "e", "bc"}
Repeat steps 3a-3d until the desired vocabulary size is reached.
Step 4: Represent the text corpus using subword units
Suppose the merges continue until the vocabulary contains the following subword units: {"a", "b", "c", "d", "e", "bc", "cd", "de", "ab", "bcd", "cde"}.
Using the merges learned in the first few iterations, each word of the original corpus can be segmented into subword units, for example:
"ab" -> "a" + "b"
"bc" -> "bc"
"bcd" -> "bc" + "d"
"cde" -> "c" + "de"
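The segmentation above can be reproduced by replaying the learned merges in order. This sketch assumes the merges ("b","c") and ("d","e") were learned first (one possible outcome, since frequency ties can be broken either way):

```python
def segment(word, merges):
    # Start from individual characters and apply each merge in learned order
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]  # fuse the adjacent pair
            else:
                i += 1
    return symbols

merges = [("b", "c"), ("d", "e")]  # assumed merge order for illustration
for w in ["ab", "bc", "bcd", "cde"]:
    print(w, "->", segment(w, merges))
```
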
Here’s an implementation of Byte-Pair Encoding (BPE) in Python:
Python3
import re
from collections import defaultdict

def get_stats(vocab):
    # Count the frequency of each adjacent symbol pair across the vocabulary
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every standalone occurrence of the pair with the merged symbol
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def get_vocab(data):
    # Split each word into characters and append the end-of-word marker </w>
    vocab = defaultdict(int)
    for line in data:
        for word in line.split():
            vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab

def byte_pair_encoding(data, n):
    vocab = get_vocab(data)
    for i in range(n):
        pairs = get_stats(vocab)
        if not pairs:  # stop early once every word is fully merged
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
    return vocab
# Sample corpus: the text whose word counts appear in the output below
corpus = '''Tokenization is the process of breaking down a sequence of text
into smaller units called tokens, which can be words, phrases, or even
individual characters. Tokenization is often the first step in natural
languages processing tasks such as text classification, named entity
recognition, and sentiment analysis. The resulting tokens are typically
used as input to further processing steps, such as vectorization, where
the tokens are converted into numerical representations for machine
learning models to use.'''
data = corpus.split('.')

n = 230
bpe_pairs = byte_pair_encoding(data, n)
print(bpe_pairs)
Output:
{'Tokenization</w>': 2,
'is</w>': 2,
'the</w>': 3,
'process</w>': 1,
'of</w>': 2,
'breaking</w>': 1,
'down</w>': 1,
'a</w>': 1,
'sequence</w>': 1,
'text</w>': 2,
'into</w>': 2,
'smaller</w>': 1,
'units</w>': 1,
'called</w>': 1,
'tokens,</w>': 1,
'which</w>': 1,
'can</w>': 1,
'be</w>': 1,
'words,</w>': 1,
'phrases,</w>': 1,
'or</w>': 1,
'even</w>': 1,
'individual</w>': 1,
'characters</w>': 1,
'often</w>': 1,
'first</w>': 1,
'step</w>': 1,
'in</w>': 1,
'natural</w>': 1,
'languages</w>': 1,
'processing</w>': 2,
'tasks</w>': 1,
'such</w>': 2,
'as</w>': 3,
'classification,</w>': 1,
'named</w>': 1,
'entity</w>': 1,
'recognition,</w>': 1,
'and</w>': 1,
'sentiment</w>': 1,
'analysis</w>': 1,
'The</w>': 1,
'resulting</w>': 1,
'tokens</w>': 2,
'are</w>': 2,
'typically</w>': 1,
'used</w>': 1,
'input</w>': 1,
'to</w>': 2,
'further</w>': 1,
'steps,</w>': 1,
'vectorization,</w>': 1,
'where</w>': 1,
'converted</w>': 1,
'numerical</w>': 1,
'representations</w>': 1,
'for</w>': 1,
'machine</w>': 1,
'learning</w>': 1,
'models</w>': 1,
'use</w>': 1}
The output is the vocabulary learned by running byte pair encoding (BPE) for up to 230 merge operations on the given corpus. Each key is a word whose characters have been fully merged back into a single token (with the end-of-word marker </w> attached), and each value is that word's frequency in the corpus. The algorithm iteratively merges the most frequent pair of symbols until the merge budget is exhausted or no adjacent pairs remain.