[토크나이저 핵심색인마킹]
Contents
1. 서론
현재 많은 NLP 모델들은 텍스트를 서브워드 단위로 토크나이징하여 처리합니다.
이 과정은 효율적인 언어 이해와 생성을 위한 기본적인 전처리 단계로, Out-Of-Vocabulary(OOV) 문제를 줄이는 데 기여합니다.
본 논문에서는 BERT에서 사용되는 WordPiece tokenization 방법을 대상으로, 기존의 최적화되지 않은 알고리즘을 개선한 새로운 방법을 제안합니다.
2. 선행 연구
WordPiece는 기존에 중국어 분절에서 사용되던 MaxMatch 알고리즘을 기반으로 합니다.
이 알고리즘은 최대 매칭 길이를 우선적으로 선택하여 토큰화를 수행하며, 이는 서브워드 tokenization에서도 유용하지만, 이전 알고리즘들은 대체로 \(O(n^2)\) 또는 \(O(nm)\)의 시간 복잡도를 가지고 있어 대용량 텍스트 처리에는 비효율적일 수 있습니다.
3. LinMaxMatch 알고리즘
LinMaxMatch 알고리즘은 Aho-Corasick 알고리즘에 기반하여 트라이 자료구조를 사용합니다. 각 노드에 실패 링크(failure link)와 실패 팝(failure pop)을 도입하여, 매칭 실패 시 불필요한 백트래킹 없이 다음 가능한 노드로 빠르게 전환할 수 있습니다.
3.1 실패 링크와 실패 팝
실패 링크는 현재 노드에서 매칭 실패 시 이동할 다음 노드를 가리킵니다. 실패 팝은 현재 노드까지 매칭된 문자열 중 토큰으로 인식된 부분을 저장합니다. 이런 구조는 tokenization 과정에서 발생할 수 있는 불필요한 재검사를 최소화하여 처리 속도를 향상시킵니다.
3.2 tokenization 프로세스
입력 문자열은 트라이를 통해 한 문자씩 매칭을 시도하며 진행됩니다. 매칭이 실패할 경우, 실패 링크를 따라 이동하고, 실패 팝에 저장된 토큰들을 추출하여 결과 리스트에 추가합니다. 이 과정을 문자열의 끝까지 반복함으로써 \(O(n)\)의 시간 복잡도로 tokenization합니다.
4. E2E WordPiece tokenization
전처리 단계와 WordPiece tokenization을 하나의 단계로 통합한 E2E WordPiece 알고리즘을 제안합니다. 이 방법은 입력을 한 번만 처리하여 효율성을 극대화하며, 모든 tokenization 작업을 선형 시간 안에 처리할 수 있습니다.
5. 실험 및 결과
HuggingFace의 토크나이저와 TensorFlow Text와의 비교 실험을 통해, 제안된 방법이 기존 방법보다 각각 8.2배, 5.1배 빠른 tokenization 성능을 보임을 확인했습니다. 이런 성능 향상은 특히 모바일 NLP 애플리케이션 또는 대규모 웹 서비스에서의 응답 시간 개선에 기여할 수 있습니다.
6. 결론
본 논문에서 제안한 LinMaxMatch 및 E2E WordPiece 알고리즘은 서브워드 tokenization의 효율성을 향상시키며 NLP 모델의 전체적인 성능 향상에 기여합니다. 특히 대용량 텍스트 처리가 필요한 애플리케이션에 있어 중요한 발전을 의미합니다. 또한, 이 알고리즘은 다른 문자열 매칭 또는 재작성 문제에도 적용 가능하므로 범용적으로 사용할 수 있습니다.
[참고자료 1] MaxMatch Algorithm(최대 매칭 알고리즘) 이해하기
MaxMatch Algorithm은 공백 없이 연결된 단어를 분리하는 데 주로 사용됩니다. 예를 들어 중국어와 같은 언어나 잘못된 공백이 있는 영어 텍스트에서 사용됩니다. 이 알고리즘은 주어진 문자열에서 왼쪽부터 시작하여 가능한 가장 긴 단어를 찾는 것을 목표로 합니다.
알고리즘이 작동하는 순서도
예시
“thisisinsane”이라는 문자열을 가지고 알고리즘은 다음과 같이 작동합니다.
최종 분리된 출력은 [‘this’, ‘is’, ‘insane’]입니다.
파이썬 구현
nltk.corpus
를 사용하여 영어 단어 vocab에 접근합니다.코드
from nltk.corpus import words
string = "thisisinsane" # 분리할 문자열 입력
tokens = [ ] # 세그먼트를 저장할 리스트
lowercaseCorpus = [x.lower() for x in words.words()] # vocab 단어를 소문자로 변환
i = 0
while i < len(string): # 문자열의 각 문자를 끝까지 반복
maxWord = "" # 발견된 가장 긴 단어를 초기화
for j in range(i, len(string)): # 인덱스 'i'에서 시작하는 각 가능한 단어를 확인
tempWord = string[i:j+1] # 인덱스 i에서 j까지의 단어 형성
if tempWord in lowercaseCorpus and len(tempWord) > len(maxWord): # 형성된 단어가 vocab에 있고 이전에 찾은 단어보다 길다면
maxWord = tempWord # 가장 긴 단어 갱신
i += len(maxWord) # 가장 긴 단어의 끝으로 인덱스 이동
tokens.append(maxWord) # 가장 긴 단어를 토큰 리스트에 추가
print(tokens) # 세그먼트된 단어 출력
출력
['this', 'is', 'insane']
입력 문자열의 각 문자를 반복하면서 모든 가능한 부분 문자열을 알려진 단어 목록과 비교하고, 유효한 단어를 찾을 때마다 그 단어가 같은 시작점에서 찾은 단어들 중 가장 길 경우 해당 단어를 vocab에 추가합니다.
가장 긴 단어를 확인한 후에는 이 단어를 넘어서 계속해서 프로세스를 진행합니다.
이 방식은 공백 없이 연결된 텍스트를 이해하기 쉬운 단어로 효과적으로 분리하는 데 사용되며, 특히 단어 사이에 공백이 없는 언어나 텍스트를 처리할 때 유용하게 사용됩니다.
[참고자료 2] Fast WordPiece Tokenizer & Aho-Corasick
Fast WordPiece Tokenizer는 BERT와 같은 언어 모델에서 사용되는 서브워드 기반의 토큰화 방법입니다. 이 토크나이저는 문자열을 더 작은 단위인 서브워드로 분할하며, 자주 등장하는 단어는 그대로 유지하고, 드물게 등장하는 단어는 더 작은 의미 단위로 분해합니다. 이 방식은 어휘 집합의 크기를 효율적으로 관리하면서 다양한 단어의 변형을 더 잘 처리할 수 있게 해줍니다.
vocabulary == vocab
[Fast WordPiece Tokenizer]
##
으로 시작하는 서브워드)을 사용하여 분할합니다.입력: “tokenization”
[Aho-Corasick 알고리즘]
Aho-Corasick 알고리즘은 텍스트에서 여러 키워드를 효과적으로 검색하는 데 사용됩니다. 이 알고리즘은 모든 키워드를 트라이(trie) 자료 구조에 저장하고, 트라이를 통해 텍스트를 한 번만 스캔하여 모든 키워드의 발생을 찾으며, 특히 여러 패턴을 동시에 검색할 때 유용합니다.
Aho-Corasick 알고리즘 예시 코드
from collections import deque
class TrieNode:
def __init__(self):
self.children = {}
self.fail_link = None
self.output_link = None
class AhoCorasick:
def __init__(self, words):
self.root = TrieNode()
self.build_trie(words)
self.build_failure_links()
def build_trie(self, words):
for word in words:
node = self.root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.output_link = word # 마지막 노드에 단어 표시
def build_failure_links(self):
queue = deque([self.root])
while queue:
current = queue.popleft()
for char, child in current.children.items():
queue.append(child)
fail = current.fail_link
while fail and char not in fail.children:
fail = fail.fail_link
child.fail_link = fail.children[char] if fail else self.root
if child.fail_link.output_link:
child.output_link = child.fail_link.output_link
def search(self, text):
node = self.root
results = [ ]
for char in text:
while node and char not in node.children:
node = node.fail_link
if node:
node = node.children[char]
if node.output_link:
results.append(node.output_link)
return results
aho = AhoCorasick(["he", "she", "his", "hers"])
found = aho.search("ushers")
print(found)
출력
['he', 'she', 'hers']
이 코드는 여러 키워드가 포함된 트라이를 생성하고, 실패 링크를 구성하여 텍스트를 효율적으로 검색합니다. 텍스트 내에서 여러 키워드가 발견될 때마다 해당 키워드를 출력하여 복잡도를 낮춥니다.
Tokenization is the process of splitting text into smaller units called tokens (e.g., words). It is a fundamental preprocessing step for almost all NLP applications: sentiment analysis, question answering, machine translation, information retrieval, etc. Modern NLP models like BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), and XLNet (Yang et al., 2019) tokenize text into subword units (Schuster and Nakajima, 2012; Sennrich et al., 2016; Kudo, 2018). As a midpoint between words and characters, subword units retain linguistic meaning (like morphemes), while alleviating out-of-vocabulary situations even with a relatively small-size vocabulary.
In this paper, we propose efficient algorithms for WordPiece, the subword tokenization used in BERT (Devlin et al., 2019). Given Unicode text that has already been cleaned up and normalized, WordPiece has two steps: (1) pre-tokenize the text into words (by splitting on punctuation and whitespaces), and (2) tokenize each word into wordpieces. For single-word tokenization, WordPiece uses a greedy longest-match-first strategy: iteratively pick the longest prefix of the remaining text that matches a vocabulary token. This is well-known as Maximum Matching or MaxMatch (Palmer, 2000), which has also been used for Chinese word segmentation since 1980s (Liu and Liang, 1986).
Despite its wide use in NLP for decades, to the best of our knowledge, the most efficient MaxMatch algorithms so far are 𝑂(𝑛2) (where 𝑛 is the input word length) or 𝑂(𝑛𝑚) (where 𝑚 is the maximum vocabulary token length) (see Section 2). It’s worth noting that the latter has a vocabularyspecific multiplicative factor 𝑚, which can be large when the vocabulary contains long words.
We propose LinMaxMatch, a novel MaxMatch algorithm for WordPiece tokenization, whose tokenization time is strictly 𝑂(𝑛) without any vocabulary-specific multiplicative factors. Inspired by the Aho-Corasick algorithm (Aho and Corasick, 1975), we organize vocabulary tokens in a trie (Fredkin, 1960) and introduce precomputed failure links and failure pops. During tokenization, if an input character does not match any trie edge, we perform smart transitions to avoid backtracking to earlier input characters. This involves collecting the recognized tokens (i.e., failure pops) and moving to a trie node (via the failure link), from where we continue to match the same character (Section 3).
For general text tokenization, referred to as end-to-end tokenization in this paper, we propose E2E WordPiece, an end-to-end algorithm that combines pre-tokenization and WordPiece tokenization into a single, linear-time pass (Section 4).
Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers (HuggingFace, 2020) and 5.1x faster than TensorFlow Text (Google, 2020) on average for general text tokenization (Section 5).
Although tokenization is relatively faster than other steps, it’s still worth improving the performance: Tokenization is a prerequisite step for almost all NLP tasks, and any improvement on its efficiency helps reduce the latency of the entire inference. One potential impact of the work, for example, is on mobile NLP applications. Ondevice models are generally highly optimized for reducing latency, e.g., by distilling or compressing larger models. Thus, the impact of tokenization can be significant here. Another impact is on aggregate computational savings for Web services like Google, Facebook, Twitter, etc. For example, Google uses BERT to power its Web search nowadays.1 Google serves billions of search queries per day, and it processes hundreds of trillions of Web pages in index building. By employing a faster tokenization system, the aggregate computational savings would be material, which also benefits the environment (for less power consumption).
This paper also makes a theoretical contribution. The proposed LinMaxMatch algorithm solves the decades-old MaxMatch problem in the optimal 𝑂(𝑛) time, and the idea is applicable to other string matching or rewriting problems (Section 3.6).
The code will be available at https://www.tensorflow.org/text.
Maximum Matching (or MaxMatch) has been used for Chinese word segmentation (CWS) since the 1980s (Liu and Liang, 1986; Palmer, 2000). Recent CWS work focuses on machine learning-based segmentation approaches, but MaxMatch remains a commonly referenced baseline (Chang et al., 2008). More recently, subword tokenization techniques have become a near-universal feature of modern NLP models, including BERT (Devlin et al., 2019), GPT-3 (Brown et al., 2020), XLNet (Yang et al., 2019), etc. Common subword tokenization techniques include Byte-Pair Encoding (BPE) (Schuster and Nakajima, 2012; Sennrich et al., 2016), SentencePiece (Kudo, 2018) (based on unigram language modeling), and WordPiece (Google, 2018). The widely-adopted MaxMatch algorithm, which is used in the original WordPiece algorithm (Google, 2018), starts from the longest possible prefix and decrements the length in search of the longest-matching token (Jie et al., 1989). A variant starts from the shortest substring and increases the length (Webster and Kit, 1992; Reps, 1998; Sassano, 2014). The worst-case time complexity of the previous algorithms are \(O(n^2)\) or \(O(nm)\) or even higher than that. For example, the complexity of Sassano (2014) is \(O(nm)\) (in our notations), Since Lookup(t,c,i,N) (Figure 1 in their paper) may take $O(m)$ time (which is similar to the analysis in Section 3.2 of this paper). Reps (1998) recognizes maximum matching tokens using regular expressions in the context of compilers; their complexity is $O(\lvert Q\rvert n)$, where $\lvert Q\rvert$ is the number of states in the automaton built from the grammar/vocabulary. If applied to WordPiece tokenization, since vocabulary tokens are finite strings, their complexity can be refined as $O(nm)$.
Our algorithm is inspired by the Aho-Corasick algorithm (Aho and Corasick, 1975), but the two algorithms are designed to address different problems. Aho-Corasick is not optimal for the MaxMatch problem. In the worst-case scenario where every substring in the input matches a vocabulary token, Aho-Corasick finds a quadratic number of matches, resulting in an overall quadratic complexity for MaxMatch. By comparison, our algorithm achieves the worst-case linear complexity for MaxMatch due to a novel definition of failure links, the newly-introduced failure pops, as well as a different way of emitting tokens.
It’s worth clarifying the difference between our failure links and the tabular solution of Reps (1998). In their work, a table called failed_previously
is used to store whether a state \(\langle q, i \rangle\) has been seen before in a failed attempt to match a token (where \(q\) is a state of the automaton and \(i\) is a position of the input). Reps (1998) uses that table to avoid wasteful revisits of the same state. The table entries \(\langle q, i \rangle\) depend on both the grammar/vocabulary and the actual input. In contrast, our failure links capture which state to transit to when trie matching cannot continue (Definition 1), and they are precomputed based on the vocabulary only, independent of the input.
Finally, we discuss the complexity of algorithms for Byte-Pair Encoding (BPE) (Schuster and Nakajima, 2012; Sennrich et al., 2016) and SentencePiece (Kudo, 2018). Note that they are different problems from MaxMatch (the topic of this paper). SentencePiece is based on unigram language modeling, and the optimal segmentation can be found in \(O(nm)\) time with the Viterbi algorithm (Viterbi, 1967). BPE algorithms can be implemented in two ways. One is to enumerate the symbol pairs in the order that they were added to the vocabulary in the building phase. For each symbol pair, we scan the current sequence and replace all their occurrences with the merged symbol. The complexity is \(O(\lvert V\rvert n)\), where \(\lvert V\rvert\) is the size of the vocabulary. The other approach is to repeatedly select the pair of symbols from the current sequence that has the highest priority (e.g., the maximum frequency). Using a heap, this approach can be done in $O(n \log n)$.
2 The exact complexity depends on implementation details, e.g., whether substring hashes are computed from scratch or incrementally, how substrings are searched in vocabulary, etc.
3 Previous studies usually do not explicitly state the vocabulary-related multiplicative factor in the complexity, or just treat it as a hidden constant.
In this section, we present LinMaxMatch, an 𝑂(𝑛) algorithm for single-word WordPiece tokenization.
Given a vocabulary, WordPiece tokenizes a word using the MaxMatch approach: iteratively pick the longest prefix of the remaining text that matches a vocabulary token until the entire word is segmented. If a word cannot be tokenized, the entire word is mapped to a special token <unk>
.
WordPiece tokenization distinguishes wordpieces at the start of a word from wordpieces starting in the middle. The latter start with a special symbol ##
(in BERT), which is called the suffix indicator and is denoted as ♯
in this paper. Our method works with any suffix indicator: ##, an arbitrary string, or the empty string (i.e., no distinction between the two kinds of wordpieces).
For example, the word “johanson” may be tokenized as \([ \text{johan}, \text{##son} ]\).
We use the running example from Figure 1. Table 1 summarizes our notations. We construct a trie from the vocabulary \(V\). We use \(\delta(u, c) = v\) to denote a trie edge from node \(u\) to node \(v\) with character \(c\) as the label. If there is no outgoing edge from \(u\) with label \(c\), \(\delta(u, c) = \emptyset\). Let \(\chi_v\) be the string represented by the node \(v\), that is, the string obtained by concatenating all the edge labels along the path from the root to node \(v\). Let \(r\) be the root of the trie and \(r♯\) be the node for the suffix indicator ♯. Obviously, \(\chi_r = \epsilon\) (where \(\epsilon\) denotes the empty string) and \(\chi_{r♯} = ♯\). The depth of node \(v\) is defined as the number of characters in \(\chi_v\) excluding the suffix indicator prefix (if any). Hence, the depth of \(r\) or \(r♯\) is 0. In Figure 1a, nodes 0 and 2 have depth 0, nodes 1, 3, and 8 have depth 1, node 10 has depth 2, etc.
Symbol | Meaning |
---|---|
$\epsilon$ | The empty string |
$r^\sharp$ | The suffix indicator string |
$\mathcal{V}$ | The vocabulary |
<unk> |
The unknown token |
$,$, $.$ , $?$ , $!$ , $:$ , $( )$ |
Punctuation characters |
$s$ | A string |
$c$ | A character |
$\text{whitespace}$ | A whitespace character |
$r, r^\sharp$ | The trie root and the node for $r^\sharp$ |
$u$ | Trie nodes; $r$ is often the parent of $u$ |
$\emptyset$ | Null node |
$(u, c)$ | Trie edge from node $u$, with label $c$ |
$\chi_u$ | The string represented by node $u$ |
$(f(u), F(u))$ | Failure link and failure pops |
$n$ | The length of the input |
$m$ | The maximum length of tokens in $\mathcal{V}$ |
$M$ | The sum of the lengths of tokens in $\mathcal{V}$ |
Table 1: Notations.
4 The construction of the vocabulary is outside the scope of this paper. We refer the interested reader to Google (2020).
To motivate our linear algorithm, let’s first consider an alternative approach to MaxMatch using a simple vocabulary trie: when searching the longest token at a position, it starts from the shortest substring and iterates over the input text from left to right, following trie matching to find the longest prefixes that match a vocabulary token.
Example 1: Consider the vocabulary and the trie from Figure 1a, with the input string “abcdz”. The expected output is \([ \text{a}, \text{##b}, \text{##c}, \text{##dz} ]\).
Starting from position 0, we follow the trie edges to match the input characters from ‘a’ to ‘d’, arriving at node 6. No trie edge exits node 6 with character ‘z’ as the label. The longest matching prefix seen so far is ‘a’, which is the first recognized token. The challenge of this approach is that, when the trie fails to match the next character, the longest vocabulary token match may be several characters back. As shown in Example 1, from position 0 we’ve matched the prefix “abcd” but found that the longest matching token is ‘a’. When looking for the next token, we reset the start position at character ‘b’ and reprocess “bcd..”, resulting in repetitive and wasteful iterations. The time complexity is \(O(nm)\). The idea of LinMaxMatch is to use precomputed information to avoid reprocessing the characters.
Figure 1: Example vocabulary, the corresponding trie, and the table of auxiliary links and data. The suffix indicator is ##. Node 0 is the root node. Data nodes (in grey) indicate vocabulary tokens, i.e., the represented string is in 𝑉 .
Example 2: For the same example as above, when the trie matching fails at character ‘z’, since “abcd” has been matched, given the vocabulary in use (Figure 1a), we should be able to know that the first two longest-matching tokens are \([ \text{a}, \text{##b} ]\). After collecting the tokens, we should reset our state as if we just matched \(\text{##cd}\) and then continue to match the same character ‘z’. No need to reprocess “bcd”.
Specifically, when trie matching arrives at node \(v\) but cannot continue further, it must have matched the string represented by \(v\) (i.e., \(\chi_v\)). We consider the tokens that MaxMatch would generate for the beginning of \(\chi_v\) (called “failure pops” \(F(v)\)), which should be popped off the beginning of \(\chi_v\) and put into the result. After that, we should transit to a state (following the “failure link” \(f(v)\)) that corresponds to the remaining suffix of \(\chi_v\), from which the algorithm continues to match the next character. \(F(v)\) and \(f(v)\) are defined as below and can be precomputed based on the vocabulary.
Definition 1: Failure links and pops. Given a node \(v\) and the corresponding string \(\chi_v\), consider the shortest non-empty list of longest-matching \(\neq \epsilon\) prefix tokens \([p_1, p_2, ..., p_k]\) (where \(p_i \in V\), \(p_i\) or ♯, for \(1 \leq i \leq k\)) that we can remove from \(\chi_v\) (in order) until the remaining suffix can be represented by some node \(v'\) from the trie.
We define failure pops for node \(v\) as \(F(v) = [p_1, p_2, ..., p_k]\) and failure link as \(f(v) = v'\).
If such a non-empty list \([p_1, p_2, ..., p_k]\) does not exist, we define \(f(v) = \emptyset\). \(F(v)\) is undefined and unused in this case.
Put it another way, \(F(v)\) and \(f(v)\) are defined by finding the longest prefix of the string \(\chi_v\) that matches a vocabulary token, popping it, and repeating this procedure until the suffix string is found on the trie. Figure 1b shows \(F(v)\) and \(f(v)\) computed for the example vocabulary and trie.
For readers with the background of finite-state transducers (FSTs) (Mohri, 1997), it’s helpful to see that \(f(v)\) is related to the state transition function and \(F(v)\) is related to the output function (more discussions in Section 3.6).
Assume that, based on the vocabulary, we have precomputed the trie, failure links, and failure pops (precomputation is discussed in Section 3.4). Given an input string, we follow the trie edges to process the input characters one by one. When trie matching cannot continue from node \(v\), we make a failure transition in two steps: (1) retrieve failure pops \(F(v)\) and append to the end of the tokenization result, and (2) follow the failure link to node \(f(v)\). After that, we continue from the new node \(f(v)\).
Algorithm 1: LinMaxMatch Tokenization
Function LINMAXMATCH($s$): \(\begin{array}{ll} 1 & \text{tokens, } u, i \leftarrow \text{MATCHLOOP}(s, 0) \\ 2 & \text{if } i < n \text{ or } \chi_u \in \{\epsilon, r^\sharp\} \text{ then} \\ 3 & \quad \text{tokens} \leftarrow [<unk>] \\ 4 & \text{elseif } i = n \text{ and tokens = 0 then} \\ 5 & \quad \text{tokens} \leftarrow \text{ORIGINALWORDPIECE}(s) \\ 6 & \text{return tokens} \\ \end{array}\)
Function MATCHLOOP($s$, $i$): \(\begin{array}{ll} 7 & u, \text{tokens} \leftarrow r, [ ] \\ 8 & \text{while } i < n \text{ do} \\ 9 & \quad \text{while } (u, s[i]) = \emptyset \text{ do} \\ 10 & \quad \quad \text{if } f(u) = \emptyset \text{ then return tokens, } u, i \\ 11 & \quad \quad \text{tokens} \leftarrow \text{EXTEND(tokens, } F(u)) \\ 12 & \quad \quad u \leftarrow f(u) \\ 13 & \quad (u, s[i]) \\ 14 & \quad i \leftarrow i + 1 \\ 15 & \text{return tokens, } u, i \\ \end{array}\)
For now, ignore lines 4-5; we explain it later.
The main function calls MATCHLOOP() with two inputs: \(w\) appended by a whitespace and the start position 0 (line 1). Inside that function, let’s use the term step to denote an iteration of the loop on lines 8-14, which processes one input character \(s[i]\). Each step starts from the current node \(u\) and follows \(f(u)\) zero, one, or multiple times (line 12), appending the tokens in \(F(u)\) to the result along the way (line 11), until it finds a trie edge that matches the current character (line 9) or \(f(u) = \emptyset\) (line 10). If the input string \(w\) can be tokenized, the loop continues until \(i = \\|s\\| - 1\) pointing to the final appended whitespace. We know that \(\delta(u, \ ) = \emptyset\) for any \(u\) (since whitespace is not in any vocabulary token). MATCHLOOP() will keep following \(f(u)\) while collecting \(F(u)\) tokens along the way (lines 11-12) until it arrives at \(r♯\), where \(f(r♯) = \emptyset\). MATCHLOOP() returns on line 10 with \(u = r♯\), \(i = \\|s\\| - 1 = \\|w\\|\), and tokens being the expected result (see Example 3). If \(w = \epsilon\), MATCHLOOP() returns immediately with \(u = r\), \(i = 0 = \\|w\\|\), and empty tokens. In either case, the tokens are returned by the main function (line 6).
On the other hand, if the word cannot be tokenized, when MATCHLOOP() returns on line 10, there are two cases:
Example 3: Consider \(s = w = \text{abcdz}\), using the vocabulary from Figure 1a. The expected tokenization is \([ \text{a}, \text{##b}, \text{##c}, \text{##dz}]\).
Table 2: Sequence of node transitions and result tokens.
Table 2 shows the sequence of node transitions and result tokens in MATHLOOP(). The first row is the original state. Steps 1-4 are self-explanatory.
Step 5 is more complex: when we reach step 5, the prefix “abcd” has already been processed. The current node is node 6, and the next character is ‘z’. As \(\delta(6, \text{z}) = \emptyset\), we copy \(F(6)\) to the result (which becomes \([ \text{a}, \text{##b}]\)) and follow \(f(6)\) to node 10. Next, as \(\delta(10, \text{z}) = \emptyset\), we copy \(F(10)\) to the result (which becomes \([ \text{a}, \text{##b}, \text{##}]\)) and follow \(f(10)\) to node 12. Now, as \(\delta(12, \text{z}) = 13\), we follow the trie edge to node 13 and proceed to step 6.
Step 6 processes: We first follow \(f(13)\) to node 2, appending \(\text{##dz}\) to the result tokens. Then, at node 2 (i.e., \(u = 2 = r♯\)), \(\delta(u, \ ) = \emptyset\) and \(f(u) = \emptyset\). MATCHLOOP() returns on line 10.
Back to the main function (line 2), since \(i = 5 = \\|w\\|\) (meaning that MATCHLOOP() stopped at the final whitespace) and \(u = r♯\) (meaning that all matched characters “abcd” are covered by the result tokens), the word is successfully tokenized. It returns \([ \text{a}, \text{##b}, \text{##c}, \text{##dz}]\) as expected. ◊
Example 4: Consider two input words \(s_1 = w_1 = \text{abcz}\), \(s_2 = w_2 = \text{abcd}\). Using the same vocabulary, neither \(w_1\) nor \(w_2\) can be tokenized.
For \(s_1\), MATCHLOOP() consumes “abc” but not ‘z’. Hence it stops within the word: \(i = 3 < \\|w_1\\|\). For \(s_2\), MATCHLOOP() consumes all normal characters “abcd” but not the whitespace. When it returns on line 10, \(i = \\|w_2\\|\), \(u\) is node 12 (since \(f(12) = \emptyset\)), and the result tokens are \([ \text{a}, \text{##b}, \text{##c}]\), which do not cover character ‘d’. Actually, the string \(\text{##d}\) represented by node 12 cannot be tokenized.
Tokens are reset to \([<unk>]\) in both cases.
Corner cases: One behavior of the original WordPiece algorithm (Google, 2018) is that, if the input starts with the suffix indicator, the first result token may start with the suffix indicator. For example, in Figure 1, if the input is ##bc
, the tokenization result is \([ \text{##b}, \text{##c} ]\)
. In this paper, by having \(r♯\) as a descendant of \(r\), LinMaxMatch follows the same behavior and returns the same result.
Because \(r♯\) is set as a descendant of \(r\), if the input \(w\) is ♯ itself (e.g., ##
), normally Algorithm 1 would have returned an empty list of tokens, which is inconsistent with Google (2018). We handle this as a special case. Line 4 checks whether \(w\) is ♯ by the following (instead of directly comparing the strings): if and only if \(w = ♯\), the landing node \(u\) is \(r♯\) and the result tokens are empty after consuming all normal input characters (i.e., \(i = \\|w\\|\)). If so, the tokens are reset by the precomputed result of the original WordPiece algorithm on ♯ (line 5).
Algorithm 1 can be proved to be consistent with the original WordPiece algorithm (Google, 2018).
Given a vocabulary, it is straightforward to build the trie. This section explains how to precompute failure links \(f (\cdot)\) and failure pops \(F (\cdot)\).
We could compute \(f (\cdot)\) and \(F (\cdot)\) by directly using the procedure from Definition 1. Instead, we propose a faster algorithm (see Section 3.5 for complexity). Our algorithm computes \(f(v)\) and \(F(v)\) by leveraging \(f (u)\) and \(F (u)\) from the parent node \(u\). Suppose \(\delta(u, c) = v\). Intuitively, as the string \(\chi_u\) of parent \(u\) is a prefix of the string \(\chi_v\) of node \(v\), it is likely that \(F (u)\) and \(F(v)\) share some common longest-matching-prefixes in the beginning. It can be proved that when \(\chi_v \notin V\), \(F(v)\) consists of (1) the tokens from \(F (u)\), followed by (2) the longest-matching-prefixes that the procedure from Definition 1 generates for the string \(\chi_{f (u)}c\). Otherwise, when \(\chi_v \in V\), it’s trivial that \(F(v) = [\chi_v]\) based on Definition 1. Notice that \(f(v)\) and \(F(v)\) are computed using similar information for nodes that have strictly smaller depth than \(v\). Breadth-First-Search (BFS) is suitable for the computation. Algorithm 2 is the precomputation algorithm. On line 1, the algorithm builds a trie for \(V\) and keeps track of \(r\) and \(r♯\). These nodes have depth 0 and are the starting points for our BFS traversal (line 2). We assume that initially \(f(v) = \emptyset\) and \(F(v) = [ ]\) for every node \(v\). The core part is in lines 7-15, which computes \(f(v)\) and \(F(v)\) as discussed earlier.
Note that \(i = \\|w\\|\) is satisfied implicitly on line 4 (Algorithm 1) since it’s an else statement following the if statement on line 2.
The rest of the algorithm handles technical details. E.g., if ♯ is the empty string, the nodes \(r\) and \(r♯\) are identical; accordingly, line 2 avoids duplicate nodes. Otherwise, \(r♯\) is a descendant of \(r\), and we need line 6 to avoid revisiting it in the BFS traversal.
It can be proved that Algorithm 2 correctly precomputes \(f(v)\), \(F(v)\) for each trie node \(v\).
Algorithm 2: Precomputation Function PRECOMPUTE
\[\begin{array}{ll} \text{Input:} & \text{Vocabulary } V \\ \text{Output:} & \text{Failure links } f(\cdot) \text{ and failure pops } F(\cdot) \\ \text{1.} & \text{Build a trie for } V \text{, keep track of root } r \text{ and node } r♯ \\ \text{2.} & \text{Initialize } f(v) = \emptyset \text{ and } F(v) = [ ] \text{ for every node } v \\ \text{3.} & \text{function BFS(root)} \\ \text{4.} & \quad \text{Create a queue, push } r \text{ and } r♯ \text{ into the queue} \\ \text{5.} & \quad \text{while queue is not empty do} \\ \text{6.} & \quad \quad u = \text{queue.pop()} \\ \text{7.} & \quad \quad \text{for each child } v \text{ of } u \text{ do} \\ \text{8.} & \quad \quad \quad \delta(u, c) = v \\ \text{9.} & \quad \quad \quad \text{if } \chi_v \notin V \text{ then} \\ \text{10.} & \quad \quad \quad \quad F(v) \leftarrow F(u) \text{ followed by the longest-matching-prefixes for } \chi_{f(u)}c \\ \text{11.} & \quad \quad \quad \quad f(v) \leftarrow \text{node representing suffix of } \chi_{f(u)}c \\ \text{12.} & \quad \quad \quad \text{else} \\ \text{13.} & \quad \quad \quad \quad F(v) \leftarrow [\chi_v] \\ \text{14.} & \quad \quad \quad \text{queue.push}(v) \\ \text{15.} & \quad \text{end while} \\ \text{16.} & \text{end function} \\ \text{17.} & \text{BFS}(r) \\ \end{array}\]Given a vocabulary, it is straightforward to build the trie. This section explains how to precompute failure links \(f (\cdot)\) and failure pops \(F (\cdot)\).
We could compute \(f (\cdot)\) and \(F (\cdot)\) by directly using the procedure from Definition 1. Instead, we propose a faster algorithm (see Section 3.5 for complexity). Our algorithm computes \(f(v)\) and \(F(v)\) by leveraging \(f (u)\) and \(F (u)\) from the parent node \(u\). Suppose \(\delta(u, c) = v\). Intuitively, as the string \(\chi_u\) of parent \(u\) is a prefix of the string \(\chi_v\) of node \(v\), it is likely that \(F (u)\) and \(F(v)\) share some common longest-matching-prefixes in the beginning. It can be proved that when \(\chi_v \notin V\), \(F(v)\) consists of (1) the tokens from \(F (u)\), followed by (2) the longest-matching-prefixes that the procedure from Definition 1 generates for the string \(\chi_{f (u)}c\). Otherwise, when \(\chi_v \in V\), it’s trivial that \(F(v) = [\chi_v]\) based on Definition 1. Notice that \(f(v)\) and \(F(v)\) are computed using similar information for nodes that have strictly smaller depth than \(v\). Breadth-First-Search (BFS) is suitable for the computation. Algorithm 2 is the precomputation algorithm. On line 1, the algorithm builds a trie for \(V\) and keeps track of \(r\) and \(r♯\). These nodes have depth 0 and are the starting points for our BFS traversal (line 2). We assume that initially \(f(v) = \emptyset\) and \(F(v) = [ ]\) for every node \(v\). The core part is in lines 7-15, which computes \(f(v)\) and \(F(v)\) as discussed earlier.
Note that \(i = \\|w\\|\) is satisfied implicitly on line 4 (Algorithm 1) since it’s an else statement following the if statement on line 2.
The rest of the algorithm handles technical details. E.g., if ♯ is the empty string, the nodes \(r\) and \(r♯\) are identical; accordingly, line 2 avoids duplicate nodes. Otherwise, \(r♯\) is a descendant of \(r\), and we need line 6 to avoid revisiting it in the BFS traversal.
It can be proved that Algorithm 2 correctly precomputes \(f(v)\), \(F(v)\) for each trie node \(v\).
The complexity of tokenization (Algorithm 1) can be proved to be $𝑂(𝑛)$ in a similar way as AhoCorasick (Aho and Corasick, 1975). In brief, each step (an iteration of the loop from lines 8-13) makes zero or more failure transitions followed by exactly one normal (non-failure) transition. In each step, suppose we start at node $𝑢$ with depth $𝑑$. We never follow more than $𝑑$ failure transitions in that step: each failure transition takes us to a node with a strictly smaller depth. Any normal transition along trie edges increments the depth $𝑑$ of node $𝑢$ by 1 (line 13). Therefore, the total number of failure transitions is no more than the total number of normal transitions, which is $𝑂(𝑛)$. Each transition is $𝑂(1)$ plus the work to extend the list of tokens on line 11. As there are at most $𝑛$ resulting tokens in total, the total tokenization time is $𝑂(𝑛)$.
Since at least $𝑛$ operations are required to read the entire input, our $𝑂(𝑛)$ algorithm is asymptotically optimal. To the best of our knowledge, this is the first time that the optimal complexity for MaxMatch is proved to be strictly $𝑂(𝑛)$, without a vocabulary-specific multiplicative factor.
For precomputation (Algorithm 2), the BFS traversal itself is $𝑂(𝑀)$, where $𝑀$ is the sum of the lengths of vocabulary tokens. A similar depth-based analysis (as in the case of the tokenization algorithm) shows that that the total number of times we traverse a failure link on line 13 is $𝑂(𝑀)$.
The non-trivial parts are the construction of $𝐹 (⋅)$ on lines 12 and 15. The total size of $𝐹 (⋅)$ is $𝑂(𝑀𝑚)$: there are $𝑂(𝑀)$ lists, and the size of each list is $𝑂(𝑚)$. A straightforward implementation needs $𝑂(𝑀𝑚)$ time and space to construct and store $𝐹 (⋅)$. This is good enough in practice, as the precomputation is performed offline before any tokenization process. We plan to discuss optimized implementations in a follow-up publication.
LinMaxMatch can be turned into a finite-state transducer (FST) (Mohri, 1997) by eliminating the failure transitions in Algorithm 1. An FST extends a finite-state automaton (FSA) with an output tape. To turn LinMaxMatch into an FST, for node $𝑢$ and character $𝑐$, we define the state transition function $𝛿′(𝑢, 𝑐)$ and the output function $𝜎′(𝑢, 𝑐)$ as follows:
Specially, if the original trie link $𝛿(𝑢, 𝑐)$ exists, according to the above definition, it’s obvious that $𝛿′(𝑢, 𝑐) = 𝛿(𝑢, 𝑐)$ and $𝜎′(𝑢, 𝑐) = [ ]$. Then lines 9-13 in Algorithm 1 can be replaced with two statements: tokens $\leftarrow$ EXTEND(tokens, $𝜎′(𝑢, 𝑠[𝑖])$) and $𝑢 \leftarrow 𝛿′(𝑢, 𝑠[𝑖])$; the loop (started on line 8) breaks when $𝑢$ becomes $\emptyset$. Hence, LinMaxMatch makes exactly one state transition on each input character. Obviously, the time complexity is linear, despite more space needed to store precomputed results.
알고리즘 특징: 아호-코라식 알고리즘의 확장
LinMaxMatch
extends the Aho-Corasick Algorithm
(Aho and Corasick, 1975). It can be applied to more string search or transducer problems. Let us name a few here. LinMaxMatch can be adapted to solve the multi-keyword search problem which Aho-Corasick is designed for. It can be also adapted to address other MaxMatch variants, such as Backward MaxMatch (Webster and Kit, 1992), recognizing unseen characters as singlecharacter tokens (Palmer, 2000), or combing with transformation rules (Sassano, 2014). Other potential applications include word segmentation in Asian languages (Sassano, 2014), phonological or morphological analysis (Kaplan and Kay, 1994; Jurafsky and Martin, 2009).
The existing BERT tokenization implementations (Google, 2018) pre-tokenize the input text (splitting it into words by punctuation and whitespace characters) and then call WordPiece tokenization on each resulting word. For example, the text john johanson’s
may be split into [john, johan, ##son, ’, s]
.
We propose an end-to-end WordPiece tokenizer that combines pre-tokenization and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie matching and failure transition loop as much as possible and only checks for punctuation and whitespace characters among the relatively few input characters that are not handled by the loop. It is more efficient as it traverses the input only once, performs fewer punctuation / whitespace checks, and skips the creation of intermediate words.
We use the same process as in Section 3.4, with several differences:
After the trie is constructed, we remove all trie links labeled with a punctuation character. Then, for every possible punctuation character $𝑐$, we add a trie data node $𝑣$ with no descendants, and a trie link from the root $𝑟$ to $𝑣$ with label $𝑐$. If $𝑐$ is part of the vocabulary, we set $𝜒_𝑣 = 𝑐$, otherwise $𝜒_𝑣 = <\text{unk}>$. The resulting trie matches all punctuation characters, as either themselves or as <unk>
, depending on the vocabulary. Punctuation characters are not part of longer tokens, and there is no suffix token for a punctuation character. This reflects the fact that each punctuation character is a word by itself. We then run the rest of Algorithm 2 to compute the failure pops and failure links.
6 This is analogical to Aho and Corasick (1975) where the Aho-Corasick algorithm can be stated as a deterministic finite-state automaton.
7 This may remove links on the path from $𝑟$ to $𝑟^{\sharp}$ when the suffix indicator contains a punctuation; those links were unnecessary: $𝑟^{\sharp}$ is reached only by following failure links.
Finally, for punctuation nodes, we set their failure links to a special node $𝑟_𝑝$; their failure pops are not changed. The special node $𝑟_𝑝$ has no parent and no descendants, and $𝜒_{𝑟_𝑝} = 𝜀$, $𝑓 (𝑟_𝑝) = ∅$. Node $𝑟_𝑝$ indicates that a punctuation character was matched.
Algorithm 3: End-to-End Tokenization
\[\begin{array}{ll} \text{Function} \text{E2EWORDPIECE}(text): \\ 1 & \quad \text{result}, u, i, n \leftarrow [ ], r, 0, \text{len(text)} \\ 2 & \quad \text{while } i < n \text{ do} \\ 3 & \quad \quad \text{tokens}, u, i \leftarrow \text{MATCHLOOP}(text, i) \\ 4 & \quad \quad \text{if not ISWDBNDRY}(u, text[i]) \text{ or } \chi_u \in \{\epsilon, r^\sharp, \emptyset\} \text{ then} \\ 5 & \quad \quad \quad \text{tokens} \leftarrow [<unk>] \\ 6 & \quad \quad \text{else if } i = n \text{ and tokens = 0 then} \\ 7 & \quad \quad \quad \text{tokens} \leftarrow \text{ORIGINALWORDPIECE}(text) \\ 8 & \quad \quad \text{EXTEND(result, tokens)} \\ 9 & \quad \quad \text{while } i < n \text{ and not ISWDBNDRY}(u, text[i]) \text{ do} \\ 10 & \quad \quad \quad i \leftarrow i + 1 \\ 11 & \quad \text{while } i < n \text{ and ISSPACE}(text[i]) \text{ do} \\ 12 & \quad \quad \quad i \leftarrow i + 1 \\ 13 & \quad \text{return result} \\ \text{Function} \text{ISWDBNDRY}(u, c): \\ 1 & \quad \text{return } c = \emptyset \text{ or } (i > 0 \text{ and ISPUNC}(text[i-1])) \text{ or ISSPACE}(text[i]) \text{ or ISPUNC}(text[i]) \\ \end{array}\]Tokenization Algorithm 3 tokenizes general text into wordpieces. It starts by appending a whitespace at the end of the input (line 1). In each iteration, it recognizes wordpieces for the current word by employing (almost) the same routine as in single-word tokenization (lines 3-7 in Algorithm 3 versus lines 1-5 in Algorithm 1).
When returning from MATCHLOOP()
, Algorithm 3 must have met a character that cannot be consumed after attempting failure transitions, such as a whitespace, a punctuation, or some unseen character. Lines 4-5 examine whether the current word can be tokenized (by checking if the current position is at a word boundary and where the node $𝑢$ lands at) and reset the tokens as appropriate (see related discussions in Section 3.3).
Lines 6-7 further handle the corner case that the word happens to be the suffix indicator itself (in the same way as Algorithm 1, see Section 3.3). Note that normally the suffix indicator contains only punctuation characters (e.g., ## in BERT); in that case lines 6-7 can be saved, because the suffix indicator itself is not tokenized as a single word. The tokens of the current word are then appended to the result (line 8). Finally, the algorithm moves the cursor past the boundary of the current word (lines 9-10) and skips any following whitespaces (lines 11-12) to process the next word.
It can be shown that Algorithm 3 is consistent with Google (2018) for general text tokenization, and the time complexity is $𝑂(𝑛)$.
Experimental Setup We benchmark our method against two widely-adopted WordPiece tokenization implementations:
Algorithm 3: End-to-End Tokenization
TensorFlow Text (Google, 2020), the official library of text utilities for TensorFlow. In both cases, we use pre-tokenization and WordPiece tokenization, and skip other steps provided by those libraries (text cleanup, normalization, etc) for fair comparison. Both libraries use the original WordPiece tokenization algorithm (Google, 2018). They both generate not only the numeric ids of the tokens, but also the token strings and start/end offsets of the input word. We modify both libraries to generate only the token ids,9 for two reasons:
We implement LinMaxMatch and E2E WordPiece and made them return the numeric ids of the tokens, leveraging a double array-based trie library (Yata et al., 2007).
We compare our algorithms with HuggingFace Tokenizers (HuggingFace, 2020), from the HuggingFace Transformer
8 The common routine can be factored out as a function.
9 The original TensorFlow Text first generates the token strings and next looks them up in a dictionary to generate token ids. For a fair comparison, we adapt it to directly return the token ids, with no intermediate token strings.
All experiments are conducted on a Linux desktop with a six-core Intel Xeon @ 3.60GHz CPU and 64GB memory. We iterate each benchmark (after warming up) until it ran for a long-enough period of time, repeat each experiment 10 times, and report the average results. Our method is implemented and benchmarked in C++; so is TensorFlow Text. HuggingFace uses (and is benchmarked in) Rust. We use the WordPiece vocabulary released with the BERT-Base, Multilingual Cased model, a model that supports 104 languages (Google, 2018). To generate the test data, we sample 1,000 sentences from the multilingual Wikipedia dataset, covering 82 languages including English, Chinese, French, Russian, etc. On average, each word has 4 characters, and each sentence has 82 characters or 17 words. We found this dataset large enough: a much larger dataset (consisting of hundreds of thousands of sentences) generated similar results. We run BERT’s BasicTokenizer (Google, 2018) to clean up and normalize each sentence, including Unicode clean-up and normalization. Following the guidance for the BERT-Base Multilingual Cased model (Google, 2018), we do not instruct BasicTokenizer to do lower casing or accent stripping. In addition, preprocessing adds spaces around every CJK character, and thus Chinese is effectively character-tokenized. For simplicity, we keep Chinese in the test set, but keep in mind that each Chinese word is just one Chinese character, and any WordPiece implementation is efficient on such short words. Using a dataset with long words would emphasize the speed advantage of our algorithm even more than indicated below. For single-word tokenization, we further used BasicTokenizer to pre-tokenize each sentence on punctuation and whitespace characters. This results in 17,223 words, 8,508 of them unique.
Results Table 3 shows the mean and the 95 percentile10 running time when tokenizing a single word or general text (end-to-end) for each system. For single-word tokenization, ours is 3x faster on average; the speedup is greater for long-tail inputs. Regarding general text end-to-end tokenization, ours is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text on average. Figure 2 shows how the running time grows with respect to the input length for single-word tokenization.
10 When computing the 95 percentile, the running time on each individual input is approximated by the average running time of all input examples of the same length.
Table 3: The running time of each system in ns.
Figure 2: Average running time of each system with respect to the input length for single-word tokenization.
We proposed LinMaxMatch for single-word WordPiece tokenization, which is asymptoticallyoptimal linear-time with respect to the input length, without a vocabulary-specific multiplicative factor. We also proposed E2E WordPiece that combines pre-tokenization and WordPiece tokenziation into a single, linear-time pass for even higher efficiency. Experimental results show that our approach is 8.2x faster than HuggingFace and 5.1x faster than TensorFlow Text on average for general text tokenization. For future work, we will adapt the proposed methods to more text processing techniques.