
Training a Tokenizer for BERT Models


BERT is an early transformer-based model for NLP tasks that’s small and fast enough to train on a home computer. Like all transformer-based language models, it needs a tokenizer to convert text into integer tokens. This article shows how to train a WordPiece tokenizer following BERT’s original design.

Let’s get started.


Overview

This article is divided into two parts; they are:

  • Picking a Dataset
  • Training a Tokenizer

Picking a Dataset

To keep things simple, we’ll use English text only. WikiText is a popular preprocessed dataset for experiments, available through the Hugging Face datasets library:
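
As a minimal sketch, you can load the training split and inspect a few lines as shown below; the wikitext-2-raw-v1 configuration name and the number of samples printed are illustrative choices:

    from datasets import load_dataset

    # Load the raw WikiText-2 training split; it is downloaded and cached on the first run
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    print(dataset)

    # Print a few non-empty lines to see what the raw text looks like
    for text in dataset["text"][:20]:
        if text.strip():
            print(repr(text))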

On the first run, the dataset is downloaded to ~/.cache/huggingface/datasets and cached for future use. WikiText-2, used above, is a smaller dataset suitable for quick experiments, while WikiText-103 is larger and more representative of real-world text, which makes for a better model.

If you print a few samples, you will see that the dataset contains strings of varying lengths, with spaces around punctuation marks. While you could split the text on whitespace, that would not capture sub-word components, which is what the WordPiece tokenization algorithm is good at.
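
To see the limitation of plain whitespace splitting, consider a made-up line that mimics the dataset’s formatting: punctuation is already separated by spaces, but every word stays as a single, opaque token.

    # Whitespace splitting works because punctuation is already space-delimited,
    # but rare words are never broken into smaller, reusable pieces
    line = "The prize was awarded in 1879 , unsurprisingly ."
    print(line.split())
    # ['The', 'prize', 'was', 'awarded', 'in', '1879', ',', 'unsurprisingly', '.']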

Training a Tokenizer

Several tokenization algorithms support sub-word components. BERT uses WordPiece, while modern LLMs often use Byte-Pair Encoding (BPE). We’ll train a WordPiece tokenizer following BERT’s original design.

The tokenizers library implements multiple tokenization algorithms that can be configured to your needs, saving you the hassle of implementing a tokenization algorithm from scratch. You can install it with pip:
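
    pip install tokenizers datasets

The datasets package used for loading WikiText is included in the same command.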

Let’s train a tokenizer:
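
Here is a minimal sketch that matches the configuration described below; the wikitext-103-raw-v1 configuration name, the text_iterator helper, and the output file bert_wordpiece.json are illustrative choices:

    from datasets import load_dataset
    from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

    # Load the raw WikiText-103 training split
    dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

    # A WordPiece model with BERT-style components
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.NFKC()
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    tokenizer.decoder = decoders.WordPiece(prefix="##")

    # Train a 30,522-token vocabulary with BERT's special tokens
    trainer = trainers.WordPieceTrainer(
        vocab_size=30522,
        special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]"],
    )

    def text_iterator(batch_size=1000):
        # Yield batches of raw text lines to avoid holding everything in memory at once
        for i in range(0, len(dataset), batch_size):
            yield dataset[i : i + batch_size]["text"]

    tokenizer.train_from_iterator(text_iterator(), trainer=trainer, length=len(dataset))

    # Enable padding with the [PAD] token for later batch processing
    tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

    # Save the trained tokenizer to a single JSON file
    tokenizer.save("bert_wordpiece.json")

    # Quick check on a short string
    print(tokenizer.encode("Hello, world!").tokens)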

This code uses the WikiText-103 dataset; the first run downloads 157MB of data containing 1.8 million lines, and training takes a few seconds. The example shows how "Hello, world!" becomes 5 tokens, with “Hello” split into “Hell” and “##o” (the “##” prefix indicates a sub-word component).

The tokenizer created in the code above has the following properties:

  • Vocabulary size: 30,522 tokens (matching the original BERT model)
  • Special tokens: [PAD], [CLS], [SEP], [MASK], and [UNK] are added to the vocabulary even though they are not in the dataset.
  • Pre-tokenizer: Whitespace splitting (since the dataset has spaces around punctuation)
  • Normalizer: NFKC normalization for Unicode text. Note that you can also configure the tokenizer to convert everything into lowercase, as the common BERT-uncased model does.
  • Algorithm: WordPiece. The decoder must be set to the matching WordPiece decoder so that the “##” prefix on sub-word components is merged correctly when decoding.
  • Padding: Enabled with the [PAD] token for batch processing. This is not demonstrated in the training sketch above, but it will be useful when you are training a BERT model; see the batch-encoding sketch after this list.
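
As a quick illustration of the padding behavior, assuming the tokenizer trained above (the two sentences are arbitrary examples):

    # Encode a batch: with padding enabled, shorter sequences are padded to the longest
    batch = tokenizer.encode_batch(["Hello, world!", "A longer sentence to show padding ."])
    for enc in batch:
        print(enc.tokens)          # the shorter sequence ends with [PAD] tokens
        print(enc.attention_mask)  # 1 for real tokens, 0 for padding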

The tokenizer is saved to a fairly large JSON file containing the full vocabulary, allowing you to reload it later without retraining.
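
Reloading is a one-liner, assuming the illustrative file name from the sketch above:

    from tokenizers import Tokenizer

    # Restore the trained tokenizer from its JSON file; no retraining needed
    tokenizer = Tokenizer.from_file("bert_wordpiece.json")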

To convert a string into a list of tokens, use tokenizer.encode(text).tokens, where each token is a string. For use in a model, use tokenizer.encode(text).ids instead, which returns a list of integers. The decode method converts a list of integer ids back into a string.
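
For example, a short round-trip (the printed tokens and ids depend on the trained vocabulary, so the comments are only indicative):

    enc = tokenizer.encode("Hello, world!")
    print(enc.tokens)                 # e.g., ['Hell', '##o', ',', 'world', '!']
    print(enc.ids)                    # the corresponding integer ids from the vocabulary
    print(tokenizer.decode(enc.ids))  # back to a string, with "##" pieces merged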


This article demonstrated how to train a WordPiece tokenizer for BERT using the WikiText dataset. You learned to configure the tokenizer with appropriate normalization and special tokens, and how to encode text to tokens and decode back to strings. This is just a starting point for tokenizer training. Consider leveraging existing libraries and tools to optimize tokenizer training speed so it doesn’t become a bottleneck in your training process.


