BERT is a transformer-based model for NLP that was released by Google in 2018. It has proven useful for a wide range of NLP tasks. In this article, we will look at the architecture of BERT and how it is trained. Then, you will learn about some of the variants that were released later.
Let’s get started.
BERT Models and Its Variants
Photo by Nastya Dulhiier. Some rights reserved.
Overview
This article is divided into two parts; they are:
- Architecture and Training of BERT
- Variations of BERT
Architecture and Training of BERT
BERT is an encoder-only model. Its architecture is shown in the figure below.
The BERT architecture
While BERT uses a stack of transformer blocks, its key innovation is in how it is trained.
According to the original paper, the training objective is to predict masked words in the input sequence. This is the masked language model (MLM) task. The input to the model is a sequence of tokens in the format:
[CLS] A [SEP] B [SEP]
where A and B are token sequences from two different sentences. The special tokens [CLS] and [SEP] separate them. The [CLS] token serves as a placeholder at the beginning, and it is where the model learns the representation of the entire sequence.
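To make the input format concrete, below is a minimal sketch using the Hugging Face transformers library (an assumption for illustration; the article does not tie itself to any particular toolkit). It shows how a sentence pair is wrapped with [CLS] and [SEP] and how segment IDs mark sentences A and B.

```python
# A minimal sketch, assuming the Hugging Face "transformers" library is available.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The cat sat on the mat.", "It fell asleep soon after.")

# The sequence starts with [CLS], and [SEP] closes each sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Segment IDs: 0 for tokens of sentence A, 1 for tokens of sentence B.
print(encoded["token_type_ids"])
```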
Unlike common LLMs, BERT is not a causal model. It can see the entire sequence, and the output at any position depends on both left and right context. This makes BERT suitable for NLP tasks such as part-of-speech tagging. The model is trained by minimizing the loss metric:
$$\text{loss} = \text{loss}_{\text{MLM}} + \text{loss}_{\text{NSP}}$$
The first term is the loss for the masked language model (MLM) task and the second term is the loss for the next sentence prediction (NSP) task. In particular,
- MLM task: Any token in A or B can be masked, and the model is supposed to identify the masked positions and predict the original tokens. A selected token can be handled in one of three ways:
  - The token is replaced with the [MASK] token. The model should recognize this special token and predict the original token.
  - The token is replaced with a random token from the vocabulary. The model should identify this replacement.
  - The token is left unchanged, and the model should predict that it is unchanged.
- NSP task: The model is supposed to predict whether B is the actual next sentence that comes after A. This means both sentences are from the same document and are adjacent to each other. This is a binary classification task, predicted using the [CLS] token at the beginning of the sequence.
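To see the two objectives together, here is a minimal sketch using the Hugging Face transformers library (again an assumption, not something the article prescribes). The BertForPreTraining class exposes both the MLM head and the NSP head and returns the summed loss when labels are provided; the sentences and the masked position below are toy choices.

```python
# A hedged sketch of the combined MLM + NSP loss with Hugging Face transformers.
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", "It fell asleep soon after.",
                   return_tensors="pt")

# Mask one token by hand: remember its original id, then replace it with [MASK].
mlm_labels = torch.full_like(inputs["input_ids"], -100)  # -100 = ignore this position
mlm_labels[0, 4] = inputs["input_ids"][0, 4]
inputs["input_ids"][0, 4] = tokenizer.mask_token_id

nsp_label = torch.tensor([0])  # 0 = sentence B really follows sentence A

outputs = model(**inputs, labels=mlm_labels, next_sentence_label=nsp_label)
print(outputs.loss)                           # loss_MLM + loss_NSP
print(outputs.prediction_logits.shape)        # (batch, seq_len, vocab_size)
print(outputs.seq_relationship_logits.shape)  # (batch, 2)
```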
Hence the training data contains not only the text but also additional labels. Each training sample contains:
- A sequence of masked tokens: [CLS] A [SEP] B [SEP], with some tokens replaced according to the rules above
- Segment labels (0 or 1) to distinguish between the first and second sentences
- A boolean label indicating whether B actually follows A in the original document
- A list of masked positions and their corresponding original tokens
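For illustration, here is a rough sketch of how one such training sample could be assembled. The 15% selection rate and the 80/10/10 split among the three cases follow the proportions reported in the original BERT paper; the function and variable names are made up for this example.

```python
# An illustrative sketch of building one pre-training sample; not BERT's actual code.
import random

def make_training_sample(tokens_a, tokens_b, is_next, vocab, mask_token="[MASK]"):
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

    masked_positions, masked_labels = [], []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() > 0.15:
            continue                          # only ~15% of ordinary tokens are selected
        masked_positions.append(i)
        masked_labels.append(tok)             # the model must predict the original token
        r = random.random()
        if r < 0.8:
            tokens[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[i] = random.choice(vocab)  # 10%: replace with a random token
        # else: 10% of the time the token is left unchanged

    return {
        "tokens": tokens,                     # the masked sequence
        "segment_ids": segment_ids,           # 0 for sentence A, 1 for sentence B
        "is_next": is_next,                   # NSP label
        "masked_positions": masked_positions,
        "masked_labels": masked_labels,
    }
```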
This training approach teaches the model to analyze the entire sequence and understand each token in context. As a result, BERT excels at understanding text but is not trained for text generation. For example, BERT can extract relevant portions of text to answer a question, but cannot rewrite the answer in a different tone. This training with the MLM and NSP objectives is called pre-training, after which the model can be fine-tuned for specific applications.
BERT pre-training and fine-tuning. Figure from the BERT paper.
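As a brief illustration of the fine-tuning step, the sketch below attaches a classification head on top of the [CLS] representation using the Hugging Face transformers library (one possible toolkit, not mandated by the article); the texts and labels are toy placeholders.

```python
# A minimal fine-tuning sketch; dataset and labels are placeholders for illustration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)  # classification head sits on the [CLS] output
outputs.loss.backward()
optimizer.step()
```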
Variations of BERT
BERT consists of $L$ stacked transformer blocks. Key hyperparameters of the model include the size of hidden dimension $d$ and the number of attention heads $h$. The original base BERT model has $L = 12$, $d = 768$, and $h = 12$, while the large model has $L = 24$, $d = 1024$, and $h = 16$.
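For reference, these two configurations can be written down with transformers' BertConfig (a sketch; the argument names are the library's, and the feed-forward size is the 4d value used in the paper):

```python
# The base and large configurations as BertConfig objects (illustrative sketch).
from transformers import BertConfig, BertModel

base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4096)

# Instantiate randomly initialized models just to count parameters.
for name, cfg in [("base", base), ("large", large)]:
    n = sum(p.numel() for p in BertModel(cfg).parameters())
    print(f"{name}: {n/1e6:.0f}M parameters")
```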
Since BERT’s success, several variations have been developed. The simplest is RoBERTa, which keeps the same architecture but uses Byte-Pair Encoding (BPE) instead of WordPiece for tokenization. RoBERTa trains on a larger dataset with larger batch sizes and more epochs, and it uses only the MLM loss, dropping the NSP loss. This demonstrated that the original BERT model was under-trained: improved training strategies and more data can enhance performance without increasing model size.
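As a small illustration of the tokenizer difference (an aside the article does not walk through), the snippet below tokenizes the same word with both models; the exact subword splits depend on the pretrained vocabularies.

```python
# Comparing WordPiece (BERT) and byte-level BPE (RoBERTa) tokenization; a sketch.
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")

print(bert_tok.tokenize("snowboarding"))     # WordPiece: continuations carry a '##' prefix when split
print(roberta_tok.tokenize("snowboarding"))  # byte-level BPE: a different split, no '##' convention
```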
ALBERT is a smaller and faster variant of BERT that introduces two techniques to reduce model size. The first is factorized embedding: the embedding matrix transforms input token IDs into smaller embedding vectors, which a projection matrix then transforms into the larger final embedding vectors used by the transformer blocks. This can be understood as:
$$
M = \begin{bmatrix}
m_{11} & m_{12} & \cdots & m_{1N} \\
m_{21} & m_{22} & \cdots & m_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
m_{d1} & m_{d2} & \cdots & m_{dN}
\end{bmatrix}
= P M' = \begin{bmatrix}
p_{11} & p_{12} & \cdots & p_{1k} \\
p_{21} & p_{22} & \cdots & p_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
p_{d1} & p_{d2} & \cdots & p_{dk}
\end{bmatrix}
\begin{bmatrix}
m'_{11} & m'_{12} & \cdots & m'_{1N} \\
m'_{21} & m'_{22} & \cdots & m'_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
m'_{k1} & m'_{k2} & \cdots & m'_{kN}
\end{bmatrix}
$$
Here, $P$ is the projection matrix and $M'$ is the embedding matrix with the smaller dimension size $k$. When a token is input, the embedding matrix serves as a lookup table for the corresponding embedding vector. The model still operates on the larger dimension size $d > k$, but with the projection matrix, the total number of parameters is $dk + kN = k(d+N)$, where $N$ is the vocabulary size. This is drastically smaller than a full embedding matrix of size $dN$ when $k$ is sufficiently small.
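A back-of-the-envelope PyTorch sketch (illustrative, not ALBERT’s actual implementation) shows the saving; the vocabulary size and dimensions below are assumed values.

```python
# Factorized embedding: a small lookup table followed by a projection; a sketch.
import torch.nn as nn

vocab_size, d, k = 30000, 768, 128

full = nn.Embedding(vocab_size, d)           # d*N parameters
factorized = nn.Sequential(
    nn.Embedding(vocab_size, k),             # k*N parameters (the lookup table M')
    nn.Linear(k, d, bias=False),             # d*k parameters (the projection P)
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full))        # 30000 * 768  = 23,040,000
print(count(factorized))  # 30000 * 128 + 128 * 768 = 3,938,304
```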
The second technique is cross-layer parameter sharing. While BERT uses a stack of transformer blocks that are identical in design, ALBERT enforces that they are also identical in parameters. Essentially, the model processes the input sequence through the same transformer block $L$ times instead of through $L$ different blocks. This reduces model complexity while only slightly degrading performance.
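The idea can be sketched in a few lines of PyTorch (again illustrative rather than ALBERT’s real code): a single encoder block is applied $L$ times.

```python
# Cross-layer parameter sharing: one block's weights reused at every layer; a sketch.
import torch
import torch.nn as nn

d, h, L = 768, 12, 12
block = nn.TransformerEncoderLayer(d_model=d, nhead=h, batch_first=True)

class SharedEncoder(nn.Module):
    def __init__(self, block, num_layers):
        super().__init__()
        self.block = block              # a single set of parameters...
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.block(x)           # ...applied L times in sequence
        return x

shared = SharedEncoder(block, L)
y = shared(torch.randn(2, 16, d))       # (batch, seq_len, d) in, same shape out
print(y.shape)
```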
DistilBERT uses the same architecture as BERT but is trained through distillation. A larger teacher model is first trained to perform well, then a smaller student model is trained to mimic the teacher’s output. The DistilBERT paper claims the student model achieves 97% of the teacher’s performance with only 60% of the parameters.
In DistilBERT, the student and teacher models have the same dimension size and number of attention heads, but the student has half the number of transformer layers. The student is trained to match its layer outputs to the teacher’s layer outputs. The loss metric combines three components:
- Language modeling loss: The original MLM loss metric used in BERT
- Distillation loss: KL divergence between the student model’s and the teacher model’s softmax outputs
- Cosine distance loss: Cosine distance between the hidden states of every layer in the student model and every other layer in the teacher model
These multiple loss components provide additional guidance during distillation, resulting in better performance than training the student model independently.
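A compact PyTorch sketch of the three terms is shown below; the variable names, the temperature value, and the equal weighting of the terms are assumptions for illustration, not values taken from the DistilBERT paper.

```python
# An illustrative sketch of a distillation objective with three components.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, mlm_labels,
                      student_hidden, teacher_hidden, T=2.0):
    # 1. Language modeling loss: ordinary MLM cross-entropy on the student
    lm_loss = F.cross_entropy(student_logits.transpose(1, 2), mlm_labels,
                              ignore_index=-100)

    # 2. Distillation loss: KL divergence between softened output distributions
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * T * T

    # 3. Cosine distance between student and teacher hidden states
    cos_loss = 1 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1).mean()

    return lm_loss + kd_loss + cos_loss
```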
Further Reading
Below are some resources that you may find useful:
Summary
This article covered BERT’s architecture and training approach, including the MLM and NSP objectives. It also presented several important variations: RoBERTa (improved training), ALBERT (parameter reduction), and DistilBERT (knowledge distillation). These models offer different trade-offs between performance, size, and computational efficiency for various NLP applications.
