Summary - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv Link: https://arxiv.org/abs/1810.04805

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Highlights

  • Prior to BERT, language model pre-training techniques such as OpenAI GPT relied only on uni-directional LMs.
  • Bidirectionality helps in learning representations that transfer well to downstream tasks such as question answering.
  • A single additional output layer plus fine-tuning achieves state-of-the-art results on multiple downstream tasks.
  • No task-specific architecture modifications are needed.

Feature based vs Fine-tuning based pre-trained representations

  • Models such as ELMo use the pre-trained representations as additional features on top of a task-specific architecture.
  • GPT introduces only a minimal set of task-specific params and is trained on the downstream tasks while fine-tuning the pre-trained params.
  • Both approaches have so far leveraged only uni-directional LMs.

BERT objectives

MLM - Masked Language Model

  • Randomly mask some of the input tokens.
  • Predict the original vocab ID of the masked token based only on its context.
  • The usual left-to-right LM objective (next-word prediction) becomes trivial in a bidirectional setting, since each word could indirectly "see itself"; masking avoids this.
  • 15% of the input tokens are masked at random (of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged); see the sketch below.
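
A minimal sketch of the masking procedure described above (plain Python; the toy vocabulary and pre-tokenized input are assumptions for illustration, not the paper's implementation):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM masking: pick ~15% of positions, then
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = position is not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                # model must recover the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return masked, labels

toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], toy_vocab))
```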

NSP - Next Sentence Prediction

  • Jointly pre-train text representations with this objective.
  • Next sentence prediction with sentence pairs taken from a monolingual corpus.
  • 50% of the pairs are actual consecutive sentences (IsNext) and 50% have a randomly sampled second sentence (NotNext); see the sketch below.
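
A minimal sketch of constructing NSP training pairs from a list of documents (each a list of sentences); the 50/50 sampling mirrors the description above, everything else is an illustrative assumption:

```python
import random

def make_nsp_pair(documents):
    """Return (sentence_a, sentence_b, is_next) for the NSP objective."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)             # needs docs with >= 2 sentences
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], True        # 50%: actual next sentence (IsNext)
    other = random.choice(documents)                 # simplification: may pick the same doc
    return sentence_a, random.choice(other), False   # 50%: random sentence (NotNext)

docs = [["Sentence one.", "Sentence two.", "Sentence three."],
        ["Another document starts.", "And it continues here."]]
print(make_nsp_pair(docs))
```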

Pre-training & Fine-tuning

  • Pre-training: the model is trained on unlabelled data.
  • Fine-tuning: initialize with the pre-trained params and fine-tune with labelled data for downstream tasks.

Size

  • BERT_base: L (num layers) = 12, H (hidden size) = 768, A (attention heads) = 12, params = 110M
  • BERT_large: L = 24, H = 1024, A = 16, params = 340M (a rough parameter-count check is sketched below)
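
A rough back-of-the-envelope parameter count for the two configurations (assuming a ~30K WordPiece vocabulary, 512 position embeddings, and a feed-forward inner size of 4H; biases and LayerNorms are ignored, so the totals land slightly under the reported 110M/340M):

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-style encoder (biases/LayerNorms omitted).
    Note: the number of attention heads A does not change the count,
    since each head has size H/A."""
    embeddings = vocab * H + max_pos * H + 2 * H   # token + position + segment
    attention  = 4 * H * H                         # Q, K, V and output projections
    ffn        = 2 * H * (4 * H)                   # two linear layers, inner size 4H
    per_layer  = attention + ffn                   # ~12 * H^2 per transformer layer
    return embeddings + L * per_layer + H * H      # + pooler on top of [CLS]

print(f"BERT_base  ~= {approx_bert_params(12, 768) / 1e6:.0f}M params")   # ~109M
print(f"BERT_large ~= {approx_bert_params(24, 1024) / 1e6:.0f}M params")  # ~335M
```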

Input/Output representations

  • Representations are designed so that they support multiple downstream tasks.
  • Both a single sentence and a sentence pair are represented as one input sequence.

Token Embeddings

  • WordPiece embeddings with a 30,000-token vocabulary are used.

Segment Embeddings

  • Additional embeddings are learnt to identify whether a token belongs to the first or second sentence.

Position Embeddings

  • These embeddings are learnt so that the representations are aware of the positions of the tokens in the sequence (BERT uses learned absolute position embeddings). The maximum sequence length is 512.

All the above embeddings are summed element-wise to form the input representation (see the sketch below).
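
A minimal PyTorch sketch of how the three embeddings are summed into the input representation (sizes match BERT_base; the class and layer names are illustrative, not the original implementation's):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token    = nn.Embedding(vocab_size, hidden)  # WordPiece tokens
        self.position = nn.Embedding(max_len, hidden)     # learned absolute positions
        self.segment  = nn.Embedding(2, hidden)           # sentence A vs sentence B
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
        return self.norm(x)  # summed, not concatenated

emb = BertInputEmbeddings()
token_ids   = torch.randint(0, 30522, (1, 8))            # batch of 1, 8 tokens
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])   # first/second sentence
print(emb(token_ids, segment_ids).shape)                 # torch.Size([1, 8, 768])
```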

Pre-training

  • The model with the configuration mentioned above is trained with the MLM and NSP objectives. Training data consists of the BooksCorpus (800M words) and English Wikipedia (2,500M words). A sketch of the combined objective follows.
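
A schematic sketch of the combined pre-training loss; only the structure (masked-LM loss plus next-sentence loss) comes from the paper, the toy shapes and the -100 padding convention are assumptions:

```python
import torch
import torch.nn as nn

mlm_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked positions
nsp_loss_fn = nn.CrossEntropyLoss()

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Total pre-training loss = masked-LM loss + next-sentence-prediction loss."""
    mlm = mlm_loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    nsp = nsp_loss_fn(nsp_logits, nsp_labels)
    return mlm + nsp

# Toy shapes: batch = 2, seq_len = 8, vocab = 30522, two NSP classes
mlm_logits = torch.randn(2, 8, 30522)
mlm_labels = torch.full((2, 8), -100, dtype=torch.long)
mlm_labels[0, 3] = 1234                      # one masked position to predict
nsp_logits = torch.randn(2, 2)
nsp_labels = torch.tensor([0, 1])            # IsNext / NotNext
print(pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels))
```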

Fine-tuning

  • An extra output layer is added and the entire model is fine-tuned for a few epochs.
  • For tasks with sentence pairs as input, such as NLI, question answering, and paraphrase detection, sentences A & B from pre-training correspond to the input pair.
  • For sentence classification, the final representation of the [CLS] token is sent to the output layer.
  • For token-level tasks, the final representations of all the tokens are sent to the output layer instead; a minimal fine-tuning sketch follows the list.
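
A minimal sentence-level fine-tuning sketch using the Hugging Face transformers library (this library choice is an assumption of this summary, not something the paper used): one new linear layer sits on top of the final [CLS] representation, and the whole model is then fine-tuned end to end.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSentenceClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # downloads weights
        # The only new parameters: a single output layer over [CLS]
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, **inputs):
        outputs = self.bert(**inputs)
        cls_repr = outputs.last_hidden_state[:, 0]   # final hidden state of [CLS]
        return self.classifier(cls_repr)             # task logits

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertSentenceClassifier(num_labels=2)
batch = tokenizer("A premise sentence.", "A hypothesis sentence.",
                  return_tensors="pt")               # sentence pair -> segments A & B
print(model(**batch).shape)                          # torch.Size([1, 2])
```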

Experiments

GLUE

  • Final hidden vector C of [CLS] is the aggregate representation.
  • Classification uses log(softmax(C W^T)), where W is the only new parameter matrix introduced for the task; a tiny numeric rendering follows the list.
  • 7.7% absolute improvement in GLUE score over the previous best: GPT - 72.8, BERT_large - 80.5.
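
A tiny numpy rendering of that classification step (C is the H-dimensional [CLS] vector and W the K×H classification weights, as described above; the random values and the choice of gold label are toy assumptions):

```python
import numpy as np

H, K = 768, 3                            # hidden size, number of task labels
rng = np.random.default_rng(0)
C = rng.standard_normal(H)               # final hidden vector of [CLS]
W = rng.standard_normal((K, H)) * 0.02   # the only new task-specific parameters

logits = W @ C                                       # shape (K,)
log_probs = logits - np.log(np.exp(logits).sum())    # log(softmax(C W^T))
loss = -log_probs[1]                                 # NLL of the gold label (here: 1)
print(log_probs, loss)
```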

SQuAD 1.1

  • Question and passage are packed into a single sequence.
  • A start vector S and an end vector E are introduced during fine-tuning; the probability of token i being the start of the answer is a softmax over S·T_i across all tokens (and analogously for the end). See the sketch below.
  • Absolute increase of 1.5 F1 over the previous best system.
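
A sketch of that span-prediction head (PyTorch; S, E, and T follow the paper's notation, while the random tensor values are placeholders):

```python
import torch

H, seq_len = 768, 64
T = torch.randn(seq_len, H)   # final hidden vectors of the paragraph tokens
S = torch.randn(H)            # learned start vector (new parameter)
E = torch.randn(H)            # learned end vector (new parameter)

start_probs = torch.softmax(T @ S, dim=0)   # P(token i starts the answer)
end_probs   = torch.softmax(T @ E, dim=0)   # P(token j ends the answer)

# Predicted span: the (i, j) with j >= i that maximizes S.T_i + E.T_j
scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)   # scores[i, j]
valid  = torch.ones(seq_len, seq_len).triu().bool()    # keep only j >= i
scores = scores.masked_fill(~valid, float("-inf"))
i, j = divmod(torch.argmax(scores).item(), seq_len)
print("predicted span:", i, j)
```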

SQuAD 2.0

  • Questions with no answer are treated as having an answer span that starts and ends at [CLS]; the no-answer decision rule is sketched below.
  • Absolute increase of 5.1 F1 over the previous best system.
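
A sketch of the no-answer decision rule: the score of the null span (start and end at [CLS]) is compared to the best non-null span score, with a threshold tau tuned on the dev set (the default value 0.0 here is just a placeholder):

```python
import torch

def predict_answer(S, E, C, T, tau=0.0):
    """Return (start, end) of the best answer span, or None for 'no answer'.
    C is the final [CLS] vector, T the final vectors of the paragraph tokens."""
    s_null = S @ C + E @ C                                 # score of the null span
    scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)   # scores[i, j] = S.T_i + E.T_j
    valid  = torch.ones_like(scores).triu().bool()         # require j >= i
    scores = scores.masked_fill(~valid, float("-inf"))
    best = scores.max()
    if best > s_null + tau:                                # answerable question
        i, j = divmod(torch.argmax(scores).item(), scores.size(1))
        return i, j
    return None                                            # predict "no answer"

H, seq_len = 768, 64
print(predict_answer(torch.randn(H), torch.randn(H),
                     torch.randn(H), torch.randn(seq_len, H)))
```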