Summary - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Arxiv Link: https://arxiv.org/abs/1810.04805

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Highlights

  • Prior to BERT, language model pre-training techniques such as OpenAI GPT relied only on uni-directional LMs.
  • Bidirectionality helps in learning representations that transfer well to downstream tasks such as question answering.
  • A single additional output layer plus fine-tuning achieves state-of-the-art results on multiple downstream tasks.
  • No task-specific architecture modifications are needed.

Feature based vs Fine-tuning based pre-trained representations

  • Models such as ELMo use the pre-trained representations as additional features on top of a task-specific architecture.
  • GPT introduces only a minimal set of task-specific params and is trained on the downstream tasks while fine-tuning the pre-trained params.
  • Both approaches have so far leveraged only uni-directional LMs.

BERT objectives

MLM - Masked Language Model

  • Randomly mask some of the input tokens.
  • Predict the original vocab ID of the masked token based only on its context.
  • The usual left-to-right LM objective (next-word prediction) becomes trivial in a bidirectional setting, since each word could indirectly "see itself"; masking avoids this.
  • 15% of the input tokens are masked at random (of these, 80% are replaced with [MASK], 10% with a random token, and 10% left unchanged); see the sketch below.
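
A minimal sketch of the masking procedure described above (plain Python; the toy vocabulary and pre-tokenized input are assumptions for illustration, not the paper's implementation):

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style MLM masking: pick ~15% of positions, then
    80% -> [MASK], 10% -> random token, 10% -> left unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # None = position is not predicted
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                # model must recover the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK_TOKEN
            elif r < 0.9:
                masked[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return masked, labels

toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"], toy_vocab))
```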

NSP - Next Sentence Prediction

  • Jointly pre-train text representations with this objective.
  • Next sentence prediction with sentence pairs taken from a monolingual corpus.
  • 50% of the pairs are actual consecutive sentences (IsNext) and 50% have a randomly sampled second sentence (NotNext); see the sketch below.
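
A minimal sketch of constructing NSP training pairs from a list of documents (each a list of sentences); the 50/50 sampling mirrors the description above, everything else is an illustrative assumption:

```python
import random

def make_nsp_pair(documents):
    """Return (sentence_a, sentence_b, is_next) for the NSP objective."""
    doc = random.choice(documents)
    idx = random.randrange(len(doc) - 1)             # needs docs with >= 2 sentences
    sentence_a = doc[idx]
    if random.random() < 0.5:
        return sentence_a, doc[idx + 1], True        # 50%: actual next sentence (IsNext)
    other = random.choice(documents)                 # simplification: may pick the same doc
    return sentence_a, random.choice(other), False   # 50%: random sentence (NotNext)

docs = [["Sentence one.", "Sentence two.", "Sentence three."],
        ["Another document starts.", "And it continues here."]]
print(make_nsp_pair(docs))
```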

Pre-training & Fine-tuning

  • Pre-training: the model is trained on unlabelled data.
  • Fine-tuning: initialize with the pre-trained params and fine-tune with labelled data for downstream tasks.

Size

  • BERT_base: L (num layers) = 12, H (hidden size) = 768, A (attention heads) = 12, params = 110M
  • BERT_large: L = 24, H = 1024, A = 16, params = 340M (a rough parameter-count check is sketched below)
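
A rough back-of-the-envelope parameter count for the two configurations (assuming a ~30K WordPiece vocabulary, 512 position embeddings, and a feed-forward inner size of 4H; biases and LayerNorms are ignored, so the totals land slightly under the reported 110M/340M):

```python
def approx_bert_params(L, H, vocab=30522, max_pos=512):
    """Rough parameter count for a BERT-style encoder (biases/LayerNorms omitted).
    Note: the number of attention heads A does not change the count,
    since each head has size H/A."""
    embeddings = vocab * H + max_pos * H + 2 * H   # token + position + segment
    attention  = 4 * H * H                         # Q, K, V and output projections
    ffn        = 2 * H * (4 * H)                   # two linear layers, inner size 4H
    per_layer  = attention + ffn                   # ~12 * H^2 per transformer layer
    return embeddings + L * per_layer + H * H      # + pooler on top of [CLS]

print(f"BERT_base  ~= {approx_bert_params(12, 768) / 1e6:.0f}M params")   # ~109M
print(f"BERT_large ~= {approx_bert_params(24, 1024) / 1e6:.0f}M params")  # ~335M
```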

Input/Output representations

  • Representations are designed so that they support multiple downstream tasks.
  • Both a single sentence and a sentence pair are represented as one input sequence.

Token Embeddings

  • WordPiece embeddings with a 30,000-token vocabulary are used.

Segment Embeddings

  • Additional embeddings are learnt to identify whether a token belongs to the first or second sentence.

Position Embeddings

  • These embeddings are learnt so that the representations are aware of the positions of the tokens in the sequence (BERT uses learned absolute position embeddings). The maximum sequence length is 512.

All the above embeddings are summed element-wise to form the input representation (see the sketch below).
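
A minimal PyTorch sketch of how the three embeddings are summed into the input representation (sizes match BERT_base; the class and layer names are illustrative, not the original implementation's):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512):
        super().__init__()
        self.token    = nn.Embedding(vocab_size, hidden)  # WordPiece tokens
        self.position = nn.Embedding(max_len, hidden)     # learned absolute positions
        self.segment  = nn.Embedding(2, hidden)           # sentence A vs sentence B
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
        return self.norm(x)  # summed, not concatenated

emb = BertInputEmbeddings()
token_ids   = torch.randint(0, 30522, (1, 8))            # batch of 1, 8 tokens
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])   # first/second sentence
print(emb(token_ids, segment_ids).shape)                 # torch.Size([1, 8, 768])
```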

Pre-training

  • The model with the configuration mentioned above is trained with the MLM and NSP objectives. Training data consists of the BooksCorpus (800M words) and English Wikipedia (2,500M words). A sketch of the combined objective follows.
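
A schematic sketch of the combined pre-training loss; only the structure (masked-LM loss plus next-sentence loss) comes from the paper, the toy shapes and the -100 padding convention are assumptions:

```python
import torch
import torch.nn as nn

mlm_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks unmasked positions
nsp_loss_fn = nn.CrossEntropyLoss()

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Total pre-training loss = masked-LM loss + next-sentence-prediction loss."""
    mlm = mlm_loss_fn(mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1))
    nsp = nsp_loss_fn(nsp_logits, nsp_labels)
    return mlm + nsp

# Toy shapes: batch = 2, seq_len = 8, vocab = 30522, two NSP classes
mlm_logits = torch.randn(2, 8, 30522)
mlm_labels = torch.full((2, 8), -100, dtype=torch.long)
mlm_labels[0, 3] = 1234                      # one masked position to predict
nsp_logits = torch.randn(2, 2)
nsp_labels = torch.tensor([0, 1])            # IsNext / NotNext
print(pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels))
```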

Fine-tuning

  • An extra output layer is added and the entire model is fine-tuned for a few epochs.
  • For tasks with sentence pairs as input, such as NLI, question answering, and paraphrase detection, sentences A & B from pre-training correspond to the input pair.
  • For sentence classification, the final representation of the [CLS] token is sent to the output layer.
  • For token-level tasks, the final representations of all the tokens are sent to the output layer instead; a minimal fine-tuning sketch follows the list.
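
A minimal sentence-level fine-tuning sketch using the Hugging Face transformers library (this library choice is an assumption of this summary, not something the paper used): one new linear layer sits on top of the final [CLS] representation, and the whole model is then fine-tuned end to end.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSentenceClassifier(nn.Module):
    def __init__(self, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")  # downloads weights
        # The only new parameters: a single output layer over [CLS]
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, **inputs):
        outputs = self.bert(**inputs)
        cls_repr = outputs.last_hidden_state[:, 0]   # final hidden state of [CLS]
        return self.classifier(cls_repr)             # task logits

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertSentenceClassifier(num_labels=2)
batch = tokenizer("A premise sentence.", "A hypothesis sentence.",
                  return_tensors="pt")               # sentence pair -> segments A & B
print(model(**batch).shape)                          # torch.Size([1, 2])
```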

Experiments

GLUE

  • Final hidden vector C of [CLS] is the aggregate representation.
  • Classification uses log(softmax(C W^T)), where W is the only new parameter matrix introduced for the task; a tiny numeric rendering follows the list.
  • 7.7% absolute improvement in GLUE score over the previous best: GPT - 72.8, BERT_large - 80.5.
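
A tiny numpy rendering of that classification step (C is the H-dimensional [CLS] vector and W the K×H classification weights, as described above; the random values and the choice of gold label are toy assumptions):

```python
import numpy as np

H, K = 768, 3                            # hidden size, number of task labels
rng = np.random.default_rng(0)
C = rng.standard_normal(H)               # final hidden vector of [CLS]
W = rng.standard_normal((K, H)) * 0.02   # the only new task-specific parameters

logits = W @ C                                       # shape (K,)
log_probs = logits - np.log(np.exp(logits).sum())    # log(softmax(C W^T))
loss = -log_probs[1]                                 # NLL of the gold label (here: 1)
print(log_probs, loss)
```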

SQuAD 1.1

  • Question and passage are packed into a single sequence.
  • A start vector S and an end vector E are introduced during fine-tuning; the probability of token i being the start of the answer is a softmax over S·T_i across all tokens (and analogously for the end). See the sketch below.
  • Absolute increase of 1.5 F1 over the previous best system.
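
A sketch of that span-prediction head (PyTorch; S, E, and T follow the paper's notation, while the random tensor values are placeholders):

```python
import torch

H, seq_len = 768, 64
T = torch.randn(seq_len, H)   # final hidden vectors of the paragraph tokens
S = torch.randn(H)            # learned start vector (new parameter)
E = torch.randn(H)            # learned end vector (new parameter)

start_probs = torch.softmax(T @ S, dim=0)   # P(token i starts the answer)
end_probs   = torch.softmax(T @ E, dim=0)   # P(token j ends the answer)

# Predicted span: the (i, j) with j >= i that maximizes S.T_i + E.T_j
scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)   # scores[i, j]
valid  = torch.ones(seq_len, seq_len).triu().bool()    # keep only j >= i
scores = scores.masked_fill(~valid, float("-inf"))
i, j = divmod(torch.argmax(scores).item(), seq_len)
print("predicted span:", i, j)
```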

SQuAD 2.0

  • Questions with no answer are treated as having an answer span that starts and ends at [CLS]; the no-answer decision rule is sketched below.
  • Absolute increase of 5.1 F1 over the previous best system.
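
A sketch of the no-answer decision rule: the score of the null span (start and end at [CLS]) is compared to the best non-null span score, with a threshold tau tuned on the dev set (the default value 0.0 here is just a placeholder):

```python
import torch

def predict_answer(S, E, C, T, tau=0.0):
    """Return (start, end) of the best answer span, or None for 'no answer'.
    C is the final [CLS] vector, T the final vectors of the paragraph tokens."""
    s_null = S @ C + E @ C                                 # score of the null span
    scores = (T @ S).unsqueeze(1) + (T @ E).unsqueeze(0)   # scores[i, j] = S.T_i + E.T_j
    valid  = torch.ones_like(scores).triu().bool()         # require j >= i
    scores = scores.masked_fill(~valid, float("-inf"))
    best = scores.max()
    if best > s_null + tau:                                # answerable question
        i, j = divmod(torch.argmax(scores).item(), scores.size(1))
        return i, j
    return None                                            # predict "no answer"

H, seq_len = 768, 64
print(predict_answer(torch.randn(H), torch.randn(H),
                     torch.randn(H), torch.randn(seq_len, H)))
```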