🦭 BERT#

Category of Task: NLP

Kind of Task: Masked Language Modeling

Overview#

The BERT model family is a set of transformer-based networks for masked language modeling at various scales. The family was originally proposed by Google AI and was trained on BooksCorpus (800M words) and English Wikipedia (2,500M words). It is useful for downstream language classification tasks such as Sentence Classification, Sentiment Analysis, Sentence Similarity, and Natural Language Inference.
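To make the masked language modeling task concrete, the sketch below runs masked-token prediction with the off-the-shelf bert-base-uncased checkpoint from the Hugging Face Hub. The checkpoint choice is an assumption for illustration only, not a checkpoint produced by this codebase.

```python
# Minimal masked language modeling demo using a public checkpoint
# ("bert-base-uncased" is an assumption for illustration, not a model
# trained by this codebase).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The pipeline ranks candidate tokens for the [MASK] position.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```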

Attribution#

The BERT model family is described in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.

Architecture#

BERT consists of a multi-layer bidirectional Transformer encoder, parameterized by the number of layers \(n_{\text{layers}}\) (num_hidden_layers), the hidden size \(d_{\text{hidden}}\) (hidden_size), and the number of attention heads \(n_{\text{heads}}\) (num_attention_heads). The parameters for each model family member are listed below:

| Name | Parameters | \(n_{\text{layers}}\) | \(d_{\text{hidden}}\) | \(n_{\text{heads}}\) |
| --- | --- | --- | --- | --- |
| BERT-Base | 110M | 12 | 768 | 12 |
| BERT-Large | 340M | 24 | 1024 | 16 |

Family Members#

We chose to implement BERT-Base as it is small enough to rapidly test methods using a single GPU node.

| Model Family Member | Parameters | Training Hours on 8x A100s | Training Tokens | Cross Entropy Loss | Masked Accuracy |
| --- | --- | --- | --- | --- | --- |
| BERT-Base | 110M | 10h 38m | 35.2B | 1.59 | 0.67 |
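For reference, both metrics are computed only over masked positions. A minimal sketch is shown below, assuming the standard Hugging Face labeling convention in which non-masked positions carry a label of -100; the exact metric code in this repository may differ.

```python
import torch
import torch.nn.functional as F

def mlm_metrics(logits: torch.Tensor, labels: torch.Tensor):
    """Cross entropy loss and accuracy over masked positions only.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) original token ids at masked positions,
            -100 everywhere else (standard Hugging Face convention).
    """
    flat_logits = logits.view(-1, logits.size(-1))
    flat_labels = labels.view(-1)

    # ignore_index=-100 restricts the loss to masked tokens.
    loss = F.cross_entropy(flat_logits, flat_labels, ignore_index=-100)

    mask = flat_labels != -100
    predictions = flat_logits.argmax(dim=-1)
    accuracy = (predictions[mask] == flat_labels[mask]).float().mean()
    return loss.item(), accuracy.item()
```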

Implementation Details#

Our codebase builds on the Hugging Face Transformers library. We initialize Hugging Face's BERT model with one of our configurations.
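As a rough sketch of that initialization (the configuration files in this repository may set additional fields), the BERT-Base row of the architecture table maps onto a Hugging Face BertConfig as follows:

```python
from transformers import BertConfig, BertForMaskedLM

# BERT-Base hyperparameters from the architecture table; any field not
# listed falls back to the Hugging Face defaults (an assumption).
config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,  # the standard 4 * hidden_size feed-forward width
)

# A randomly initialized model for masked language modeling pre-training.
model = BertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```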

Our training recipe is based on How to Train BERT with an Academic Budget by Peter Izsak, Moshe Berchansky, and Omer Levy. Specifically, we skip the Next Sentence Prediction loss and use a fixed sequence length of 128.
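A sketch of that setup with Hugging Face's data collator is shown below. The 15% masking probability is the value from the original BERT paper and is an assumption about this recipe; the tokenize function is a placeholder for however the raw text is preprocessed.

```python
from transformers import BertTokenizerFast, DataCollatorForLanguageModeling

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Dynamic masking for the MLM objective only; no Next Sentence Prediction
# pairs are constructed. The 15% masking rate follows the original BERT
# paper and is an assumption about this recipe.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

def tokenize(batch):
    # Truncate/pad every example to the fixed sequence length of 128.
    return tokenizer(
        batch["text"], truncation=True, max_length=128, padding="max_length"
    )
```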

After reproducing the original work, we decided to pre-train with the C4 dataset (Colossal Clean Crawled Corpus) from Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. C4 has been shown to yield better results on downstream tasks.
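For illustration, C4 can be streamed from the Hugging Face Hub roughly as follows; the allenai/c4 dataset id and the use of streaming are assumptions, and this repository may stage the data differently.

```python
from datasets import load_dataset

# Stream the English split of C4 so the full corpus never needs to be
# downloaded up front (an assumption about how the data is staged).
c4_train = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in c4_train.take(2):
    print(example["text"][:80])
```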