📚 GPT-2#

Category of Task: NLP

Kind of Task: Autoregressive Language Modeling

Overview#

The GPT-2 model family is a set of transformer-based networks for autoregressive language modeling at various scales. The family was originally proposed by OpenAI; the members implemented here are trained on the OpenWebText dataset. It is useful for downstream language generation tasks such as summarization, translation, and dialog.

Attribution#

The GPT model family is described in Language Models are Unsupervised Multitask Learners by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.

The scaling law that we use to choose the members of this model family is described in Scaling Laws for Neural Language Models by Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei.

Architecture#

GPT-2 consists of a decoder-only Transformer parameterized by \(n_{layer}\), \(d_{model}\), \(d_{ff}\), \(d_{attn}\), and \(n_{heads}\). The hyperparameters for each model family member are listed below:

| Name | \(n_{layer}\) | \(d_{model}\) | \(d_{ff}\) | \(d_{attn}\) | \(n_{heads}\) |
| --- | --- | --- | --- | --- | --- |
| GPT-2 52M | 8 | 512 | 2048 | 8 | 8 |
| GPT-2 83M | 10 | 640 | 2560 | 640 | 10 |
| GPT-2 125M | 12 | 768 | 3072 | 768 | 12 |
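
As an illustration, the GPT-2 125M row can be expressed as a Hugging Face `GPT2Config`. This is a minimal sketch: the mapping of table symbols to `GPT2Config` arguments and the default vocabulary/context sizes are assumptions, not values taken from the table.

```python
from transformers import GPT2Config

# GPT-2 125M row from the table above, expressed as a Hugging Face GPT2Config.
# Assumed mapping: n_layer -> n_layer, d_model -> n_embd, n_heads -> n_head,
# d_ff -> n_inner. Vocabulary size and context length are left at the GPT-2
# defaults (50257 tokens, 1024 positions); those defaults are assumptions here.
gpt2_125m_config = GPT2Config(
    n_layer=12,    # n_layer
    n_embd=768,    # d_model (d_attn equals d_model for this row)
    n_head=12,     # n_heads
    n_inner=3072,  # d_ff
)
```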

Family Members#

We implement three members of this family at different scales: GPT-2 52M, GPT-2 83M, and GPT-2 125M. These models are named after their approximate parameter counts. We selected these particular configurations because (1) they lie on the Pareto frontier of the scaling law for language models described by Kaplan et al. at OpenAI, and (2) they are small enough to allow rapid iteration on methods using a single GPU node.
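
For reference, the single-variable power-law fits from Kaplan et al. relate loss to the non-embedding parameter count \(N\) and the dataset size \(D\) in tokens:

\[
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},
\]

with reported exponents of approximately \(\alpha_N \approx 0.076\) and \(\alpha_D \approx 0.095\).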

| Model Family Member | Parameters | Training Time on 8x A100s (hh:mm) | Training Tokens | Final Loss | Predicted Perplexity | Actual Perplexity |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 52M | 53.9M | 02:44 | 4.6B | 3.43 | 32.54 | 30.88 |
| GPT-2 83M | 85.8M | 04:52 | 5.5B | 3.28 | 27.84 | 26.57 |
| GPT-2 125M | 114M | 08:25 | 6.7B | 3.18 | 24.64 | 24.04 |
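
The actual perplexities above are consistent with perplexity being computed as the exponential of the final cross-entropy loss; for GPT-2 52M, for example, \(e^{3.43} \approx 30.9\), matching the Actual Perplexity column.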

Implementation Details#

Our codebase builds on the Hugging Face Transformers library. We initialize Hugging Face's GPT-2 model with one of our configurations.
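
For example, instantiating a randomly initialized model from a configuration like the one sketched in the Architecture section might look like the following. This is a sketch under the assumptions noted there, not the exact training entry point used here.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Build a randomly initialized model at the GPT-2 125M scale from an explicit config.
config = GPT2Config(n_layer=12, n_embd=768, n_head=12, n_inner=3072)
model = GPT2LMHeadModel(config)

# Count parameters. The exact total depends on vocabulary size and context
# length, so it will not necessarily match the table above.
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```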

Exploring Tradeoffs Between Quality and Training Speed / Cost#

There are two ways to vary the time and cost of training a model: change the size of the model, or change the number of steps (and therefore the amount of data) used to train it. With the GPT-2 family of models, we explore both of these axes. To develop methods for these models, we generally begin with the smallest members of the family for initial experimentation and scale up once the ideas have been refined.

To explore tradeoffs between quality and the number of training steps, we ablate both the number of training steps and the amount of data trained on. We do this by checkpointing the model throughout training.
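
A minimal, generic sketch of this checkpointing pattern is shown below; the function is hypothetical and illustrative only, not the training loop used to produce these results.

```python
import os

import torch


def train_with_checkpoints(model, optimizer, train_dataloader,
                           checkpoint_every=1000, out_dir="checkpoints"):
    """Save a checkpoint every `checkpoint_every` steps so that any
    intermediate step count can later be evaluated on its own."""
    os.makedirs(out_dir, exist_ok=True)
    for step, batch in enumerate(train_dataloader, start=1):
        loss = model(**batch).loss  # assumes a Hugging Face-style model output
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step % checkpoint_every == 0:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                },
                os.path.join(out_dir, f"step_{step}.pt"),
            )
```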

To explore tradeoffs between quality and the size of the model, we use "Scaling Laws for Neural Language Models" to provide suggestions on model capacity and dataset size and then sweep hyperparameters such as learning rate and batch size to minimize loss.
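
As an illustration of the sweep step, a simple grid search could be structured as follows. This is a sketch: `train_and_evaluate` is a hypothetical callable standing in for a full training run, and the grid values are left to the caller.

```python
import itertools
from typing import Callable, Iterable, Tuple


def grid_sweep(train_and_evaluate: Callable[[float, int], float],
               learning_rates: Iterable[float],
               batch_sizes: Iterable[int]) -> Tuple[float, float, int]:
    """Return (best_loss, learning_rate, batch_size) over a simple grid.

    `train_and_evaluate` should train a model with the given learning rate
    and batch size and return its final loss.
    """
    best = None
    for lr, bs in itertools.product(learning_rates, batch_sizes):
        loss = train_and_evaluate(lr, bs)
        if best is None or loss < best[0]:
            best = (loss, lr, bs)
    return best
```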