Scaling Laws

Applicable Settings: Autoregressive Language Modeling, NLP

Effects: Decreased Wall Clock Time

Kind: Best Practice

Tags: Speedup

TLDR

Scaling Laws capture the optimal training regime for a fixed compute budget. Given a number of FLOPs (Floating Point Operations), scaling laws will determine the optimal number of parameters in your model and tokens in your dataset.

Attribution

Scaling Laws for Neural Language Models by Kaplan et al., posted to arXiv in 2020.

Applicable Settings

Scaling laws were originally studied in the context of autoregressive language modeling, which is the setting they are implemented for in the Composer repository. However, there is emerging research on scaling laws in other domains, such as Vision Transformers and autoregressive generative modeling. We recommend reading Jonathan Rosenfeld’s Ph.D. thesis, “Scaling Laws for Deep Learning“.

Hyperparameters

Scaling Laws do not have any hyperparameters.

Detailed Results

A Recap on PF-Days: FLOP/s, or floating-point operations per second, is a measure of computer performance that describes how many floating point operations a computer can perform per second. A PF-day is one petaFLOP/s (\(10^{15}\) FLOP/s) sustained for a full day, or roughly \(8.64 \times 10^{19}\) floating point operations. The following table provides the TFLOP/s capacity of various common hardware accelerators:

[Table: TFLOP/s capacity of common hardware accelerators]
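As a minimal sketch of working with these units, the snippet below converts a raw FLOP count into PF-days and gives a rough wall-clock estimate on a single accelerator. The throughput and utilization numbers are illustrative placeholders you would replace with figures for your own hardware; they are not values from the paper.

```python
# Minimal sketch: convert a FLOP budget into PF-days and a rough wall-clock
# estimate. The accelerator throughput and utilization below are illustrative
# placeholders, not figures from Kaplan et al.

PFLOP_PER_SECOND = 1e15
SECONDS_PER_DAY = 86_400
FLOPS_PER_PF_DAY = PFLOP_PER_SECOND * SECONDS_PER_DAY  # ~8.64e19 FLOPs


def flops_to_pf_days(total_flops: float) -> float:
    """Convert a total FLOP budget into PF-days."""
    return total_flops / FLOPS_PER_PF_DAY


def estimated_training_days(total_flops: float,
                            accelerator_tflops: float,
                            utilization: float = 0.3) -> float:
    """Rough wall-clock days on one accelerator.

    accelerator_tflops: peak TFLOP/s of the device (from its spec sheet).
    utilization: fraction of peak FLOP/s actually sustained during training
                 (an assumption; measure it for your own workload).
    """
    effective_flops_per_sec = accelerator_tflops * 1e12 * utilization
    return total_flops / (effective_flops_per_sec * SECONDS_PER_DAY)


# Example: a 1 PF-day budget on a hypothetical 300 TFLOP/s accelerator.
budget = 1.0 * FLOPS_PER_PF_DAY
print(flops_to_pf_days(budget))                # 1.0
print(estimated_training_days(budget, 300.0))  # ~11 days at 30% utilization
```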

Scaling Laws describe the relationship between the compute budget for training a model (in PF-days) and the final loss. The compute budget can be allocated between two quantities: model size (parameters) and dataset size (tokens). For most neural network training regimes, we can further break the dataset size into the optimization batch size and the number of parameter updates (“steps”).

Scaling Laws predict that the loss is minimized by allocating the compute budget according to the following ratio: 73% of FLOPs to larger models, 24% to larger batch sizes, and 3% to more parameter updates. Kaplan et al. report additional empirical results regarding the relationship between dataset size, model capacity, and test loss:
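As a rough sketch of what this allocation means in practice (under the reading that the percentages correspond to the fitted exponents \(N \propto C_{min}^{0.73}\), \(B \propto C_{min}^{0.24}\), and \(S \propto C_{min}^{0.03}\)): if the compute budget grows by a factor \(k\), the model size, batch size, and number of steps should grow roughly as \(k^{0.73}\), \(k^{0.24}\), and \(k^{0.03}\).

```python
# Sketch: how to split a k-fold increase in compute, using the exponents
# behind the 73% / 24% / 3% allocation above (an interpretation of the
# fitted relations N ∝ C^0.73, B ∝ C^0.24, S ∝ C^0.03).

ALLOCATION_EXPONENTS = {
    "model_parameters": 0.73,
    "batch_size": 0.24,
    "serial_steps": 0.03,
}


def scale_factors(compute_multiplier: float) -> dict:
    """Growth factor for each quantity when compute grows by `compute_multiplier`."""
    return {name: compute_multiplier ** exponent
            for name, exponent in ALLOCATION_EXPONENTS.items()}


# Example: a 10x larger compute budget.
for name, factor in scale_factors(10.0).items():
    print(f"{name}: x{factor:.2f}")
# model_parameters: x5.37, batch_size: x1.74, serial_steps: x1.07
```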

1. Studying the unlimited data regime

For most supervised learning, training for multiple epochs on a manually labeled dataset has been standard practice. However, for self-supervised autoregressive modeling, data is relatively abundant. Therefore, one must ask: for a given model size, how much data should one train on?

For instance, BERT was trained on ~16GB of text for roughly 40 epochs, GPT-2 on ~40GB of text, and RoBERTa scaled that up to ~160GB of text, again repeated over many epochs. However, since a larger training corpus can easily be generated by scraping more of the internet, how much text should any particular model use for training?


Source: Figure 4 (left) from “Scaling Laws for Neural Language Models”

Kaplan et al. found that, above a minimum model size, test loss obeys a predictable trend. The following equation describes the predicted number of tokens to efficiently train on for a particular compute budget \(C_{min}\):

\(D_{opt} = D_e \cdot C_{min}^{p_D}\), where \(p_D = 0.27\) and \(D_e = 2 \times 10^{10}\) tokens
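As a quick sketch of using this relation directly, the snippet below evaluates \(D_{opt}\) for a few compute budgets (with \(C_{min}\) in PF-days), using the constants quoted above.

```python
# Sketch: optimal number of training tokens for a compute budget C_min
# (in PF-days), using D_opt = D_e * C_min^{p_D} with the constants above.

P_D = 0.27
D_E = 2e10  # tokens


def optimal_tokens(c_min_pf_days: float) -> float:
    return D_E * c_min_pf_days ** P_D


for c_min in (1.0, 10.0, 100.0):
    print(f"C_min = {c_min:>5} PF-days -> ~{optimal_tokens(c_min):.2e} tokens")
# ~2.0e10, ~3.7e10, ~6.9e10 tokens
```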

2. Performance depends weakly on model architecture


Source: Figure 5 from “Scaling Laws for Neural Language Models”

Kaplan et al. also provide a recipe for allocating parameters within a model architecture. Because loss depends only weakly on architectural shape, choosing reasonable ratios between the key architectural dimensions is enough to fully determine a model that should minimize loss for a given parameter budget.

Similar to the equation for the optimal number of training tokens, we can write an equation for the optimal number of non-embedding parameters \(N\) for a given compute budget \(C_{min}\):

\(N_{opt} = N_e \cdot C_{min}^{p_N}\), where \(p_N = 0.73\) and \(N_e = 1.3 \times 10^9\) parameters.

Kaplan et al. parameterize a Transformer using \(n_{layer}\), \(d_{model}\), \(d_{ff}\), \(d_{attn}\), and \(n_{heads}\), where the number of non-embedding parameters is approximately:

\(N = 12 \cdot n_{layer} \cdot d_{model}^2\), where \(d_{attn} = d_{ff} / 4 = d_{model}\).
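For reference, the factor of 12 follows from the standard per-layer parameter count: the four attention projection matrices contribute \(4 d_{model} d_{attn}\) parameters per layer and the two feed-forward matrices contribute \(2 d_{model} d_{ff}\), so with \(d_{attn} = d_{model}\) and \(d_{ff} = 4 d_{model}\):

\[
N \approx n_{layer}\left(4 d_{model} d_{attn} + 2 d_{model} d_{ff}\right) = n_{layer}\left(4 d_{model}^2 + 8 d_{model}^2\right) = 12 \cdot n_{layer} \cdot d_{model}^2
\]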

Therefore, given a single quantity, the compute budget \(C_{min}\), we can approximate the model capacity, architecture, and dataset size that should be used.
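To make this concrete, here is a minimal sketch that ties the pieces together: given a compute budget \(C_{min}\) in PF-days, it returns approximate targets for the number of non-embedding parameters, the number of training tokens, and a width/depth pair satisfying \(N = 12 \cdot n_{layer} \cdot d_{model}^2\). The aspect ratio \(d_{model} / n_{layer}\) is an illustrative assumption (performance depends only weakly on it), not a value prescribed by the equations above.

```python
# Sketch: from a compute budget C_min (in PF-days) to an approximate model
# and data configuration, using the fitted relations quoted above. The
# aspect ratio (d_model / n_layer) is an illustrative assumption.
P_N, N_E = 0.73, 1.3e9  # N_opt = N_e * C_min^{p_N}  (non-embedding parameters)
P_D, D_E = 0.27, 2e10   # D_opt = D_e * C_min^{p_D}  (training tokens)


def suggest_config(c_min_pf_days: float, aspect_ratio: float = 64.0) -> dict:
    """Approximate model/data targets for a compute budget given in PF-days."""
    n_opt = N_E * c_min_pf_days ** P_N
    d_opt = D_E * c_min_pf_days ** P_D

    # Solve N = 12 * n_layer * d_model^2 with d_model = aspect_ratio * n_layer:
    # N = 12 * aspect_ratio^2 * n_layer^3  =>  n_layer = (N / (12 * aspect_ratio^2))^(1/3)
    n_layer = max(1, round((n_opt / (12 * aspect_ratio ** 2)) ** (1 / 3)))
    d_model = round(aspect_ratio * n_layer)

    return {
        "non_embedding_params": n_opt,
        "training_tokens": d_opt,
        "n_layer": n_layer,
        "d_model": d_model,
        "d_attn": d_model,    # d_attn = d_model, as above
        "d_ff": 4 * d_model,  # d_ff = 4 * d_model, as above
    }


# Example: a 10 PF-day budget -> roughly 7e9 non-embedding parameters and
# ~3.7e10 tokens, with a ~52-layer Transformer at d_model ~3300.
print(suggest_config(10.0))
```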

Considerations

We would like to emphasize that while we have found Scaling Laws to be a reasonable starting point, they are not an exhaustive recipe for training. We have still found benefit in sweeping hyperparameters such as the learning rate. We view Scaling Laws as suggestions for where to start; sweeping the neighborhoods around those suggestions can be extremely fruitful (with as much as 20% better performance for the same budget).

Effects & Implications

[Figure: test loss versus tokens processed and compute, from “Scaling Laws for Neural Language Models”]

Scaling Laws for Neural Language Models aims to answer the question: “how should we train models in a compute-efficient manner?”

The x-axes of the plots above show that an inefficiently configured model can require nearly an order of magnitude more tokens or FLOPs to reach the same loss as a compute-efficient one. Therefore, applying Scaling Laws to determine our model capacity, dataset size, and architecture can save close to an order of magnitude of compute while maintaining performance.