⛩️ Gated Linear Units#
[How to Use] - [Suggested Hyperparameters] - [Technical Details] - [Attribution]
Natural Language Processing
Gated Linear Units replaces the projection matrices in the feed-forward block with Gated Linear Units.
*These equations compare the projection matrices in a standard feed-forward network and a Gated Linear Unit.
Following (Shazeer, 2020), we omit the use of bias terms. \(\cdot\) represents a dot product.*
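For reference, the comparison can be written out as follows. This is a reconstruction in the notation of Shazeer (2020), not a copy of the original figure. It assumes \(x\) is the block input, \(W_1\) and \(W_2\) are the usual up- and down-projection matrices, \(W_3\) is the additional projection introduced by the gated variant, \(\sigma\) is the activation function, and \(\otimes\) is an element-wise product (the \(\otimes\) symbol is our assumption):

```latex
\[
\mathrm{FFN}(x) = \sigma(x \cdot W_1) \cdot W_2
\]
\[
\mathrm{FFN}_{\mathrm{GLU}}(x) = \bigl(\sigma(x \cdot W_1) \otimes (x \cdot W_3)\bigr) \cdot W_2
\]
```

In the gated form, the activated projection \(\sigma(x \cdot W_1)\) acts as a gate that rescales each feature of the linear projection \(x \cdot W_3\) before the output projection.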
How to Use#
Functional Interface#
# Apply surgery on the model to swap the feed-forward block
# for a gated feed-forward block using the Composer Functional API
import torch
import torch.nn.functional as F
import composer.functional as cf
def training_loop(model, train_loader):
    # Replace the feed-forward blocks before creating the optimizer,
    # so the optimizer tracks the new parameters
    cf.apply_gated_linear_units(model)
    opt = torch.optim.Adam(model.parameters())
    loss_fn = F.cross_entropy
    model.train()
    for X, y in train_loader:
        y_hat = model(X)
        loss = loss_fn(y_hat, y)
        loss.backward()
        opt.step()
        opt.zero_grad()
Composer Trainer#
from composer.trainer import Trainer
from composer.algorithms import GatedLinearUnits
trainer = Trainer(model=model,
                  train_dataloader=train_dataloader,
                  eval_dataloader=eval_dataloader,
                  max_duration='1ep',
                  algorithms=[GatedLinearUnits()])
trainer.fit()
Implementation Details#
Gated Linear Units provide a more expressive form for a feed-forward block by performing a "gating" operation on the input matrix. The careful reader will recognize that we introduce a new weight matrix, \(W_3\). To keep experiments iso-parameter, we scale \(D_{ff}\) by \(\frac{2}{3}\), so that the three matrices of the gated block hold the same number of weights as the two matrices of the standard block.
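The arithmetic behind the \(\frac{2}{3}\) factor can be checked with a short sketch. The helper names and the sizes below are illustrative assumptions, not part of Composer, and biases are omitted:

```python
def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a bias-free standard block: W_1 (d_model x d_ff) and W_2 (d_ff x d_model)."""
    return 2 * d_model * d_ff


def glu_params(d_model: int, d_ff: int) -> int:
    """Weights in a bias-free gated block: W_1 and W_3 (d_model x d_ff) plus W_2 (d_ff x d_model)."""
    return 3 * d_model * d_ff


# Illustrative sizes only (hypothetical, not taken from the source).
d_model, d_ff = 768, 3072
d_ff_glu = 2 * d_ff // 3  # scale the hidden dimension by 2/3

print(d_ff_glu)                        # 2048
print(ffn_params(d_model, d_ff))       # 4718592
print(glu_params(d_model, d_ff_glu))   # 4718592
```

With the hidden dimension scaled by \(\frac{2}{3}\), the gated block has exactly as many weights as the standard block whenever \(D_{ff}\) is divisible by three.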
This algorithm significantly improves convergence, at the cost of a slight degradation in throughput. We recommend training with bias = False, even if biases are enabled in the rest of your model. Doing so substantially improved throughput and convergence.
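As a rough illustration of the gating operation, here is a minimal PyTorch sketch of a gated feed-forward block. It is not the Composer implementation: the class name `GatedFeedForward`, its attribute names, the default GELU activation, and the choice to keep \(W_2\) bias-free are all assumptions made for this example.

```python
from typing import Optional

import torch
import torch.nn as nn


class GatedFeedForward(nn.Module):
    """Hypothetical sketch of a gated block: (act(x W_1) * (x W_3)) W_2."""

    def __init__(self, d_model: int, d_ff: int,
                 act_fn: Optional[nn.Module] = None, bias: bool = False):
        super().__init__()
        self.gated_layer = nn.Linear(d_model, d_ff, bias=bias)      # W_1, passed through act_fn
        self.non_gated_layer = nn.Linear(d_model, d_ff, bias=bias)  # W_3, kept linear
        self.output_layer = nn.Linear(d_ff, d_model, bias=False)    # W_2, bias-free (assumption)
        self.act_fn = act_fn if act_fn is not None else nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The activated projection gates the linear projection element-wise.
        gate = self.act_fn(self.gated_layer(x))
        return self.output_layer(gate * self.non_gated_layer(x))


block = GatedFeedForward(d_model=64, d_ff=128)
x = torch.randn(2, 10, 64)  # (batch, sequence, features)
y = block(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

The block maps its input back to the model dimension, so it can stand in wherever a standard feed-forward block with the same input and output sizes is used.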
Suggested Hyperparameters#
While hyperparameters can vary significantly per use case, we recommend training with
act_fn = {ReLU, GeLU},
gated_layer_bias = False,
non_gated_layer_bias = False
We observed that, on average, GeLU activation functions performed marginally better than ReLU activation functions, and that both GeLU and ReLU performed significantly better than Swish and Squared ReLU. We also observed a significant benefit from setting bias = False for both weight matrices \(W_1\) and \(W_3\).
Technical Details#
While there are many hypotheses for the performance of Gated Linear Units, the community lacks a thorough investigation of them. The algorithm has been shown to perform well empirically, and it remains an open question why step-wise convergence is significantly better without bias terms than with them. Furthermore, to maximize throughput, the user should make sure that the scaled-down feature dimension when using GLUs is still a multiple of eight.
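One way to satisfy the multiple-of-eight constraint is to round the scaled dimension up after applying the \(\frac{2}{3}\) factor. The helper below is a hypothetical sketch, not a Composer function. Rounding up rather than down is a choice made for this example, and it slightly increases the parameter count when the scaled dimension is not already a multiple of eight.

```python
def scaled_d_ff(d_ff: int, multiple: int = 8) -> int:
    """Scale d_ff by 2/3 and round up to the next multiple of `multiple`."""
    # Ceiling division written with integers to avoid floating-point rounding.
    units = -(-2 * d_ff // (3 * multiple))
    return units * multiple


print(scaled_d_ff(3072))  # 2048 (2/3 * 3072 is already a multiple of eight)
print(scaled_d_ff(4096))  # 2736 (2/3 * 4096 is about 2730.67, rounded up)
```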
Attribution#
The Composer implementation of this method and the accompanying documentation were produced by Moin Nadeem at MosaicML.