In recent years, artificial intelligence models like DALL-E for image generation and ChatGPT for natural language have grown dramatically in capability. However, training the largest AI models demands massive amounts of computing power: by some estimates, thousands of GPUs running continuously for months. That level of computational demand puts training giant AI models out of reach for all but a few large tech companies and research labs.
In a new paper published on arXiv, researchers at Google DeepMind demonstrate that it is possible to study the training stability of these huge AI models without direct access to such enormous compute resources. By training small models and studying how they behave, the team gained insights that translate to truly gigantic models with billions of parameters.
The Challenges of Scaling AI Training
As AI models grow from millions to billions of parameters, the computational resources needed to train them grow in step. For example, OpenAI's GPT-3 model, with 175 billion parameters, required thousands of GPUs running for months to train, at a compute cost estimated in the millions of dollars.
Models also become less stable to train as they get bigger. Teams that have scaled up models report instabilities that emerge with size, such as spikes in the loss function or diverging output values, and some models become entirely unable to learn past a certain scale without careful tuning of hyperparameters.
Investigating these stability issues normally requires access to the same massive computational capabilities used to train the large models in the first place. That leaves research into techniques for stable scaling accessible to only a few large tech companies and labs, and progress relies on their willingness to publish findings.
Studying Small Models to Understand Large Ones
The DeepMind researchers realized they could reproduce some of the instabilities seen when training huge models by instead training small ones at very high learning rates. Although the models were small, pushing the learning rate to extremes made them exhibit behaviors similar to those of unstable gigantic models.
In particular, the team focused on two training instabilities reported in the literature on large models:
- Attention collapse - Where the model's attention concentrates on only a small subset of the available tokens rather than distributing across them. This happens when the attention logits grow so large that the softmax saturates, making attention effectively one-hot.
- Logit divergence - Where the model's pre-softmax output values drift away from zero and grow highly negative, causing training to become unstable and diverge.
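To see why growing attention logits lead to collapse, here is a toy numpy illustration (my own, not from the paper): as the logits are scaled up, the softmax attention distribution saturates toward one-hot and its entropy falls toward zero.

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; near 0 means attention is one-hot (collapsed)
    return float(-np.sum(p * np.log(p + 1e-12)))

logits = np.array([1.0, 0.5, -0.3, 0.2])  # toy attention logits over 4 tokens

for scale in [1, 10, 100]:
    attn = softmax(scale * logits)
    print(f"scale={scale:>3}  attn={attn.round(3)}  entropy={entropy(attn):.4f}")
```

At scale 1 the attention is spread across all four tokens; at scale 100 essentially all of the mass sits on one token, which is the small-model analogue of the collapse seen at large scale.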
Through their experiments, the researchers showed that techniques used to stabilize training for models with billions of parameters worked equally well to avoid attention collapse and logit divergence in small models.
For example, applying layer normalization to the attention queries and keys, a method called qk-layernorm, prevented attention collapse in small models. And adding a regularization term called z-loss to the loss function kept the output logits from diverging, enabling stable training.
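Both fixes are simple to express. Below is a minimal numpy sketch, my own illustration rather than the paper's code: qk-layernorm normalizes queries and keys before the attention dot product, which makes the attention logits invariant to the scale of the raw activations, and z-loss penalizes the log of the softmax normalizer so the output logits cannot drift without bound. The coefficient of 1e-4 is a commonly cited choice, not a prescription.

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # normalize over the last axis; learned gain/bias omitted for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_logits(q, k, qk_norm=True):
    # q, k: (tokens, head_dim). With qk-layernorm the logits are invariant
    # to the scale of q and k, so they cannot blow up during training.
    if qk_norm:
        q, k = layernorm(q), layernorm(k)
    return q @ k.T / np.sqrt(q.shape[-1])

def z_loss(logits, coeff=1e-4):
    # penalize log(Z)^2, where Z is the softmax normalizer over the
    # output logits; this pulls log(Z) toward 0 and stops logit drift
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())
    return coeff * log_z ** 2
```

Because layernorm is scale-invariant, `attention_logits(1000 * q, 1000 * k)` returns (almost) exactly the same values as `attention_logits(q, k)`, which is the property that prevents runaway attention logit growth.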
Learning Rate Sweeps
By plotting the final loss after training against the learning rate used, across different model sizes, the DeepMind team introduced a metric they term learning rate (LR) sensitivity: a measure of how much the final performance degrades as the learning rate moves away from its best value.
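As a rough sketch of the idea, with made-up numbers and a simplified version of the paper's definition, LR sensitivity can be computed from a sweep as the average gap between each run's final loss and the best loss in the sweep:

```python
import numpy as np

def lr_sensitivity(final_losses):
    # final_losses: final eval loss for each learning rate in the sweep.
    # Near 0 means the choice of learning rate barely mattered; large
    # values mean performance degrades sharply away from the optimal LR.
    losses = np.asarray(final_losses, dtype=float)
    return float(np.mean(losses - losses.min()))

# hypothetical sweep results (same LR grid) for two model sizes
small_model = [2.9, 2.8, 2.8, 3.0]   # flat loss curve: low sensitivity
large_model = [2.5, 2.4, 3.5, 7.0]   # sharp loss curve: high sensitivity

print(lr_sensitivity(small_model))   # small average gap from the best loss
print(lr_sensitivity(large_model))   # large gap: unstable at most LRs
```

As I understand it, the paper's actual metric also clips diverged losses at a baseline value before averaging; the sketch above omits that detail.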
Using the LR sensitivity as an evaluation metric, the researchers then studied the impact of various well-known training techniques for transformers:
- Longer warmup - Gradually increasing the learning rate at the start of training reduced LR sensitivity, with the benefit growing at larger model sizes. A standard warmup of just 5000 steps was insufficient.
- Independent LR and weight decay - Contrary to common practice, decoupling the weight decay hyperparameter from the learning rate improved stability across scales. This agrees with the original proposal of Loshchilov & Hutter, but most libraries couple the two by default.
- Scaling depth vs width - LR sensitivity grew much faster when increasing model depth than when increasing width, indicating that instability rises more rapidly for deeper transformer architectures.
- Tracking model characteristics - Monitoring measures such as attention logit growth during training could predict instabilities before they emerged.
- Default optimization hyperparameters - The default epsilon value in AdamW becomes too large relative to the shrinking gradient scale at bigger model sizes, making updates too small to train successfully.
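The last two optimizer points can be made concrete with a single AdamW update step. The sketch below is my own illustration, not the paper's code, and the hyperparameter values are illustrative only: the `decoupled` flag applies weight decay with its own constant (the Loshchilov & Hutter formulation) instead of scaling it by the learning rate as many libraries do, and the final lines show how the adaptive step collapses once the gradient scale approaches epsilon.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.1, decoupled=True):
    # weight decay first; decoupled uses its own constant, while the
    # coupled variant (common library default) scales decay by lr
    p = p * (1 - (wd if decoupled else lr * wd))
    # moment estimates with bias correction, then the adaptive update
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

# epsilon effect: with a steady gradient g, the adaptive update is roughly
# g / (|g| + eps); once |g| shrinks toward eps, the step size collapses
for g_scale in [1e-2, 1e-7, 1e-8]:
    print(g_scale, g_scale / (g_scale + 1e-8))
```

At a gradient scale of 1e-2 the normalized step is essentially full size, while at 1e-8 it is cut in half by epsilon alone, which is the mechanism behind updates becoming too small at scale.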
Benefits of Studying Small Models
This work demonstrates that it is possible to gain significant insight into the training dynamics of gigantic AI models without direct access to massive compute capabilities. Findings from small models can provide guidance for teams building the next generation of huge-parameter systems.
Being able to simulate unstable behaviors and experiment with techniques for improving stability in small settings allows more researchers to make progress on these problems. It opens up investigations that previously relied on access to thousands of GPUs.
And as model size continues to increase in pursuit of more capable AI, training instability will only become a bigger issue. So having techniques to study these challenges in resource-efficient ways will be crucial to continued progress in the field.
The Path Forward
Although small models can reflect the training dynamics of massive models, further verification is needed to confirm findings at even larger scales. However, the DeepMind team's work provides an initial set of insights that agree with observations from industry practitioners training gigantic systems.
By combining knowledge gained from investigating small models with theory and mathematics, researchers may be able to predict how instability evolves as AI models continue to grow. This can guide the design of new model architectures and training techniques tailored for stability.
With this increased understanding, development of the enormous AI systems of the future can avoid pitfalls and maximize the benefits these models may bring. The smaller-scale simulation approach opens up model training research to more minds and resources.
With continued clever investigation of small models, my only question is: when are we getting a 1000 trillion parameter model? Thanks for reading.