JonasGeiping / cramming
Cramming the training of a (BERT-type) language model into limited compute.
This repository contains code to replicate our research described in "Cramming: Training a Language Model on a Single GPU in One Day". We experiment with pretraining a BERT-type language model with limited compute, wondering "how bad can it really be"?
You can find our paper here: https://arxiv.org/abs/2212.14034, and the abstract below:
Recent trends in language modeling have focused on increasing performance through scaling, and have resulted in an environment where training language models is out of reach for most researchers and practitioners. While most in the community are asking how to push the limits of extreme computation, we ask the opposite question:
How far can we get with a single GPU in just one day?
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.
Setup:
This codebase requires the following packages:
- torch
- transformers, tokenizers, datasets
- hydra-core
- deepspeed
- flash-attention
- psutil
- einops
- zstandard
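A typical environment could be set up roughly as follows. This is only a sketch: the PyPI package names are assumptions (flash-attention in particular is usually published as flash-attn), and the flash-attention build requires a matching CUDA toolchain.
pip install torch transformers tokenizers datasets hydra-core deepspeed psutil einops zstandard
pip install flash-attn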
The dataset deduplication code additionally requires Rust. Install Rust via
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
then clone the dev-v1 branch from https://github.com/google-research/deduplicate-text-datasets/tree/dev-v1 and run cargo install --target-dir ../cramming/dedup from within the cloned repository.
Use the pretrain.py script to pretrain with limited compute. This repository uses hydra (https://hydra.cc/docs/intro/), so all fields in cramming/config can be modified on the command line. For example, the budget can be changed by providing budget=48 as an additional argument, or the learning rate can be modified via train.optim.lr=1e-4. Check out the configuration folder to see all arguments.
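Both of these example overrides can also be combined into a single call; the run name below is a hypothetical placeholder:
python pretrain.py name=my_run budget=48 train.optim.lr=1e-4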
Your first step should be to verify the installed packages. To do so, you can run python pretrain.py dryrun=True, which runs the default sanity check for a single iteration. From there, you can enable additional functionality, for example by modifying the architecture (arch=bert-original) and the training setup (train=bert-original).
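For example, the same single-iteration sanity check can be run with the original BERT architecture and training recipe:
python pretrain.py dryrun=True arch=bert-original train=bert-original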
To really train a language model, you need to switch away from the sanity-check dataset to at least data=bookcorpus-wikipedia.

The data sources listed in data.sources will be read, normalized and pretokenized before training starts, and cached into a database. Subsequent calls with the same configuration will reuse this database of tokenized sequences. By default, a new tokenizer is also constructed and saved during this process. Important data options are data.max_entries_in_raw_dataset, which defines how much raw data will be loaded (for a large data source such as C4, only a subset of the raw data is downloaded), and max_seq_in_tokenized_dataset, which limits how many processed sequences are stored in the database. This number should be larger than the number of sequences expected to be read within the budget.
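As a sketch, both caps could be set on the command line like this; the numbers are placeholders, not recommendations, and it is assumed here that max_seq_in_tokenized_dataset also lives under the data config group:
python pretrain.py data=bookcorpus-wikipedia data.max_entries_in_raw_dataset=16000000 data.max_seq_in_tokenized_dataset=85000000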
Additional Notes:
- To run only the data preprocessing, use python pretrain.py data=... dryrun=True, which dry-runs the training but runs the full data preprocessing. Later runs can then re-use the cached data.
- The number of preprocessing threads can be adjusted via impl.threads. Especially the deduplication code does require substantial amounts of RAM.
- When starting out, use bookcorpus-wikipedia only, which preprocesses comparatively quickly, and only then look into the full processed and filtered C4.

Evaluation:
To evaluate pretrained models on GLUE (or some GLUE tasks), use eval.py. This script searches for saved models in the base directory. Given the name of a previous run, it will, by default, retrieve the latest checkpoint saved under this name and then run evaluations.
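For example, evaluating the latest checkpoint of a hypothetical previous run named bert on RTE only could look like this:
python eval.py eval=rte name=bert eval.checkpoint=latest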
You can log runs to your Weights & Biases account. To do so, simply set wandb.entity and wandb.project on the command line or in cramming/config/wandb/default.yaml.
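For example, a pretraining run could be logged like this; the entity and project values are placeholders for your own account details:
python pretrain.py name=bert data=bookcorpus-wikipedia wandb.entity=my-entity wandb.project=cramming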
To replicate the final recipe discussed in the paper, run
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed
to pretrain and
python eval.py eval=GLUE_sane name=amp_b4096_c5_o3_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True
to evaluate the model. The recipe called "crammed BERT" in the paper corresponds to the architecture called bert-c5 trained with training setup bert-o3 on data c4-subset-processed.
Pretraining: Single GPU:
python pretrain.py name=bert data=bookcorpus-wikipedia arch=bert-original train=bert-original
Multi-GPU:
torchrun --nproc_per_node=4 --standalone pretrain.py name=bert4gpu data=bookcorpus-wikipedia arch=bert-original train=bert-original
Eval a huggingface checkpoint:
python eval.py dryrun=True eval=rte name=bert-finetuning eval.checkpoint=hf://bert-base-uncased
Sanity check for distributed code on CPU:
torchrun --nproc_per_node=4 --standalone pretrain.py name=speedtest1 dryrun=True data=sanity-check-2 impl.backend=gloo
Additional examples for recipes can be found in the /scripts
folder.
The following options are currently broken/limited/work-in-progress. Use them at your own discretion, or open a pull request with a fix.
- jit.script fusion: should move toward the new torch.compile implementation at some point. The current inductor hook is also non-functional.

Please feel free to contact us with any questions, or open an issue on GitHub.