A general fine-tuning kit geared toward Stable Diffusion 2.1, Stable Diffusion 3, DeepFloyd, and SDXL.
SimpleTuner 💹
⚠️ Warning: The scripts in this repository have the potential to damage your training data. Always maintain backups before proceeding.
SimpleTuner is a repository dedicated to a set of experimental scripts designed for training optimization. The project is geared towards simplicity, with a focus on making the code easy to read and understand. This codebase serves as a shared academic exercise, and contributions are welcome.
Simplicity: Aiming to have good default settings for most use cases, so less tinkering is required.
Versatility: Designed to handle a wide range of image quantities - from small datasets to extensive collections.
Cutting-Edge Features: Only incorporates features that have proven efficacy, avoiding the addition of untested options.
Tutorial
Please fully explore this README before embarking on the tutorial, as it contains vital information that you might need to know first.
For a quick start without reading the full documentation, you can use the Quick Start guide.
For memory-constrained systems, see the DeepSpeed document which explains how to use 🤗Accelerate to configure Microsoft's DeepSpeed for optimiser state offload.
Features
Multi-GPU training
Image and caption features (embeds) are cached to the hard drive in advance, so that training runs faster and with less memory consumption
Aspect bucketing: support for a variety of image sizes and aspect ratios, enabling widescreen and portrait training.
Refiner LoRA or full u-net training for SDXL
Most models are trainable on a 24G GPU, or even down to 16G at lower base resolutions.
LoRA training for PixArt, SDXL, SD3, and SD 2.x that uses less than 16G VRAM
Quantised LoRA training, using low-precision base model or text encoder weights to reduce VRAM consumption while still allowing DreamBooth.
Optional EMA (Exponential moving average) weight network to counteract model overfitting and improve training stability. Note: This does not apply to LoRA.
Train directly from an S3-compatible storage provider, eliminating the requirement for expensive local storage. (Tested with Cloudflare R2 and Wasabi S3)
Kolors: an SDXL-based model with ChatGLM (General Language Model) 6B as its text encoder, doubling the hidden dimension size and substantially increasing the level of local detail included in the prompt embeds.
Kolors support is almost as deep as SDXL, minus ControlNet training support.
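To illustrate the aspect bucketing feature listed above, here is a minimal sketch of the general idea: images are grouped by rounded aspect ratio so that each batch can share one shape. The function names, the rounding precision, and the sample file names are hypothetical and chosen for illustration; this is not SimpleTuner's actual implementation.

```python
from collections import defaultdict


def bucket_key(width, height, precision=2):
    """Round the aspect ratio so near-identical shapes share one bucket."""
    return round(width / height, precision)


def build_buckets(image_sizes):
    """Group image names by rounded aspect ratio (width / height)."""
    buckets = defaultdict(list)
    for name, (width, height) in image_sizes.items():
        buckets[bucket_key(width, height)].append(name)
    return dict(buckets)


# Hypothetical dataset: file name -> (width, height) in pixels.
sizes = {
    "portrait.png": (768, 1024),
    "portrait_2.png": (1536, 2048),
    "wide.png": (1920, 1080),
    "square.png": (1024, 1024),
}
buckets = build_buckets(sizes)
for ratio, names in sorted(buckets.items()):
    print(ratio, names)
```

Both portrait images land in the same bucket despite their different resolutions, which is what lets widescreen and portrait images be trained without cropping everything to a square.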
Hardware Requirements
EMA (exponential moving average) weights are a memory-heavy affair, but provide fantastic results at the end of training. Options like --ema_cpu_only can improve this situation by loading EMA weights onto the CPU and then keeping them there.
Without EMA, more care must be taken (for example, through the use of regularisation data) not to change the model so drastically that it suffers "catastrophic forgetting".
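As a rough illustration of why EMA is memory-heavy, the sketch below keeps a second "shadow" copy of every weight and blends it toward the live weights after each step. It uses plain Python lists instead of tensors, and the decay value and the fake training step are assumptions made for the example, not SimpleTuner defaults.

```python
def ema_update(ema_weights, model_weights, decay=0.999):
    """In-place EMA step: ema = decay * ema + (1 - decay) * weight."""
    for i, weight in enumerate(model_weights):
        ema_weights[i] = decay * ema_weights[i] + (1.0 - decay) * weight


weights = [0.0, 1.0]
# The shadow copy is a full duplicate of the weights, hence the extra memory.
ema = list(weights)

for step in range(3):
    # Stand-in for an optimiser step that moves the live weights.
    weights = [w + 1.0 for w in weights]
    ema_update(ema, weights, decay=0.9)

print(weights)  # live weights move quickly
print(ema)      # EMA weights trail behind, smoothing out abrupt changes
```

Because the shadow copy is only read and written once per step, it can live in slower memory, which is the idea behind keeping EMA weights on the CPU.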
GPU vendors
NVIDIA - pretty much anything 3090 and up is a safe bet. YMMV.
AMD - SDXL LoRA and UNet are verified working on a 7900 XTX 24GB. Lacking xformers, it will likely use more memory than NVIDIA equivalents.
Apple - LoRA and full u-net tuning are tested to work on an M3 Max with 128G memory, taking about 12G of "Wired" memory and 4G of system memory for SDXL.
You likely need a 24G or greater machine for machine learning with M-series hardware due to the lack of memory-efficient attention.
Flux.1 [dev, schnell]
A100-40G (LoRA, rank-16 or lower)
A100-80G (LoRA, up to rank-256)
3x A100-80G (Full tuning, DeepSpeed ZeRO 1)
1x A100-80G (Full tuning, DeepSpeed ZeRO 3)
Flux prefers being trained with multiple GPUs.
SDXL, 1024px
A100-80G (EMA, large batches, LoRA @ insane batch sizes)
A6000-48G (EMA@768px, no EMA@1024px, LoRA @ high batch sizes)
A100-40G (no EMA@1024px, no EMA@768px, EMA@512px, LoRA @ high batch sizes)
Enable debug logs for a more detailed insight by adding export SIMPLETUNER_LOG_LEVEL=DEBUG to your environment file.
For performance analysis of the training loop, setting SIMPLETUNER_TRAINING_LOOP_LOG_LEVEL=DEBUG adds timestamps to the logs that highlight any issues in your configuration.
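For readers curious how an environment variable like this typically reaches Python's logging module, here is a minimal sketch. The helper resolve_log_level and the logger name are hypothetical, and the assignment to os.environ merely simulates the export line in an environment file; SimpleTuner's own handling may differ.

```python
import logging
import os


def resolve_log_level(var_name, default="INFO"):
    """Map an environment variable such as DEBUG or INFO to a logging level."""
    name = os.environ.get(var_name, default).upper()
    level = logging.getLevelName(name)
    # getLevelName returns a string for unknown names, so fall back to INFO.
    return level if isinstance(level, int) else logging.INFO


# Simulates `export SIMPLETUNER_LOG_LEVEL=DEBUG` for this example only.
os.environ["SIMPLETUNER_LOG_LEVEL"] = "DEBUG"

logger = logging.getLogger("example")
logger.setLevel(resolve_log_level("SIMPLETUNER_LOG_LEVEL"))
logger.debug("debug logging is enabled")
```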
For a comprehensive list of options available, consult this documentation.