Alpha-VLLM / Lumina-T2X
Lumina-T2X is a unified framework for Text to Any Modality Generation
[Lumina-T2X arXiv] [Video Introduction of Lumina-T2X] [Join our WeChat]
[Lumina-T2I 5B Checkpoints] [Lumina-Next-T2I 2B Checkpoints (recommended)]
[GUI Demo for Lumina-T2I 5B model (node1)]
[GUI Demo for Lumina-Next-T2I 2B model (node2)] [GUI Demo for Lumina-Next-T2I 2B model (node3)]
Lumina-Next-T2I: model (checkpoint) which uses a 2B Next-DiT model as the backbone and Gemma-2B as the text encoder. Try it out at demo1 & demo2.
Lumina-T2A: Text-to-Audio demos and examples.
Lumina-T2I: text-to-image generation.
Warning
Since we are updating the code frequently, please pull the latest code:
git pull origin main
To get you started with our model quickly, we have built several versions of the GUI demo site:
[node1]
For more details about training and inference, please refer to Lumina-T2I and Lumina-Next-T2I.
Warning
Lumina-T2X employs FSDP for training large diffusion models. FSDP shards parameters, optimizer states, and gradients across GPUs, so at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B model. Parameter-efficient fine-tuning of Lumina-T2X will be released soon.
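For reference, FSDP here is PyTorch's FullyShardedDataParallel. The sketch below only illustrates the sharding idea and is not this repository's training script; the toy model, learning rate, and launch command are placeholders.

```python
# Minimal FSDP sketch (illustrative only, not Lumina-T2X's training code).
# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=8 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")                              # one process per GPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512).cuda()             # placeholder for the DiT backbone

# FULL_SHARD splits parameters, gradients, and optimizer states across all ranks,
# which is why full fine-tuning of the 5B model needs at least 8 GPUs.
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# Create the optimizer after wrapping so its states are sharded as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```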
Installation in your environment:
pip install git+https://github.com/Alpha-VLLM/Lumina-T2X
Features:
We introduce the [nextline] and [nextframe] tokens, with which our model can support resolution extrapolation, i.e., generating images/videos with out-of-domain resolutions not encountered during training, such as images from 768x768 up to 1792x1792 pixels.
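To make the mechanism concrete, here is a minimal sketch of how such separator tokens can turn a 2D patch grid (or a stack of video frames) into one 1D sequence. The token ids and helper functions are hypothetical and only illustrate the idea; they are not the repository's implementation.

```python
# Illustrative sketch: [nextline]/[nextframe] separators let a single 1D token
# sequence describe a patch grid of any resolution. Token ids are hypothetical.
NEXTLINE_ID = 32000   # hypothetical id for the [nextline] token
NEXTFRAME_ID = 32001  # hypothetical id for the [nextframe] token (video frames)

def flatten_patch_grid(grid):
    """Flatten a 2D grid of patch-token ids row by row, appending [nextline]
    after each row so the sequence itself encodes the image width."""
    seq = []
    for row in grid:
        seq.extend(row)
        seq.append(NEXTLINE_ID)
    return seq

def flatten_video(frames):
    """Concatenate per-frame sequences, separated by [nextframe] tokens."""
    seq = []
    for frame in frames:
        seq.extend(flatten_patch_grid(frame))
        seq.append(NEXTFRAME_ID)
    return seq

# A 2x3 grid (low resolution) and a larger grid produce sequences in the same
# format, which is what allows resolution extrapolation at inference time.
small = [[1, 2, 3], [4, 5, 6]]
print(flatten_patch_grid(small))  # -> [1, 2, 3, 32000, 4, 5, 6, 32000]
```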
720P Videos:
Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake.
Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
360P Videos:
Note
Hover over the play bar and click the audio button on the play bar to unmute it.
Prompt: Semiautomatic gunfire occurs with slight echo
Generated Audio:
Groundtruth:
Prompt: A telephone bell rings
Generated Audio:
Groundtruth:
Prompt: An engine running followed by the engine revving and tires screeching
Generated Audio:
Groundtruth:
Prompt: Birds chirping with insects buzzing and outdoor ambiance
Generated Audio:
Groundtruth:
Prompt: An electrifying ska tune with prominent saxophone riffs, energetic e-guitar and acoustic drums, lively percussion, soulful keys, groovy e-bass, and a fast tempo that exudes uplifting energy.
Generated Music:
Prompt: A high-energy synth rock/pop song with fast-paced acoustic drums, a triumphant brass/string section, and a thrilling synth lead sound that creates an adventurous atmosphere.
Generated Music:
Prompt: An uptempo electronic pop song that incorporates digital drums, digital bass and synthpad sounds.
Generated Music:
Prompt: A medium-tempo digital keyboard song with a jazzy backing track featuring digital drums, piano, e-bass, trumpet, and acoustic guitar.
Generated Music:
Prompt: This low-quality folk song features groovy wooden percussion, bass, piano, and flute melodies, as well as sustained strings and shimmering shakers that create a passionate, happy, and joyful atmosphere.
Generated Music:
We present three multilingual capabilities of Lumina-Next-2B.
Generating Images conditioned on Chinese poems:
Generating Images with multilingual prompts:
Generating Images with emojis:
We support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.
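As a pointer for readers unfamiliar with it, 1D-RoPE (rotary position embedding) rotates each pair of feature channels by an angle proportional to the token's position. The snippet below is a generic sketch of that standard formulation, not the exact code used in this repository.

```python
# Generic 1D rotary position embedding (RoPE) sketch, not the repo's exact code.
import torch

def rope_1d(x, base=10000.0):
    """Apply 1D RoPE to x of shape (seq_len, dim); dim must be even.
    Channel pair (2i, 2i+1) at position p is rotated by p * base**(-i / (dim/2))."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]          # even / odd channels form rotation pairs
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(16, 64)    # 16 query tokens with 64-dim heads
q_rot = rope_1d(q)         # positions are now encoded as channel-pair rotations
```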
@article{gao2024luminat2x,
title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers},
author={Peng Gao and Le Zhuo and Ziyi Lin and Chris Liu and Junsong Chen and Ruoyi Du and Enze Xie and Xu Luo and Longtian Qiu and Yuhang Zhang and Chen Lin and Rongjie Huang and Shijie Geng and Renrui Zhang and Junlin Xi and Wenqi Shao and Zhengkai Jiang and Tianshuo Yang and Weicai Ye and He Tong and Jingwen He and Yu Qiao and Hongsheng Li},
journal={arXiv preprint arXiv:2405.05945},
year={2024}
}