VideoCrafter
A Toolkit for Text-to-Video Generation and Editing
It currently includes the following THREE types of models:
We provide a base text-to-video (T2V) generation model based on the latent video diffusion models (LVDM). It can synthesize realistic videos based on the input text descriptions.
"Campfire at night in a snowy forest with starry sky in the background." | "Cars running on the highway at night." | "close up of a clown fish swimming. 4K" | "astronaut riding a horse" |
![]() |
![]() |
![]() |
![]() |
Based on the pretrained LVDM, we can create our own video generation models by finetuning it on a set of video clips or images describing a certain concept.
We adopt LoRA to implement the finetuning as it is easy to train and requires fewer computational resources.
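For intuition, here is a minimal, hedged sketch of the LoRA idea applied to a single linear layer: the pretrained weight stays frozen and only a low-rank residual is trained. It illustrates the technique in general, not the repository's actual implementation, and the class name is hypothetical.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: output = base(x) + scale * up(down(x)).
    Only the low-rank 'down' and 'up' projections are trainable."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # keep the pretrained weight frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # r -> d_out
        nn.init.zeros_(self.up.weight)                 # start as a no-op update
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Because the trainable parameters are only the two small projection matrices, finetuning needs far less GPU memory than updating the full model.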
Below are generation results from our four VideoLoRA models that are trained on four different styles of video clips.
By providing a sentence describing the video content along with a LoRA trigger word (specified during LoRA training), it can generate videos in the desired style (or with the desired subject/concept).
Results of inputting "A monkey is playing a piano, ${trigger_word}" to the four VideoLoRA models:

"Loving Vincent style" | "frozenmovie style" | "MakotoShinkaiYourName style" | "coco style" |
(Generated videos in the four styles; videos omitted.)
To enhance the controllability of the T2V model, we developed a conditional adapter inspired by T2I-Adapter. By plugging a lightweight adapter module into the T2V model, we can obtain generation results guided by more detailed control signals such as depth.
Input text: "Ironman is fighting against the enemy, big fire in the background, photorealistic, 4k"
(Depth-conditioned generation results for the input text above; images omitted.)
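The adapter can be pictured as a small convolutional encoder that turns the control signal (e.g., a depth map) into multi-scale features added to the T2V model's intermediate activations. The sketch below is a hypothetical illustration under that assumption; the module name, channel sizes, and wiring are not taken from the repository.

```python
import torch
import torch.nn as nn

class DepthAdapter(nn.Module):
    """Hypothetical sketch: encode a single-channel depth map into multi-scale
    residual features that could be added to a video diffusion UNet's activations."""
    def __init__(self, channels=(320, 640, 1280)):
        super().__init__()
        blocks, in_ch = [], 1                      # depth map has one channel
        for out_ch in channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),  # downsample per level
                nn.SiLU(),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
            ))
            in_ch = out_ch
        self.blocks = nn.ModuleList(blocks)

    def forward(self, depth):                      # depth: (B, 1, H, W)
        feats, x = [], depth
        for block in self.blocks:
            x = block(x)
            feats.append(x)                        # one residual per resolution
        return feats
```

In the T2I-Adapter style, only the adapter is trained while the base model stays frozen, which is what keeps the module lightweight.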
Install the environment via Anaconda:
conda create -n lvdm python=3.8.5
conda activate lvdm
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install pytorch-lightning==1.8.3 omegaconf==2.1.1 einops==0.3.0 transformers==4.25.1
pip install opencv-python==4.1.2.30 imageio==2.9.0 imageio-ffmpeg==0.4.2
pip install av moviepy
pip install -e .
Alternatively, install the environment with xformers enabled:
conda create -n lvdm python=3.8.5
conda activate lvdm
pip install -r requirements_xformer.txt
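After either install path, a quick sanity check can confirm that the pinned packages import correctly and that PyTorch sees the GPU (a minimal, optional snippet, not part of the repository):

```python
# Optional environment sanity check for the lvdm conda environment.
import torch
import pytorch_lightning
import transformers

print("torch:", torch.__version__)                    # 1.12.1+cu113 if installed via the pip commands above
print("CUDA available:", torch.cuda.is_available())   # should be True on a GPU machine
print("pytorch-lightning:", pytorch_lightning.__version__)
print("transformers:", transformers.__version__)
```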
Download the pretrained T2V model via Google Drive / Hugging Face, and put the model.ckpt in models/base_t2v/model.ckpt.
Input the following commands in the terminal:
PROMPT="astronaut riding a horse"
OUTDIR="results/"
BASE_PATH="models/base_t2v/model.ckpt"
CONFIG_PATH="models/base_t2v/model_config.yaml"
python scripts/sample_text2video.py \
--ckpt_path $BASE_PATH \
--config_path $CONFIG_PATH \
--prompt "$PROMPT" \
--save_dir $OUTDIR \
--n_samples 1 \
--batch_size 1 \
--seed 1000 \
--show_denoising_progress
- gpu_id: specify the GPU index you want to use
- ddp: better to enable it if you have multiple GPUs; see sample_text2video_multiGPU.sh for multi-GPU sampling
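To sample several prompts in a row, a small wrapper can invoke the same sampling script repeatedly. This is a hypothetical convenience sketch that simply reuses the arguments shown above; it adds nothing beyond looping over prompts.

```python
import subprocess

# Hypothetical convenience loop: run the bundled sampling script once per prompt,
# reusing the same checkpoint, config, and sampling arguments as the command above.
prompts = [
    "astronaut riding a horse",
    "Campfire at night in a snowy forest with starry sky in the background.",
]
for prompt in prompts:
    subprocess.run(
        [
            "python", "scripts/sample_text2video.py",
            "--ckpt_path", "models/base_t2v/model.ckpt",
            "--config_path", "models/base_t2v/model_config.yaml",
            "--prompt", prompt,
            "--save_dir", "results/",
            "--n_samples", "1",
            "--batch_size", "1",
            "--seed", "1000",
        ],
        check=True,
    )
```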
Same with 1-1: Download pretrained T2V models via Google Drive / Hugging Face, and put the model.ckpt in models/base_t2v/model.ckpt.
Download pretrained VideoLoRA models via this Google Drive / Hugging Face (you can select one VideoLoRA model), and put it in models/videolora/${model_name}.ckpt.
Input the following commands in the terminal; it will start running on GPU 0.
PROMPT="astronaut riding a horse"
OUTDIR="results/videolora"
BASE_PATH="models/base_t2v/model.ckpt"
CONFIG_PATH="models/base_t2v/model_config.yaml"
LORA_PATH="models/videolora/lora_001_Loving_Vincent_style.ckpt"
TAG=", Loving Vincent style"
python scripts/sample_text2video.py \
--ckpt_path $BASE_PATH \
--config_path $CONFIG_PATH \
--prompt "$PROMPT" \
--save_dir $OUTDIR \
--n_samples 1 \
--batch_size 1 \
--seed 1000 \
--show_denoising_progress \
--inject_lora \
--lora_path $LORA_PATH \
--lora_trigger_word "$TAG" \
--lora_scale 1.0
LORA_PATH="models/videolora/lora_001_Loving_Vincent_style.ckpt"
TAG=", Loving Vincent style"
LORA_PATH="models/videolora/lora_002_frozenmovie_style.ckpt"
TAG=", frozenmovie style"
LORA_PATH="models/videolora/lora_003_MakotoShinkaiYourName_style.ckpt"
TAG=", MakotoShinkaiYourName style"
LORA_PATH="models/videolora/lora_004_coco_style.ckpt"
TAG=", coco style"
If you find the LoRA effect is too strong or too weak, you can adjust the lora_scale argument to control its strength: lora_scale=0 uses the original base model, while lora_scale=1 uses the full LoRA weights. It can also be set slightly larger than 1 to emphasize the LoRA effect even more.
scale=0.0 | scale=0.25 | scale=0.5 | scale=0.75 | scale=1.0 | scale=1.5 |
(Generation results at increasing lora_scale values; videos omitted.)
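Conceptually, lora_scale linearly blends the frozen base weight with the low-rank LoRA update, which is why 0 recovers the base model and 1 applies the full LoRA. The function below is an illustrative sketch of that blending, not the repository's merging code:

```python
import torch

def merge_lora_weight(w_base: torch.Tensor,
                      lora_up: torch.Tensor,
                      lora_down: torch.Tensor,
                      lora_scale: float) -> torch.Tensor:
    """Illustrative only: effective weight = base weight + lora_scale * (up @ down).
    lora_scale=0 keeps the base model; lora_scale=1 applies the full LoRA update."""
    delta = lora_up @ lora_down          # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return w_base + lora_scale * delta
```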
You can also launch a local Gradio demo:
python gradio_app.py
"A man playing a saxophone with musical notes flying out." | "Flying through an intense battle between pirate ships in a stormy ocean" | "Horse drinking water." | "Woman in sunset." |
![]() |
![]() |
![]() |
![]() |
"Humans building a highway on mars, highly detailed" | "A blue unicorn flying over a mystical land" | "Robot dancing in times square" | "A 3D model of an elephant origami. Studio lighting." |
![]() |
![]() |
![]() |
![]() |
If you have any comments or questions, feel free to contact Yingqing He, Haoxin Chen, or Menghan Xia.
We developed this repository for RESEARCH purposes, so it can only be used for personal, research, or other non-commercial purposes.