DAMO-NLP-SG / Video-LLaMA
- Sunday, June 11, 2023, 00:00:02
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
This is the repo for the Video-LLaMA project, which aims to empower large language models with video and audio understanding capabilities.
The following checkpoints store only the learnable parameters (positional embedding layers, Video/Audio Q-Former, and linear projection layers).
Checkpoint | Link | Note |
---|---|---|
pretrain-vicuna7b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna7b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-vicuna13b | link | Pre-trained on WebVid (2.5M video-caption pairs) and LLaVA-CC3M (595k image-caption pairs) |
finetune-vicuna13b-v2 | link | Fine-tuned on the instruction-tuning data from MiniGPT-4, LLaVA and VideoChat |
pretrain-ziya13b-zh | link | Pre-trained with Chinese LLM Ziya-13B |
finetune-ziya13b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
pretrain-billa7b-zh | link | Pre-trained with Chinese LLM BiLLA-7B |
finetune-billa7b-zh | link | Fine-tuned on machine-translated VideoChat instruction-following dataset (in Chinese) |
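Since each checkpoint stores only the learnable parameters, it is meant to be merged over a freshly initialized (mostly frozen) model at load time. The sketch below illustrates that merge in pure Python; the parameter names are made-up placeholders, not the repo's actual state-dict keys:

```python
# Conceptual sketch: the released checkpoint holds only the trainable modules,
# so at load time it is merged over the initialized (frozen) weights.
# All parameter names below are illustrative placeholders.
def merge_partial_checkpoint(full_init: dict, partial_ckpt: dict) -> dict:
    """Overlay a partial checkpoint onto a full set of initialized weights."""
    unknown = set(partial_ckpt) - set(full_init)
    if unknown:
        raise KeyError(f"checkpoint has unknown parameters: {unknown}")
    return {**full_init, **partial_ckpt}

frozen = {
    "llama.layer0.weight": 0.0,       # frozen language decoder
    "visual_encoder.weight": 0.0,     # frozen visual encoder
    "video_qformer.weight": 0.0,      # learnable Video Q-Former
}
ckpt = {"video_qformer.weight": 1.0}  # checkpoint covers only the learnable part
merged = merge_partial_checkpoint(frozen, ckpt)
```

The frozen backbone weights pass through unchanged while the checkpoint overrides only the modules it actually contains.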
First, install ffmpeg:
```bash
apt update
apt install ffmpeg
```
Then, create a conda environment:
```bash
conda env create -f environment.yml
conda activate videollama
```
Before using the repository, make sure you have obtained the required checkpoints. If you only have the original LLaMA weights, recover the Vicuna weights by applying the released delta:
```bash
python apply_delta.py \
    --base /path/to/llama-13b \
    --target /output/path/to/vicuna-13b \
    --delta /path/to/vicuna-13b-delta
```
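Conceptually, the delta step reconstructs the Vicuna weights by adding the released delta to the original LLaMA weights, parameter by parameter. A toy sketch of the idea (scalars stand in for weight tensors; the real script operates on full model state dicts):

```python
def apply_delta(base: dict, delta: dict) -> dict:
    """Add delta weights to base weights, key by key (conceptual sketch)."""
    if base.keys() != delta.keys():
        raise KeyError("base and delta checkpoints must have matching parameters")
    return {name: base[name] + delta[name] for name in base}

# Toy stand-ins for the LLaMA weights and the released delta.
llama = {"layers.0.weight": 0.25, "layers.1.weight": -0.5}
delta = {"layers.0.weight": 0.75, "layers.1.weight": 0.5}
vicuna = apply_delta(llama, delta)  # reconstructed "Vicuna" weights
```

Releasing only the delta keeps the distribution compliant with the original LLaMA license while still letting users reconstruct the fine-tuned weights locally.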
Use git-lfs to download the learnable weights of our Video-LLaMA (i.e., positional embedding layer + Q-Former + linear projection layer):
```bash
git lfs install
git clone https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series
```
The above commands download the model weights of all the Video-LLaMA variants. You can also download only the weights you need. For example, to run Video-LLaMA with Vicuna-7B as the language decoder locally, downloading just these two files:
```bash
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune-vicuna7b-v2.pth
wget https://huggingface.co/DAMO-NLP-SG/Video-LLaMA-Series/resolve/main/finetune_vicuna7b_audiobranch.pth
```
is sufficient.
First, set `llama_model`, `imagebind_ckpt_path`, `ckpt`, and `ckpt_2` in `eval_configs/video_llama_eval_withaudio.yaml`.
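For reference, the relevant entries of the eval config might look like the following; every path here is a hypothetical placeholder for your local checkpoint locations, and the exact YAML structure may differ from the repo's actual config:

```yaml
model:
  llama_model: "/path/to/vicuna-7b"                     # language decoder weights
  imagebind_ckpt_path: "/path/to/imagebind"             # ImageBind checkpoint location
  ckpt: "/path/to/finetune-vicuna7b-v2.pth"             # vision-language branch weights
  ckpt_2: "/path/to/finetune_vicuna7b_audiobranch.pth"  # audio-language branch weights
```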
Then run the script:
```bash
python demo_audiovideo.py \
    --cfg-path eval_configs/video_llama_eval_withaudio.yaml --gpu-id 0
```
The training of Video-LLaMA consists of two stages:
1. Pre-training on the WebVid-2.5M video-caption dataset and the LLaVA-CC3M image-caption dataset.
2. Fine-tuning on the instruction-tuning data from MiniGPT-4, LLaVA, and VideoChat.
Download the metadata and videos following the instructions in the official WebVid GitHub repo. The folder structure of the dataset is shown below:
```
|webvid_train_data
|──filter_annotation
|────0.tsv
|──videos
|────000001_000050
|──────1066674784.mp4

|cc3m
|──filter_cap.json
|──image
|────GCC_train_000000000.jpg
|────...
```
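Before launching pre-training, it can save time to sanity-check that the downloaded data actually matches the layout above. A small helper sketch; the dataset root path is a placeholder for wherever you stored the data:

```python
# Sanity-check the expected WebVid/CC3M layout before launching pre-training.
from pathlib import Path

# Top-level entries implied by the folder structure above.
EXPECTED = [
    "webvid_train_data/filter_annotation",
    "webvid_train_data/videos",
    "cc3m/filter_cap.json",
    "cc3m/image",
]

def check_layout(root: str) -> list:
    """Return the expected entries that are missing under the dataset root."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]

missing = check_layout("/data")  # hypothetical dataset root
if missing:
    print("missing dataset entries:", missing)
```

This only checks the directory skeleton, not the individual video/image files, but it catches the most common path-configuration mistakes before an 8-GPU run fails at data-loading time.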
Set the checkpoint and dataset paths in `video_llama_stage1_pretrain.yaml`, then run the script:
```bash
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage1_pretrain.yaml
```
For now, the fine-tuning dataset consists of the instruction-tuning data from MiniGPT-4, LLaVA, and VideoChat.
Set the checkpoint and dataset paths in `video_llama_stage2_finetune.yaml`, then run the script:
```bash
conda activate videollama
torchrun --nproc_per_node=8 train.py --cfg-path ./train_configs/video_llama_stage2_finetune.yaml
```
We are grateful to the awesome projects that Video-LLaMA builds on, including MiniGPT-4, LLaVA, VideoChat, ImageBind, LLaMA, and Vicuna.
The logo of Video-LLaMA is generated by Midjourney.
If you find our project useful, please star our repo and cite our paper as follows:
```bibtex
@article{damonlpsg2023videollama,
  author  = {Zhang, Hang and Li, Xin and Bing, Lidong},
  title   = {Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding},
  year    = 2023,
  journal = {arXiv preprint arXiv:2306.02858},
  url     = {https://arxiv.org/abs/2306.02858}
}
```