HITsz-TMG / UMOE-Scaling-Unified-Multimodal-LLMs
The code for "Uni-MoE: Scaling Unified Multimodal Models with Mixture of Experts".
Welcome to the repo of Uni-MoE!
Uni-MoE is an MoE-based unified multimodal model that can handle diverse modalities, including audio, speech, image, text, and video.
Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang
Usage and License Notices: The data and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA and Vicuna. The dataset and models trained on it should not be used outside of research purposes.
Demo 2 contains the real-time understanding of speech (starting at 30 s).
The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) use pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts with cross-modal data to ensure deep understanding and prepare for a cohesive multi-expert model; 3) incorporate the trained experts into the LLM and refine the unified multimodal model with the LoRA technique on mixed multimodal data.
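As a rough intuition for the third stage, the sketch below shows a minimal, generic mixture-of-experts layer in PyTorch: a learned router scores each token and dispatches it to its top-k expert feed-forward networks. This is an illustration only, not the repository's implementation, and all class and variable names here are made up.

```python
# Minimal, generic token-level MoE routing sketch (illustrative only;
# not the Uni-MoE implementation). Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts)  # learned router
        self.experts = nn.ModuleList(                      # expert feed-forward networks
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        probs = F.softmax(self.router(x), dim=-1)             # (B, T, E)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)  # (B, T, k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# quick shape check
layer = SimpleMoELayer(hidden_size=64)
print(layer(torch.randn(2, 8, 64)).shape)  # torch.Size([2, 8, 64])
```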
The following installation instructions are for Linux. We recommend the environment below.
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE
conda create -n unimoe python==3.9.16
conda activate unimoe
pip install -r env.txt
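Optionally, run a quick sanity check (a small sketch, not part of the repo) to confirm that PyTorch was installed correctly and can see your GPU:

```python
# Quick post-install sanity check (optional sketch).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU 0: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```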
To use our model, all weights should be downloaded.
After downloading all of them, organize the weights as follows in the 'Uni_MoE/checkpoint' folder:
└── checkpoint
    ├── Uni-MoE-audio-base
    ├── Uni-MoE-audio-e2
    ├── Uni-MoE-speech-base
    ├── Uni-MoE-speech-e2
    ├── Uni-MoE-speech-base-interval
    ├── Uni-MoE-speech-v1.5
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
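The helper below is a small sketch (not part of the repo) that verifies this layout, reporting any folders or files still missing under Uni_MoE/checkpoint:

```python
# Verify the expected checkpoint layout (illustrative helper, not part of the repo).
from pathlib import Path

EXPECTED = [
    "Uni-MoE-audio-base",
    "Uni-MoE-audio-e2",
    "Uni-MoE-speech-base",
    "Uni-MoE-speech-e2",
    "Uni-MoE-speech-base-interval",
    "Uni-MoE-speech-v1.5",
    "clip-vit-large-patch14-336",
    "whisper-small",
    "BEATs_iter3_plus_AS2M.pt",
]

root = Path("Uni_MoE/checkpoint")
missing = [name for name in EXPECTED if not (root / name).exists()]
print("All checkpoints found." if not missing else f"Missing: {missing}")
```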
Model | Checkpoint |
---|---|
vision encoder | CLIP ViT-L/14 336px |
speech encoder | whisper small |
audio encoder | Fine-tuned BEATs_iter3+ (AS2M) |
Uni-MoE-audio-base-model | Uni-MoE/Uni-MoE-audio-base |
Uni-MoE-audio-fine-tuned-checkpoint | Uni-MoE/Uni-MoE-audio-e2 |
Uni-MoE-speech-base-model | Uni-MoE/Uni-MoE-speech-base |
Uni-MoE-speech-fine-tuned-checkpoint | Uni-MoE/Uni-MoE-speech-e2 |
Uni-MoE-speech-base-interval | Uni-MoE/Uni-MoE-speech-base-interval |
Uni-MoE-speech-v1.5 | Uni-MoE/Uni-MoE-speech-v1.5 |
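If the Uni-MoE checkpoints in the table above are hosted on the Hugging Face Hub under the listed repo ids (an assumption; adjust the ids and target folders to match where you actually obtained the weights), recent versions of huggingface_hub can fetch them, for example:

```python
# Hedged download sketch: assumes the table's checkpoint names are
# Hugging Face Hub repo ids; adjust to your actual source.
from huggingface_hub import snapshot_download

for repo_id, local_name in [
    ("Uni-MoE/Uni-MoE-speech-base", "Uni-MoE-speech-base"),
    ("Uni-MoE/Uni-MoE-speech-e2", "Uni-MoE-speech-e2"),
]:
    snapshot_download(repo_id=repo_id, local_dir=f"Uni_MoE/checkpoint/{local_name}")
```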
We use TTS techniques to convert long text into speech when constructing long-speech understanding data.
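The README does not say which TTS system is used; purely as an illustration, the sketch below converts a text passage into a speech file with the offline pyttsx3 library as a stand-in:

```python
# Stand-in TTS example (pyttsx3 used only for illustration; the actual
# TTS system behind the long-speech data is not specified here).
import pyttsx3

engine = pyttsx3.init()
engine.save_to_file(
    "A long contextual paragraph to be converted into speech.",
    "long_speech_sample.wav",
)
engine.runAndWait()
```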
Overall, all training tasks (16 comparative experiments covering models with single-expert and MoE configurations) are as follows:
Training Tasks | Data Types | Data Size | Epochs | Trainable Modules | Pretraining tasks |
---|---|---|---|---|---|
Audio-Language Pretraining | WaveCaps*, Audiocap*, MELD, ClothoV1 | 194K | 2 | Audio Q-former, Audio projection layer | - |
Speech-Language Pretraining | Common Voice (Short Speech) | 1.7M | 2 | Speech Q-former, Speech projection layer | - |
Single-Modality-Expert-Task1 | LLaVA-Instruction-150K(I-A) | 150K | 1 | LoRA, Speech projection layer | Speech-pretrain-task |
Single-Modality-Expert-Task2 | LLaVA-Instruction-150K(T-I) | 150K | 1 | LoRA, Image projection layer | Speech-pretrain-task |
Single-Modality-Expert-Task3 | LLaVA-Instruction-150K(I-A) | 150K | 1 | LoRA, Speech Q-former, Speech and Image projection layer | Speech-pretrain-task |
Single-Modality-Expert-Task4 | LLaVA-Instruction-150K(I-A), RACE(T-A), LibriSpeech | 271K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
Single-Modality-Expert-Task5 | LLaVA-Instruction-150K(T-I), RACE(T-A), LibriSpeech | 271K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
Single-Modality-Expert-Task6 | LLaVA-Instruction-150K(I-A), LLaVA-Instruction-150K(T-I), RACE(T-A), LibriSpeech | 421K | 1 | LoRA, Speech & Image projection | Speech-pretrain-task |
Single-Modality-Expert-Task7 | RACE(T-A), LibriSpeech, RACE(T-A)-MC | 209K | 1 | LoRA, Speech projection layer | Speech-pretrain-task |
Single-Modality-Expert-Task8 | WaveCaps*, Audiocap*, MELD, ClothoAQA, ClothoV1 | 203K | 1 | LoRA, Audio projection layer | Audio-pretrain-task |
MoE-Task1 | LLaVA-Instruction-Dataset(T-I), LLaVA-Instruction-150K(I-A), RACE(T-A), LibriSpeech, RACE(T-A)-MC | 509K | 3 | LoRA, Router, speech & image projection layer | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
MoE-Task1-short-speech | LLaVA-Instruction-Dataset(T-I), LLaVA-Instruction-150K(I-A) | 300K | 3 | LoRA, Router, speech & image projection layer | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
MoE-Task2 | Video-Instruction-150K, LLaVA-Instruction-Dataset(T-I), RACE(T-A), LibriSpeech, RACE(T-A)-MC | 459K | 2 | LoRA, Router, speech & image projection layer | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/7 |
MoE-Task3 | Video-Instruction-150K, LLaVA-Instruction-Dataset(T-I), WaveCaps*, Audiocap*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, audio & image projection layer | LLaVA-v1.5-LoRA, Single-Modality-Expert-Tasks 2/3/8 |
Pure-MoE-Task1 | Video-Instruction-Dataset, LLaVA-Instruction-Dataset(T-I), WaveCaps*, Audiocap*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, audio & image projection layer | LLaVA-v1.5-LoRA |
Pure-MoE-Task2 | Video-Instruction-Dataset, LLaVA-Instruction-Dataset(T-I), WaveCaps*, Audiocap*, MELD, ClothoAQA, ClothoV1 | 453K | 2 | LoRA, Router, audio & image projection layer | - |
* indicates that we use only a subset of the dataset. MC denotes the multi-choice setting. I-A denotes image-audio pairs, in which the question is converted into the corresponding speech. T-I denotes the original text-image pairs. T-A indicates that the contextual paragraph of the RACE dataset is converted into long speech. Pretraining tasks lists the tasks included in the previous training stage.
Dataset | Input Type |
---|---|
AOKVQA | Text-Image |
OKVQA | Text-Image |
VQAv2 | Text-Image |
ClothoAQA | Text-Audio |
ClothoV1 | Text-Audio |
ClothoV2 | Text-Audio |
POPE | Text-Image |
TextVQA | Text-Image |
MM-Vet | Text-Image |
SEEDBench(Image) | Text-Image |
MMBench | Text-Image |
MMBench-Audio | Text-Image-Speech(Long) |
English-High-School-Listening | Text-Speech(Long) |
RACE | Text-Speech(Long) |
MSVD | Text-Video-Audio |
Activitynet-QA | Text-Video-Audio |
We build a real speech understanding dataset, English-High-School-Listening, to test practical long-speech recognition capabilities. It comprises 150 questions about long audio segments with an average length of 109 seconds, and 50 questions about short audio segments with an average length of 14 seconds.
Inference: run inference_audio.sh or inference_speech.sh using bash inference_audio.sh or bash inference_speech.sh, or run the following commands.

For audio:
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/inference_all.py

For speech:
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/inference_all.py
To launch the online demo (we highly recommend launching it with Uni-MoE-speech-v1.5, which requires the base parameters of Uni-MoE-speech-base-interval), run:
cd /path/to/Uni_MoE
conda activate unimoe
python demo/demo.py
python demo/app.py
Training: run finetune_audio.sh or finetune_speech.sh using bash finetune_audio.sh or bash finetune_speech.sh; remember to modify the training set to your own preference. Alternatively, run finetune_speech_dp.sh using bash finetune_speech_dp.sh, again modifying the training set as needed.

Evaluation: prepare your evaluation data in the same format as samples.json. Run eval_audio.sh or eval_speech.sh using bash eval_audio.sh or bash eval_speech.sh, or run the following commands.

For audio:
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_audio/eval.py \
    --data_path /path/to/clotho.json \
    --data_type clothov1 \
    --output test.json

For speech:
cd /path/to/Uni_MoE
conda activate unimoe
python Uni_MoE_speech/eval.py \
    --data_path /path/to/vqa_eval.json \
    --data_type vqa \
    --output test.json
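For convenience, the evaluation commands above can also be looped from Python; the sketch below is not part of the repo and uses placeholder data paths:

```python
# Loop the evaluation scripts over several dataset configurations
# (convenience sketch; paths are placeholders).
import subprocess

RUNS = [
    ("Uni_MoE_audio/eval.py", "/path/to/clotho.json", "clothov1", "clotho_out.json"),
    ("Uni_MoE_speech/eval.py", "/path/to/vqa_eval.json", "vqa", "vqa_out.json"),
]

for script, data_path, data_type, output in RUNS:
    subprocess.run(
        ["python", script,
         "--data_path", data_path,
         "--data_type", data_type,
         "--output", output],
        check=True,
    )
```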
We recommend GPUs with 80 GB of memory to run all experiments.
If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:
@article{li2024uni,
title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
journal={arXiv preprint arXiv:2405.11273},
year={2024}
}