Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
¹ Chenyang Lyu, ² Bingshuai Liu, ³ Minghao Wu, ⁴ Zefeng Du,
⁵ Xinting Huang, ⁵ Zhaopeng Tu, ⁵ Shuming Shi, ⁵ Longyue Wang
¹ Dublin City University, ² Xiamen University, ³ Monash University, ⁴ University of Macau, ⁵ Tencent AI Lab
Macaw-LLM is an exploratory endeavor that pioneers multi-modal language modeling by seamlessly combining image, video, audio, and text data, built upon the foundations of CLIP, Whisper, and LLaMA.
In recent years, the field of language modeling has witnessed remarkable advancements. However, integrating multiple modalities, such as images, video, audio, and text, remains a challenging task. Macaw-LLM addresses this challenge by bringing together state-of-the-art models for processing visual, auditory, and textual information, namely CLIP, Whisper, and LLaMA.
Macaw-LLM's key features are a simple three-component architecture and an efficient multi-modal alignment strategy, both described below.
Macaw-LLM is composed of three main components: CLIP, which encodes images and video frames; Whisper, which encodes audio; and LLaMA, the underlying language model that handles textual input and generates responses.
The integration of these models allows Macaw-LLM to process and analyze multi-modal data effectively.
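To make the three-component design concrete, here is a minimal sketch of how the pretrained encoders and the language model could be loaded with Hugging Face transformers. It is not the repository's actual loading code: the CLIP and Whisper checkpoint names are illustrative choices, and the LLaMA weights path is a placeholder since those weights are distributed separately.

# Minimal sketch (not Macaw-LLM's actual code): load the three pretrained
# components via Hugging Face transformers.
from transformers import (
    CLIPVisionModel, CLIPImageProcessor,   # visual encoder for images / video frames
    WhisperModel, WhisperProcessor,        # audio encoder
    LlamaForCausalLM, LlamaTokenizer,      # language model for text and generation
)

clip_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
clip_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")

whisper_model = WhisperModel.from_pretrained("openai/whisper-base")
whisper_processor = WhisperProcessor.from_pretrained("openai/whisper-base")

# Placeholder path to locally converted LLaMA weights.
llama_model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama-7b-hf")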
Our novel alignment strategy enables faster adaptation by efficiently bridging multi-modal features to textual features. The process involves:
1. Encoding images, video frames, and audio with the CLIP and Whisper encoders.
2. Bridging the resulting multi-modal features into LLaMA's textual embedding space.
3. Feeding the aligned features, together with the text prompt, into LLaMA for instruction tuning and response generation.
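As an illustration of the bridging idea, the sketch below maps the encoder outputs into the language model's embedding space and prepends them to the text embeddings. It is a simplified stand-in for Macaw-LLM's actual alignment module (a plain linear projection is used here), and the feature dimensions are assumptions based on CLIP ViT-B/16, Whisper-base, and LLaMA-7B.

import torch
import torch.nn as nn

class MultiModalBridge(nn.Module):
    """Simplified bridge: project frozen CLIP / Whisper features into the
    LLM's token-embedding space so they can be prepended to text embeddings.
    This is illustrative only, not the repository's alignment module."""

    def __init__(self, clip_dim=768, whisper_dim=512, llm_dim=4096):
        super().__init__()
        self.image_proj = nn.Linear(clip_dim, llm_dim)
        self.audio_proj = nn.Linear(whisper_dim, llm_dim)

    def forward(self, clip_feats, whisper_feats, text_embeds):
        # clip_feats:    (batch, n_image_tokens, clip_dim)
        # whisper_feats: (batch, n_audio_tokens, whisper_dim)
        # text_embeds:   (batch, n_text_tokens, llm_dim)
        img = self.image_proj(clip_feats)
        aud = self.audio_proj(whisper_feats)
        # Prepend the aligned multi-modal tokens to the text embeddings.
        return torch.cat([img, aud, text_embeds], dim=1)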
To install Macaw-LLM, follow these steps:
# Clone the repository
git clone https://github.com/lyuchenyang/Macaw-LLM.git
# Change to the Macaw-LLM directory
cd Macaw-LLM
# Install required packages
pip install -r requirements.txt
# Install ffmpeg
yum install ffmpeg -y
# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
cd ..
Downloading the dataset:
Dataset preprocessing:
Place the data for each modality in data/text/, data/image/, and data/video/, then run:
python preprocess_data.py
python preprocess_data_supervised.py
python preprocess_data_unsupervised.py
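For intuition about what preprocessing has to produce before CLIP and Whisper can encode a clip, here is a hypothetical helper that extracts frames and 16 kHz mono audio from a video with ffmpeg. It is not the code in preprocess_data.py; the output layout, frame rate, and sampling rate are assumptions.

import subprocess
from pathlib import Path

def extract_frames_and_audio(video_path, out_dir, fps=1):
    """Hypothetical helper: split a video into image frames (for the visual
    encoder) and a mono 16 kHz audio track (for Whisper) using ffmpeg."""
    out = Path(out_dir)
    (out / "frames").mkdir(parents=True, exist_ok=True)
    # Sample one frame per second for the visual encoder.
    subprocess.run(
        ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
         str(out / "frames" / "frame_%04d.jpg")],
        check=True,
    )
    # Extract 16 kHz mono audio for the audio encoder.
    subprocess.run(
        ["ffmpeg", "-i", str(video_path), "-vn", "-ac", "1", "-ar", "16000",
         str(out / "audio.wav")],
        check=True,
    )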
Training:
./train.sh
Inference:
./inference.sh
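Conceptually, inference stitches the earlier pieces together: encode the image and audio, bridge the features into the embedding space, and let the language model generate. The sketch below shows that flow using the objects from the snippets above; it is not what inference.sh actually runs, and the generation settings are arbitrary.

import torch

@torch.no_grad()
def generate_response(prompt, clip_feats, whisper_feats, llama, tokenizer, bridge):
    """Illustrative single-example generation step (not the repository's API)."""
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llama.get_input_embeddings()(text_ids)
    # Prepend the bridged multi-modal features to the prompt embeddings.
    inputs_embeds = bridge(clip_feats, whisper_feats, text_embeds)
    out_ids = llama.generate(inputs_embeds=inputs_embeds, max_new_tokens=128)
    return tokenizer.decode(out_ids[0], skip_special_tokens=True)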
While our model is still in its early stages, we believe that Macaw-LLM paves the way for future research in the realm of multi-modal language modeling. The integration of diverse data modalities holds immense potential for pushing the boundaries of artificial intelligence and enhancing our understanding of complex real-world scenarios. By introducing Macaw-LLM, we hope to inspire further exploration and innovation in this exciting area of study.
We welcome contributions from the community to improve and expand Macaw-LLM's capabilities.
More Language Models: We aim to extend Macaw-LLM by incorporating additional language models such as Dolly, BLOOM, and T5. This will enable more robust and versatile processing and understanding of multi-modal data.
Multilingual Support: Our next step is to support multiple languages, moving towards true multi-modal and multilingual language models. We believe this will significantly broaden Macaw-LLM's applicability and enhance its understanding of diverse, global contexts.
@misc{Macaw-LLM,
  author       = {Chenyang Lyu and Bingshuai Liu and Minghao Wu and Zefeng Du and Longyue Wang},
  title        = {Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/lyuchenyang/Macaw-LLM}},
}