xinyu1205 / Recognize_Anything-Tag2Text
Code for the Recognize Anything Model and Tag2Text Model
Official PyTorch Implementation of the Recognize Anything Model (RAM) and the Tag2Text Model.
Recognition and localization are two foundational computer vision tasks.
When combined with localization models (Grounded-SAM), Tag2Text and RAM form a strong and general pipeline for visual semantic analysis.
- 2023/06/08: We release the Recognize Anything Model (RAM) Tag2Text web demo.
- 2023/06/07: We release the Recognize Anything Model (RAM), a strong image tagging model!
- 2023/06/05: Tag2Text is combined with Prompt-can-anything.
- 2023/05/20: Tag2Text is combined with VideoChat.
- 2023/04/20: We marry Tag2Text with Grounded-SAM.
- 2023/04/10: Code and checkpoints are now available!
- 2023/03/14: Tag2Text web demo.

| | name | backbone | Data | Illustration | Checkpoint |
|---|---|---|---|---|---|
| 1 | RAM-Swin | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Demo version can recognize any common category with high accuracy. | Download link |
| 2 | Tag2Text-Swin | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Demo version with comprehensive captions. | Download link |
```bash
pip install -r requirements.txt
```
Download RAM pretrained checkpoints.
Get the English and Chinese outputs of the images:
```bash
python inference_ram.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
```
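To tag several images at once, one simple option is to loop over the same CLI from Python. The sketch below is only an illustration under assumptions: the script name and flags (`--image`, `--pretrained`) are taken from the command above, while the `images/` folder, the checkpoint path, and the batching logic are example values; for real batch workloads, loading the model once (as `inference_ram.py` does internally) is faster than spawning a process per image.

```python
# Minimal, assumption-labeled sketch: batch tagging by re-invoking the
# inference_ram.py CLI documented above. Script name and flags come from this
# README; the images/ folder and checkpoint path are just example values.
import subprocess
from pathlib import Path

CHECKPOINT = "pretrained/ram_swin_large_14m.pth"

for image_path in sorted(Path("images").glob("*.jpg")):
    # One process per image; stdout contains the predicted English and Chinese tags.
    result = subprocess.run(
        ["python", "inference_ram.py",
         "--image", str(image_path),
         "--pretrained", CHECKPOINT],
        capture_output=True, text=True, check=True,
    )
    print(f"=== {image_path.name} ===")
    print(result.stdout.strip())
```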
RAM Zero-Shot Inference is Coming!
```bash
pip install -r requirements.txt
```
Download Tag2Text pretrained checkpoints.
Get the tagging and captioning results:
```bash
python inference_tag2text.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth
```

Or get the tagging and specified captioning results (optional):

```bash
python inference_tag2text.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
```
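To see how the caption changes with different user-specified tags, you can sweep `--specified-tags` over a few tag sets for the same image. This is again a hedged sketch: the script and flags are those shown above, while the tag combinations and the loop are illustrative assumptions, not part of the repository.

```python
# Hypothetical sweep over --specified-tags for a single image, reusing the
# inference_tag2text.py CLI shown above. The tag combinations are examples only.
import subprocess

IMAGE = "images/1641173_2291260800.jpg"
CHECKPOINT = "pretrained/tag2text_swin_14m.pth"

for tags in ["cloud,sky", "bridge,water", "person,boat"]:
    # Each run guides the caption with the given tags; omitting --specified-tags
    # falls back to the model's own predicted tags (see the first command above).
    result = subprocess.run(
        ["python", "inference_tag2text.py",
         "--image", IMAGE,
         "--pretrained", CHECKPOINT,
         "--specified-tags", tags],
        capture_output=True, text=True, check=True,
    )
    print(f"--- specified tags: {tags} ---")
    print(result.stdout.strip())
```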
If you find our work useful for your research, please consider citing:
```
@misc{zhang2023recognize,
      title={Recognize Anything: A Strong Image Tagging Model},
      author={Youcai Zhang and Xinyu Huang and Jinyu Ma and Zhaoyang Li and Zhaochuan Luo and Yanchun Xie and Yuzhuo Qin and Tong Luo and Yaqian Li and Shilong Liu and Yandong Guo and Lei Zhang},
      year={2023},
      eprint={2306.03514},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{huang2023tag2text,
      title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
      author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
      journal={arXiv preprint arXiv:2303.05657},
      year={2023}
}
```
This work was done with the help of the amazing code base of BLIP, thanks very much!
We also want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying Tag2Text with Grounded-SAM.