bytedance / UI-TARS
- Thursday, April 24, 2025, 00:00:02
🌐 Website | 🤗 Hugging Face Models
| 🔧 Deployment | 📑 Paper |
🖥️ UI-TARS-desktop
🏄 Midscene (Browser Automation) | 🫨 Discord
We also offer a UI-TARS-desktop version, which can operate on your local personal device. To use it, please visit https://github.com/bytedance/UI-TARS-desktop. To use UI-TARS in web automation, you may refer to the open-source project Midscene.js.
We introduce UI-TARS-1.5, an open-source multimodal agent built upon a powerful vision-language model. It is capable of effectively performing diverse tasks within virtual worlds.
Leveraging the foundational architecture introduced in our recent paper, UI-TARS-1.5 integrates advanced reasoning enabled by reinforcement learning. This allows the model to reason through its thoughts before taking action, significantly enhancing its performance and adaptability, particularly in inference-time scaling. Our new 1.5 version achieves state-of-the-art results across a variety of standard benchmarks, demonstrating strong reasoning capabilities and notable improvements over prior models.
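To make the inference loop concrete, the sketch below builds a single agent step as a chat-completion request for an OpenAI-compatible server (e.g. a vLLM deployment). This is a minimal illustration, not the official API: the served model name, system prompt wording, and request shape are assumptions based on common vision-language deployments.

```python
import base64

def build_step_request(instruction: str, screenshot_png: bytes,
                       model: str = "ui-tars-1.5-7b") -> dict:
    """Build one chat-completion request for an OpenAI-compatible server
    hosting a vision-language GUI agent.

    The model name and prompt wording are placeholders, not an official API.
    """
    image_b64 = base64.b64encode(screenshot_png).decode("ascii")
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are a GUI agent. Think step by step, then "
                        "output exactly one action."},
            {"role": "user",
             "content": [
                 # Current task instruction plus the latest screenshot,
                 # encoded inline as a data URL.
                 {"type": "text", "text": instruction},
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
             ]},
        ],
        # Deterministic decoding is typical for benchmark-style evaluation.
        "temperature": 0.0,
    }
```

In a full agent loop, the model's reply (a thought followed by an action) would be parsed, executed on the device, and a fresh screenshot fed back into the next request.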
Online Benchmark Evaluation
Benchmark type | Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
---|---|---|---|---|---|
Computer Use | OSWorld (100 steps) | 42.5 | 36.4 | 28 | 38.1 (200 steps) |
Computer Use | Windows Agent Arena (50 steps) | 42.1 | - | - | 29.8 |
Browser Use | WebVoyager | 84.8 | 87 | 84.1 | 87 |
Browser Use | Online-Mind2Web | 75.8 | 71 | 62.9 | 71 |
Phone Use | Android World | 64.2 | - | - | 59.5 |
Grounding Capability Evaluation
Benchmark | UI-TARS-1.5 | OpenAI CUA | Claude 3.7 | Previous SOTA |
---|---|---|---|---|
ScreenSpot-V2 | 94.2 | 87.9 | 87.6 | 91.6 |
ScreenSpotPro | 61.6 | 23.4 | 27.7 | 43.6 |
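Turning a grounding prediction into an executable click usually means parsing the model's textual action and mapping model-space coordinates onto the real screen. The sketch below assumes an action string of the form `click(start_box='(x,y)')` with coordinates on a 1000×1000 grid; the exact action grammar and coordinate convention here are illustrative assumptions, not the model's documented output format.

```python
import re

# Hypothetical action grammar: click(start_box='(x,y)')
ACTION_RE = re.compile(r"click\(start_box='\((\d+),\s*(\d+)\)'\)")

def parse_click(action: str, screen_w: int, screen_h: int,
                grid: int = 1000) -> tuple[int, int]:
    """Parse a click action and rescale grid coordinates to screen pixels.

    The action grammar and 1000x1000 grid size are assumptions for
    illustration; adapt them to the actual model output format.
    """
    m = ACTION_RE.search(action)
    if m is None:
        raise ValueError(f"unrecognized action: {action!r}")
    gx, gy = int(m.group(1)), int(m.group(2))
    # Scale from grid space to the target screen resolution.
    return round(gx * screen_w / grid), round(gy * screen_h / grid)
```

For example, `parse_click("click(start_box='(500,250)')", 1920, 1080)` maps the grid point (500, 250) to the pixel (960, 270) on a 1080p display.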
Poki Game
Model | 2048 | cubinko | energy | free-the-key | Gem-11 | hex-frvr | Infinity-Loop | Maze:Path-of-Light | shapes | snake-solver | wood-blocks-3d | yarn-untangle | laser-maze-puzzle | tiles-master |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OpenAI CUA | 31.04 | 0.00 | 32.80 | 0.00 | 46.27 | 92.25 | 23.08 | 35.00 | 52.18 | 42.86 | 2.02 | 44.56 | 80.00 | 78.27 |
Claude 3.7 | 43.05 | 0.00 | 41.60 | 0.00 | 0.00 | 30.76 | 2.31 | 82.00 | 6.26 | 42.86 | 0.00 | 13.77 | 28.00 | 52.18 |
UI-TARS-1.5 | 100.00 | 0.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
Minecraft
Task Type | Task Name | VPT | DreamerV3 | Previous SOTA | UI-TARS-1.5 w/o Thought | UI-TARS-1.5 w/ Thought |
---|---|---|---|---|---|---|
Mine Blocks | (oak_log) | 0.8 | 1.0 | 1.0 | 1.0 | 1.0 |
Mine Blocks | (obsidian) | 0.0 | 0.0 | 0.0 | 0.2 | 0.3 |
Mine Blocks | (white_bed) | 0.0 | 0.0 | 0.1 | 0.4 | 0.6 |
Mine Blocks | 200 Tasks Avg. | 0.06 | 0.03 | 0.32 | 0.35 | 0.42 |
Kill Mobs | (mooshroom) | 0.0 | 0.0 | 0.1 | 0.3 | 0.4 |
Kill Mobs | (zombie) | 0.4 | 0.1 | 0.6 | 0.7 | 0.9 |
Kill Mobs | (chicken) | 0.1 | 0.0 | 0.4 | 0.5 | 0.6 |
Kill Mobs | 100 Tasks Avg. | 0.04 | 0.03 | 0.18 | 0.25 | 0.31 |
Here we compare performance across different model scales of UI-TARS on the OSWorld benchmark.
Benchmark Type | Benchmark | UI-TARS-72B-DPO | UI-TARS-1.5-7B | UI-TARS-1.5 |
---|---|---|---|---|
Computer Use | OSWorld | 24.6 | 27.5 | 42.5 |
GUI Grounding | ScreenSpotPro | 38.1 | 49.6 | 61.6 |
While UI-TARS-1.5 represents a significant advancement in multimodal agent capabilities, we acknowledge that important limitations remain.
We are providing early research access to our top-performing UI-TARS-1.5 model to facilitate collaborative research. Interested researchers can contact us at TARS@bytedance.com.
Looking ahead, we envision UI-TARS evolving into increasingly sophisticated agentic experiences capable of performing real-world actions, thereby empowering platforms such as Doubao to accomplish more complex tasks for you :)
If you find our paper and model useful in your research, please consider citing our work:
@article{qin2025ui,
title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
journal={arXiv preprint arXiv:2501.12326},
year={2025}
}