ByteDance: UI-TARS 7B (UI-TARS-1.5)

UI-TARS 7B (UI-TARS-1.5) is a multimodal vision-language agent from ByteDance, built for GUI environments: desktop interfaces, web browsers, mobile operating systems, and games. It extends the UI-TARS framework with reinforcement-learning-based reasoning for planning and executing actions across these interfaces.

The model is reported to achieve state-of-the-art results on interactive and grounding benchmarks, including OSWorld, WebVoyager, AndroidWorld, and ScreenSpot. It is also reported to reach perfect task completion on several Poki games and to significantly outperform prior models on Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference and scales well across model variants; the 1.5 release outperforms the earlier 72B and 7B checkpoints.

It supports vision input and streaming, with a context window of 128,000 tokens and a maximum output of 4,096 tokens. Pricing is $0.10 per 1M input tokens and $0.20 per 1M output tokens; the listing also advertises free access.
Specifications
| Specification | Value |
| --- | --- |
| Provider | bytedance |
| Context Window | 128,000 tokens |
| Max Output | 4,096 tokens |
| Minimum Plan | Economy |
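
The two token limits above can be turned into a simple pre-flight check. The sketch below is illustrative only: `plan_request` is a hypothetical helper, not part of any provider SDK, and it assumes that prompt and output tokens share the 128,000-token context window, which the listing does not state explicitly.

```python
# Limits taken from the specification table.
CONTEXT_WINDOW = 128_000  # total tokens per request (assumed shared by prompt and output)
MAX_OUTPUT = 4_096        # hard cap on generated tokens


def plan_request(prompt_tokens: int, requested_output: int = MAX_OUTPUT) -> int:
    """Return the largest output budget that respects both limits.

    Hypothetical helper: raises ValueError when the prompt leaves no room
    for even a single output token.
    """
    if prompt_tokens < 0 or requested_output < 1:
        raise ValueError("token counts must be non-negative, output at least 1")

    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining < 1:
        raise ValueError("prompt fills the context window; shorten the input")

    # Clamp to the model's output cap, then to whatever space is left.
    return min(requested_output, MAX_OUTPUT, remaining)
```

For example, a 127,000-token prompt would leave room for only 1,000 output tokens under this assumption, while a short prompt is capped at 4,096 regardless of what is requested.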
Pricing
| Item | Price |
| --- | --- |
| Input Price | $0.10 / 1M tokens |
| Output Price | $0.20 / 1M tokens |
💡 With a PRO subscription, cost is reduced by 20%.
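
To make the pricing concrete, here is a minimal cost estimator using the listed rates. `estimate_cost` is a hypothetical helper, and it assumes the 20% PRO reduction applies uniformly to both input and output charges; the listing does not spell out how the discount is applied.

```python
from decimal import Decimal

# Rates from the pricing table (USD per 1M tokens).
INPUT_PRICE_PER_M = Decimal("0.10")
OUTPUT_PRICE_PER_M = Decimal("0.20")
PRO_DISCOUNT = Decimal("0.20")  # assumed to apply to the whole bill
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, pro: bool = False) -> Decimal:
    """Estimate the USD cost of one request at the listed per-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")

    cost = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    ) / TOKENS_PER_M

    if pro:
        cost *= 1 - PRO_DISCOUNT
    return cost


# One million tokens in and one million out: 0.10 + 0.20 = 0.30 USD,
# or 0.24 USD with the assumed PRO discount.
print(estimate_cost(1_000_000, 1_000_000))
print(estimate_cost(1_000_000, 1_000_000, pro=True))
```

`Decimal` is used instead of `float` so that small per-request amounts accumulate without binary rounding error when summed over many calls.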