Economy

ByteDance: UI-TARS 7B

by ByteDance

ByteDance: UI-TARS 7B (UI-TARS-1.5) is a multimodal vision-language agent built for GUI-based environments: desktop interfaces, web browsers, mobile operating systems, and games. Developed by ByteDance, it builds on the UI-TARS framework and adds reinforcement-learning-based reasoning for action planning and execution across different virtual interfaces.

The model reports state-of-the-art results on interactive and grounding benchmarks including OSWorld, WebVoyager, AndroidWorld, and ScreenSpot. It is also reported to complete every task across a range of Poki games and to significantly outperform prior models on Minecraft agent tasks. UI-TARS-1.5 supports thought decomposition during inference, scales well across model variants, and exceeds the performance of the earlier 72B and 7B checkpoints.

It supports vision input and streaming, with a context window of 128K tokens and a maximum output of 4K tokens. Pricing is $0.10 per 1M input tokens and $0.20 per 1M output tokens, and free access is available.

Tags: vision model, GUI agent, multimodal, automation, ByteDance
Quality: 57%
Context Window: 128K
Speed: 75%
Category: Economy
API access
Unified context
RAG + Knowledge Base
24/7 Support

Best For

analysis
documents

🚀 Capabilities

Vision
Streaming

Limitations

no image generation

Specifications

Provider: bytedance
Context Window: 128,000 tokens
Max Output: 4,096 tokens
Minimum Plan: Economy
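
The two token limits above can be turned into a quick budget check before sending a request. The sketch below is illustrative only: it assumes the 128,000-token context window is shared between the prompt and the reply, which the listing does not state, and the function names are hypothetical rather than part of any API.

```python
CONTEXT_WINDOW = 128_000  # tokens, from the specifications above
MAX_OUTPUT = 4_096        # tokens, from the specifications above


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT) -> int:
    """Largest prompt that still leaves room for the reserved reply.

    Assumption: prompt and reply share one context window.
    """
    if not 0 < reserved_output <= MAX_OUTPUT:
        raise ValueError(f"reserved_output must be between 1 and {MAX_OUTPUT}")
    return CONTEXT_WINDOW - reserved_output


def fits(prompt_tokens: int, reserved_output: int = MAX_OUTPUT) -> bool:
    """True if a prompt of this size fits alongside the reserved reply."""
    return 0 <= prompt_tokens <= max_prompt_tokens(reserved_output)
```

Under that assumption, reserving the full 4,096-token reply leaves 123,904 tokens for the prompt, including any screenshots once they are converted to tokens.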

Pricing

Input Price: $0.10 / 1M tokens
Output Price: $0.20 / 1M tokens

💡 With PRO subscription, cost is reduced by 20%
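
As a rough worked example of these prices, the sketch below estimates the cost of a single request. It assumes the 20% PRO reduction applies equally to input and output tokens, which the listing does not spell out, and `estimate_cost` is a hypothetical helper, not a platform function.

```python
INPUT_PRICE_PER_M = 0.10   # USD per 1M input tokens, from the pricing above
OUTPUT_PRICE_PER_M = 0.20  # USD per 1M output tokens, from the pricing above
PRO_DISCOUNT = 0.20        # 20% reduction with a PRO subscription


def estimate_cost(input_tokens: int, output_tokens: int, pro: bool = False) -> float:
    """Estimate the USD cost of one request.

    Assumption: the PRO discount applies uniformly to the whole bill.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    cost = (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000
    if pro:
        cost *= 1 - PRO_DISCOUNT
    return cost
```

For instance, one million input tokens plus one million output tokens comes to about $0.30 at list price, or about $0.24 with PRO under the assumption above.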
