Baidu ERNIE 4.5 VL 424B A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series. It has 424B total parameters, of which 47B are active per token, and is jointly trained on text and image data using a heterogeneous MoE architecture with modality-isolated routing. This design targets cross-modal reasoning, detailed image understanding, and long-context generation of up to 131,000 tokens. The model is fine-tuned with SFT, DPO, UPO, and RLVR, and supports both “thinking” and non-thinking inference modes. It is designed for complex vision-language tasks in English and Chinese, and can run under 4-bit or 8-bit quantization.

It has a context window of 123K tokens and a max output of 4K tokens. Pricing is $0.42 per 1M input tokens and $1.25 per 1M output tokens, and the model is available on the STARTER access tier. Key capabilities are vision and streaming, which suit it to analysis and document processing. This model does not support image generation.
✅ Best For
🚀 Capabilities
❌ Limitations
Specifications
| Field | Value |
| --- | --- |
| Provider | baidu |
| Context Window | 123,000 tokens |
| Max Output | 16,000 tokens |
| Minimum Plan | Balance |
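
As a rough illustration of how these limits interact, the sketch below estimates how many prompt tokens remain once output space is reserved. It assumes the context window is shared between input and output, which this page does not state explicitly; the function name `max_prompt_tokens` is hypothetical, and the defaults are the figures from the Specifications table.

```python
# Figures taken from the Specifications table.
CONTEXT_WINDOW = 123_000  # tokens
MAX_OUTPUT = 16_000       # tokens


def max_prompt_tokens(requested_output: int,
                      context_window: int = CONTEXT_WINDOW,
                      max_output: int = MAX_OUTPUT) -> int:
    """Return the prompt budget left after reserving space for the reply.

    Assumption (not confirmed by the source): input and output tokens
    share a single context window.
    """
    if requested_output < 0:
        raise ValueError("requested_output must be non-negative")
    # The reply can never exceed the model's max output limit.
    reserved = min(requested_output, max_output)
    return context_window - reserved


if __name__ == "__main__":
    print(max_prompt_tokens(4_000))
    print(max_prompt_tokens(16_000))
```

If the provider counts input and output against separate limits, the full context window would be available for the prompt and this calculation would not apply.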
Pricing
| Item | Price |
| --- | --- |
| Input Price | $0.4200 / 1M tokens |
| Output Price | $1.2500 / 1M tokens |
💡 With a PRO subscription, the cost is reduced by 20%.
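
To make the pricing concrete, here is a minimal sketch of a per-request cost estimate using the listed prices. It assumes the 20% PRO reduction applies equally to input and output tokens, which the page does not spell out; the function name `estimate_cost` is illustrative and not part of any provider API.

```python
from decimal import Decimal

# Listed prices in USD per 1M tokens.
INPUT_PRICE_PER_M = Decimal("0.42")
OUTPUT_PRICE_PER_M = Decimal("1.25")
# Assumption: the PRO discount applies uniformly to input and output.
PRO_DISCOUNT = Decimal("0.20")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, pro: bool = False) -> Decimal:
    """Estimate the USD cost of one request at the listed prices."""
    cost = (Decimal(input_tokens) * INPUT_PRICE_PER_M
            + Decimal(output_tokens) * OUTPUT_PRICE_PER_M) / TOKENS_PER_M
    if pro:
        cost *= Decimal(1) - PRO_DISCOUNT
    return cost


if __name__ == "__main__":
    # 100,000 input tokens and 4,000 output tokens.
    print(estimate_cost(100_000, 4_000))            # 0.047 USD
    print(estimate_cost(100_000, 4_000, pro=True))  # 0.0376 USD
```

`Decimal` is used instead of floats so that small per-token amounts add up without rounding drift.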