NVIDIA Nemotron Nano 12B 2 VL is a 12-billion-parameter open multimodal reasoning model built for video understanding and document intelligence. It uses a hybrid Transformer-Mamba architecture that combines the accuracy of Transformers with the memory-efficient sequence modeling of Mamba, which yields higher throughput and lower latency. The model takes text and multi-image documents as input and produces natural-language output. It was trained on high-quality, NVIDIA-curated synthetic datasets optimized for optical character recognition (OCR), chart reasoning, and multimodal comprehension.

Nemotron Nano 2 VL achieves leading results on OCRBench v2 and averages ≈ 74 across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME, outperforming prior open VL baselines. Efficient Video Sampling (EVS) lets it handle long-form videos at reduced inference cost.

Key specifications are a context window of 131K tokens and a maximum output of 4K tokens. Pricing is $0.20 per 1M input tokens and $0.60 per 1M output tokens. The model supports vision and streaming, which suits it to analysis and document processing. Open weights, training data, and fine-tuning recipes are available under a permissive NVIDIA open license, with deployment supported across NeMo, NIM, and major inference runtimes. Access this STARTER tier model on Multi AI today.
✅ Best For
🚀 Capabilities
❌ Limitations
Specifications
| Specification | Value |
| --- | --- |
| Provider | nvidia |
| Context Window | 131,072 tokens |
| Max Output | 4,096 tokens |
| Minimum Plan | Balance |
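
The two token limits above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of any official client: the function name `fits_context` is hypothetical, and it assumes that requested output tokens count against the same 131,072-token context window as the prompt, which this listing does not state.

```python
# Limits copied from the Specifications table.
CONTEXT_WINDOW = 131_072  # tokens
MAX_OUTPUT = 4_096        # tokens


def fits_context(prompt_tokens: int, requested_output: int = MAX_OUTPUT) -> bool:
    """Return True if a request stays within both listed limits.

    Assumption (not confirmed by the listing): prompt and output tokens
    share the same context window.
    """
    if prompt_tokens < 0 or requested_output < 0:
        raise ValueError("token counts must be non-negative")
    if requested_output > MAX_OUTPUT:
        return False
    return prompt_tokens + requested_output <= CONTEXT_WINDOW


if __name__ == "__main__":
    print(fits_context(100_000))         # True
    print(fits_context(130_000))         # False: 130,000 + 4,096 > 131,072
    print(fits_context(1_000, 8_000))    # False: exceeds the max output
```

Under that assumption, the largest prompt that still leaves room for a full 4,096-token response is 126,976 tokens.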
Pricing
| Item | Price |
| --- | --- |
| Input Price | $0.2000 / 1M tokens |
| Output Price | $0.6000 / 1M tokens |
💡 With a PRO subscription, cost is reduced by 20%.
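
To make the pricing concrete, here is a small cost estimator built from the listed rates. It is an illustrative sketch only: `estimate_cost` is a hypothetical helper, and it assumes the 20% PRO reduction applies uniformly to both input and output charges, which the listing does not spell out.

```python
from decimal import Decimal

# Rates copied from the Pricing table (USD per 1M tokens).
INPUT_PRICE_PER_M = Decimal("0.20")
OUTPUT_PRICE_PER_M = Decimal("0.60")
# Assumption: the PRO discount applies equally to input and output charges.
PRO_DISCOUNT = Decimal("0.20")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost(input_tokens: int, output_tokens: int, pro: bool = False) -> Decimal:
    """Estimate the USD cost of one request at the listed rates."""
    cost = (
        Decimal(input_tokens) * INPUT_PRICE_PER_M
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_M
    ) / TOKENS_PER_M
    if pro:
        cost *= Decimal(1) - PRO_DISCOUNT
    return cost


if __name__ == "__main__":
    # 100,000 input tokens plus a full 4,096-token response.
    print(estimate_cost(100_000, 4_096))            # 0.0224576
    print(estimate_cost(100_000, 4_096, pro=True))  # 0.01796608
```

`Decimal` is used instead of floats so that small per-token amounts add up without binary rounding error.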