NVIDIA Nemotron Nano 2 VL is an open 12-billion-parameter multimodal reasoning model built for video understanding and document intelligence. It uses a hybrid Transformer-Mamba architecture that pairs transformer attention with Mamba's memory-efficient sequence modeling, delivering higher throughput and lower latency than comparable transformer-only models. The model accepts text and multi-image documents and produces natural-language output. It was trained on high-quality, NVIDIA-curated synthetic datasets optimized for optical character recognition (OCR), chart reasoning, and general multimodal comprehension, and it achieves leading results on OCRBench v2 along with an average score of ≈74 across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME, outperforming prior open VL baselines. Efficient Video Sampling (EVS) lets it process long-form video while keeping inference costs low.

The model is free to use, with a 128K-token context window and a 4K-token maximum output. Its weights, training data, and fine-tuning recipes are released under a permissive NVIDIA open license, and deployment is supported across NeMo, NIM, and major inference runtimes. Try it for analysis and document processing on Multi AI.
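Models served through NIM and similar runtimes are commonly reached via an OpenAI-compatible chat-completions API. The sketch below only constructs such a request payload pairing a document image with an OCR prompt; the model ID, the `build_ocr_request` helper, and the payload shape are illustrative assumptions, not Multi AI's documented API.

```python
# Sketch: build an OpenAI-style multimodal chat-completions payload for an
# OCR question over one image. Assumes a base64 data-URL image format, which
# is the common convention for OpenAI-compatible endpoints.
import base64
import json


def build_ocr_request(image_bytes: bytes, question: str,
                      model: str = "nvidia/nemotron-nano-2-vl") -> dict:
    """Pair one image with a text prompt in a chat-completions payload.

    The model ID above is a placeholder; use the identifier your
    provider lists for Nemotron Nano 2 VL.
    """
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": 4096,  # matches the model's advertised 4K output cap
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_b64}"
                        },
                    },
                ],
            }
        ],
    }


# Dummy bytes stand in for a real PNG; the payload would then be POSTed
# to the runtime's /v1/chat/completions endpoint.
payload = build_ocr_request(b"\x89PNG...", "Transcribe all text in this document.")
print(json.dumps(payload)[:60])
```

Sending the same `messages` list with several `image_url` entries is how multi-image documents would be submitted, within the 128K-token context budget.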
✅ Best For
🚀 Capabilities
❌ Limitations
Specifications
| Specification | Value |
| --- | --- |
| Provider | nvidia |
| Context Window | 128,000 tokens |
| Max Output | 128,000 tokens |
| Minimum Plan | Economy |
Pricing
| Pricing | Rate |
| --- | --- |
| Input Price | Free / 1M tokens |
| Output Price | Free / 1M tokens |
💡 With a PRO subscription, costs are reduced by 20%