NVIDIA: Nemotron Nano 12B 2 VL

Name: NVIDIA: Nemotron Nano 12B 2 VL
Brand: nvidia
Price: 200 USD
Rating: 3.6 (1 review)

NVIDIA Nemotron Nano 12B 2 VL is a 12-billion-parameter open multimodal reasoning model built for video understanding and document intelligence. It uses a hybrid Transformer-Mamba architecture that combines the accuracy of Transformers with the memory-efficient sequence modeling of Mamba, which yields higher throughput and lower latency. The model accepts text and multi-image documents as input and generates natural-language output.

It was trained on high-quality, NVIDIA-curated synthetic datasets optimized for optical character recognition (OCR), chart reasoning, and multimodal comprehension. Nemotron Nano 2 VL achieves leading results on OCRBench v2 and averages ≈ 74 across MMMU, MathVista, AI2D, OCRBench, OCR-Reasoning, ChartQA, DocVQA, and Video-MME, outperforming prior open VL baselines. With Efficient Video Sampling (EVS), it handles long-form video while reducing inference cost.

Key specifications are a context window of 131K tokens and a maximum output of 4K tokens. Pricing is $0.20 per 1M input tokens and $0.60 per 1M output tokens. The model supports vision input and streaming, which suits it to analysis and document processing.

Open weights, training data, and fine-tuning recipes are available under a permissive NVIDIA open license, and deployment is supported across NeMo, NIM, and major inference runtimes. The model is available on Multi AI in the STARTER tier.
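To make the per-token pricing concrete, here is a minimal Python sketch that estimates the cost of a single request from the listed rates and checks it against the listed limits. The function name and constant names are hypothetical. Treating "131K" and "4K" as 131,000 and 4,000 tokens, and counting output tokens against the context window, are assumptions; the listing does not state exact values or how the window is shared.

```python
# Rates and limits copied from the listing; exact token counts are assumed.
INPUT_USD_PER_MILLION = 0.20    # $0.20 per 1M input tokens
OUTPUT_USD_PER_MILLION = 0.60   # $0.60 per 1M output tokens
CONTEXT_WINDOW_TOKENS = 131_000  # listed as "131K" (assumed exact value)
MAX_OUTPUT_TOKENS = 4_000        # listed as "4K" (assumed exact value)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("output exceeds the listed maximum output")
    # Assumption: input and output tokens share the context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError("request exceeds the listed context window")

    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # A 100,000-token document with a 2,000-token answer.
    print(f"${estimate_cost_usd(100_000, 2_000):.4f}")
```

Under these rates, a 100,000-token input with a 2,000-token output costs about $0.0212, with the input side dominating despite the lower per-token rate.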

Tags: multimodal, vision, document AI, video analysis, open source

Quality: 72%

Context Window: 131K

Speed: 70%