Nemotron-3-8B-Base-4K

NVIDIA's enterprise-ready 8B parameter foundation model for custom LLM development

Latest Research & Updates

NVIDIA Releases Nemotron 3 Nano: A New 30B Hybrid Reasoning Model

NVIDIA has released Nemotron-3-Nano-30B, a groundbreaking hybrid Mixture-of-Experts model with best-in-class performance on reasoning and chat tasks. It features 30B total parameters, only 3.5B active per token, and a 1M-token context window.

NVIDIA Research
Dec 15, 2024
Understanding the Hybrid Mamba-MoE Architecture

A deep dive into the innovative architecture behind Nemotron-3-Nano-30B. The model combines 23 Mamba-2 + MoE layers with 6 Attention layers, using 128 experts per layer with only 6 activated per token for exceptional efficiency.

AI Architecture Team
Dec 16, 2024
Running Nemotron-3-Nano-30B Locally: A Community Guide

Community insights on deploying Nemotron-3-Nano-30B on consumer hardware. Users report best results with dual NVIDIA GPUs using RPC to avoid CPU offloading. Optimal settings: temp=0.6, top-p=0.95, top-k=20.
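Those sampling settings can be sketched as a standalone decode-time filter. This is a minimal NumPy illustration of standard temperature/top-k/top-p filtering with the recommended values as defaults, not Nemotron's actual inference code:

```python
import numpy as np

def sample_filtered(logits, temperature=0.6, top_p=0.95, top_k=20, rng=None):
    """Temperature scaling, then top-k and nucleus (top-p) filtering, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: discard everything below the k-th highest logit.
    if top_k < len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over the surviving tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest high-probability set whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]

    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    final /= final.sum()
    return int(rng.choice(len(probs), p=final))
```

Note the interplay: top-k prunes by rank first, then top-p prunes by cumulative probability, so the effective candidate set is whichever filter is stricter.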

Community Contributors
Dec 17, 2024
Nemotron Architecture

What is the Nemotron Model Family?

Nemotron is NVIDIA's family of enterprise-ready language models, spanning the Nemotron-3-8B series and the Nemotron-3-Nano-30B.

The 8B series offers efficient models with a 4K context window, well suited to most enterprise applications. The Nano 30B represents a leap forward with a hybrid Mamba-MoE architecture: 30B total parameters with only 3.5B active per token, achieving 60B+ quality at a fraction of the cost.

Key innovation: the Nano 30B features a 1 million token context window, best-in-class reasoning performance on SWE-Bench, and support for English, German, Spanish, French, Italian, and Japanese, plus coding languages.
Hybrid Mamba-MoE architecture
Up to 1M token context (Nano 30B)
Best-in-class reasoning
NeMo & TensorRT optimized

How to Start with Nemotron-3-8B

1. Set Up Environment: install the NVIDIA NeMo Framework and required dependencies.

2. Download Model: access Nemotron-3-8B-Base-4K from Hugging Face or NVIDIA.

3. Fine-tune: customize the model for your enterprise use case.

4. Deploy: serve with TensorRT-LLM for optimized inference.
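Steps 2 and 4 can be sketched with Hugging Face Transformers, which the models also support. The model id below is assumed from this page; availability, gating, and the exact id should be verified on the Hub, and NeMo/TensorRT-LLM-specific setup is omitted:

```python
# Quick-start sketch: download the base model and run a generation.
# MODEL_ID is assumed from this page; verify the exact id on Hugging Face.
MODEL_ID = "nvidia/nemotron-3-8b-base-4k"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Lazy imports so the sketch can be read without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For production serving, the same checkpoint would instead be compiled with TensorRT-LLM as described in step 4.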

Key Features

Hybrid Mamba-MoE Architecture

Combines 23 Mamba-2 + MoE layers with 6 Attention layers for exceptional efficiency and performance.

1M Token Context (Nano 30B)

A 1 million token context window for comprehensive document understanding and reasoning.

Efficient Parameter Usage

30B total parameters with only 3.5B active per token: 60B+ quality at 8B-class efficiency.

Enterprise Ready

Commercial licensing (Apache 2.0 for the Nano 30B, NVIDIA Community License for the 8B series) with NVIDIA support for production deployments.

Multilingual Support

Native support for English, German, Spanish, French, Italian, Japanese, and coding languages.

Best-in-Class Reasoning

Top performance on SWE-Bench, reasoning benchmarks, and chat tasks.

NeMo Framework Compatible

Fully compatible with the NVIDIA NeMo Framework for training and deployment.

TensorRT Optimized

Optimized inference with TensorRT-LLM and FP8 quantization for NVIDIA GPUs.

Frequently Asked Questions

What is Nemotron?

Nemotron is NVIDIA's family of enterprise-ready language models. It includes the Nemotron-3-8B series with 4K context (base, chat, and specialized variants) and the newer Nemotron-3-Nano-30B with a 1M token context window. All models are designed for production use with commercial licensing.

How does the hybrid Mamba-MoE architecture work?

The Nano 30B uses a hybrid Mamba-MoE architecture with 23 Mamba-2 + MoE layers and 6 Attention layers. Each MoE layer has 128 experts plus 1 shared expert, with only 6 experts activated per token. This results in 30B total parameters but only 3.5B active, achieving 60B+ model quality at a fraction of the compute cost.
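The routing scheme described above (6 of 128 experts per token, plus a shared expert that always fires) can be illustrated with a toy top-k gate; the random gate logits here are stand-ins for a real gating network:

```python
import numpy as np

NUM_EXPERTS = 128   # routed experts per MoE layer (from the text)
ACTIVE = 6          # experts selected per token (the shared expert always runs)

def route(gate_logits: np.ndarray):
    """Top-k gating: per token, pick the ACTIVE highest-scoring experts and
    softmax-normalize their gate scores into mixing weights."""
    top = np.argsort(gate_logits, axis=-1)[..., -ACTIVE:]
    scores = np.take_along_axis(gate_logits, top, axis=-1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top, weights

rng = np.random.default_rng(0)
tokens = 4
experts, weights = route(rng.normal(size=(tokens, NUM_EXPERTS)))
assert experts.shape == (tokens, ACTIVE)        # 6 of 128 experts fire per token
assert np.allclose(weights.sum(axis=-1), 1.0)   # mixing weights sum to 1
```

This is why total and active parameter counts diverge: each token touches only its 6 routed experts plus the shared weights, so compute scales with the 3.5B active parameters rather than the full 30B.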

What context lengths do Nemotron models support?

Nemotron-3-8B models support 4,096 tokens (4K context). The newer Nemotron-3-Nano-30B supports a 1 million token context window, making it ideal for long-document processing, extensive codebase analysis, and complex reasoning tasks.
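To make the difference concrete, here is a small sketch; the document size and reserved budget are illustrative, and real token counts depend on the tokenizer:

```python
CONTEXT_WINDOWS = {
    "Nemotron-3-8B-Base-4K": 4_096,
    "Nemotron-3-Nano-30B": 1_000_000,  # 1M tokens, per this page
}

def chunks_needed(doc_tokens: int, model: str, reserve: int = 512) -> int:
    """How many pieces a document must be split into to fit the model's
    context, reserving `reserve` tokens for the prompt and response."""
    usable = CONTEXT_WINDOWS[model] - reserve
    return -(-doc_tokens // usable)  # ceiling division

# A ~200k-token codebase: dozens of chunks at 4K, a single pass at 1M.
print(chunks_needed(200_000, "Nemotron-3-8B-Base-4K"))  # 56
print(chunks_needed(200_000, "Nemotron-3-Nano-30B"))    # 1
```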

How are Nemotron models licensed?

Nemotron-3-8B models are governed by the NVIDIA AI Foundation Models Community License Agreement. Nemotron-3-Nano-30B is licensed under Apache 2.0, which is more permissive for commercial use. Both are ready for production deployment.

Which frameworks work with Nemotron?

Nemotron models are compatible with the NVIDIA NeMo Framework for training and fine-tuning. For inference, you can use TensorRT-LLM for optimized performance on NVIDIA GPUs. The models also support standard frameworks such as Hugging Face Transformers and vLLM.

How does Nemotron-3-Nano-30B perform on benchmarks?

Nemotron-3-Nano-30B achieves best-in-class performance on SWE-Bench (software engineering), reasoning benchmarks, and chat tasks. Community testing shows it performs comparably to much larger models while being significantly more efficient, with only 3.5B active parameters per token.

Which languages does Nemotron support?

Nemotron models support English as the primary language. The Nano 30B variant additionally supports German, Spanish, French, Italian, and Japanese, along with various coding languages. All variants excel at code generation and understanding.

What hardware do I need to run these models?

Nemotron-3-8B can run on a single GPU with 16GB+ VRAM. Nemotron-3-Nano-30B requires more resources: community users report best results with dual-GPU setups using RPC to avoid CPU offloading. FP8-quantized versions (NVIDIA-Nemotron-3-Nano-30B-A3B-FP8) reduce VRAM requirements.
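The rough weight-memory arithmetic behind those hardware guidelines can be sketched as follows. This counts model weights only, ignoring the KV cache, activations, and runtime overhead, so treat the numbers as lower bounds:

```python
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# Nemotron-3-8B in BF16 (2 bytes/param): weights alone fit a 16GB+ GPU.
print(round(weight_gib(8, 2), 1))   # ~14.9 GiB

# Nemotron-3-Nano-30B: BF16 vs FP8 (1 byte/param) weights.
print(round(weight_gib(30, 2), 1))  # ~55.9 GiB -> dual-GPU territory
print(round(weight_gib(30, 1), 1))  # ~27.9 GiB with FP8 quantization
```

The halving from BF16 to FP8 is why the FP8 checkpoint is the practical choice for single high-memory GPUs.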