Nemotron-3-8B-Base-4K

NVIDIA's enterprise-ready 8B parameter foundation model for custom LLM development

Latest Research & Updates

NVIDIA Releases Nemotron 3 Nano: A New 30B Hybrid Reasoning Model

NVIDIA has released Nemotron-3-Nano-30B, a groundbreaking hybrid Mixture-of-Experts model with best-in-class performance on reasoning and chat tasks. It features 30B total parameters, only 3.5B active per token, and a 1M-token context window.

NVIDIA Research
Dec 15, 2024
Understanding the Hybrid Mamba-MoE Architecture

A deep dive into the innovative architecture behind Nemotron-3-Nano-30B. The model combines 23 Mamba-2 + MoE layers with 6 Attention layers, using 128 experts per layer with only 6 activated per token for exceptional efficiency.

AI Architecture Team
Dec 16, 2024
Running Nemotron-3-Nano-30B Locally: A Community Guide

Community insights on deploying Nemotron-3-Nano-30B on consumer hardware. Users report best results with dual NVIDIA GPUs using RPC to avoid CPU offloading. Optimal settings: temp=0.6, top-p=0.95, top-k=20.
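Those sampling settings can be sketched as a standalone decode-time filter. This is a minimal NumPy illustration of standard temperature/top-k/top-p filtering with the recommended values as defaults, not Nemotron's actual inference code:

```python
import numpy as np

def sample_filtered(logits, temperature=0.6, top_p=0.95, top_k=20, rng=None):
    """Temperature scaling, then top-k and nucleus (top-p) filtering, then sample."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature

    # Top-k: discard everything below the k-th highest logit.
    if top_k < len(logits):
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    # Softmax over the surviving tokens.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest high-probability set whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]

    final = np.zeros_like(probs)
    final[keep] = probs[keep]
    final /= final.sum()
    return int(rng.choice(len(probs), p=final))
```

Note the interplay: top-k prunes by rank first, then top-p prunes by cumulative probability, so the effective candidate set is whichever filter is stricter.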

Community Contributors
Dec 17, 2024
Nemotron Architecture

What is the Nemotron Model Family?

Nemotron is NVIDIA's family of enterprise-ready language models, spanning the Nemotron-3-8B series and the Nemotron-3-Nano-30B.

The 8B series offers efficient models with a 4K context window, well suited to most enterprise applications. The Nano 30B represents a leap forward with a hybrid Mamba-MoE architecture: 30B total parameters with only 3.5B active per token, achieving 60B+ quality at a fraction of the cost.

Key innovation: the Nano 30B features a 1 million token context window, best-in-class reasoning performance on SWE-Bench, and support for English, German, Spanish, French, Italian, and Japanese, plus coding languages.
Hybrid Mamba-MoE architecture
Up to 1M token context (Nano 30B)
Best-in-class reasoning
NeMo & TensorRT optimized

How to Start with Nemotron-3-8B

1. Set Up Environment: install the NVIDIA NeMo Framework and required dependencies.

2. Download Model: access Nemotron-3-8B-Base-4K from Hugging Face or NVIDIA.

3. Fine-tune: customize the model for your enterprise use case.

4. Deploy: serve with TensorRT-LLM for optimized inference.
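Steps 2 and 4 can be sketched with Hugging Face Transformers, which the models also support. The model id below is assumed from this page; availability, gating, and the exact id should be verified on the Hub, and NeMo/TensorRT-LLM-specific setup is omitted:

```python
# Quick-start sketch: download the base model and run a generation.
# MODEL_ID is assumed from this page; verify the exact id on Hugging Face.
MODEL_ID = "nvidia/nemotron-3-8b-base-4k"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Lazy imports so the sketch can be read without torch/transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

For production serving, the same checkpoint would instead be compiled with TensorRT-LLM as described in step 4.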

Key Features

Hybrid Mamba-MoE Architecture

Combines 23 Mamba-2 + MoE layers with 6 Attention layers for exceptional efficiency and performance.

1M Token Context (Nano 30B)

A 1 million token context window for comprehensive document understanding and reasoning.

Efficient Parameter Usage

30B total parameters with only 3.5B active per token: 60B+ quality at 8B-class efficiency.

Enterprise Ready

Commercial licensing (Apache 2.0 for the Nano 30B, NVIDIA Community License for the 8B series) with NVIDIA support for production deployments.

Multilingual Support

Native support for English, German, Spanish, French, Italian, Japanese, and coding languages.

Best-in-Class Reasoning

Top performance on SWE-Bench, reasoning benchmarks, and chat tasks.

NeMo Framework Compatible

Fully compatible with the NVIDIA NeMo Framework for training and deployment.

TensorRT Optimized

Optimized inference with TensorRT-LLM and FP8 quantization for NVIDIA GPUs.

Frequently Asked Questions

What is Nemotron?

Nemotron is NVIDIA's family of enterprise-ready language models. It includes the Nemotron-3-8B series with 4K context (base, chat, and specialized variants) and the newer Nemotron-3-Nano-30B with a 1M token context window. All models are designed for production use with commercial licensing.

How does the hybrid Mamba-MoE architecture work?

The Nano 30B uses a hybrid Mamba-MoE architecture with 23 Mamba-2 + MoE layers and 6 Attention layers. Each MoE layer has 128 experts plus 1 shared expert, with only 6 experts activated per token. This results in 30B total parameters but only 3.5B active, achieving 60B+ model quality at a fraction of the compute cost.
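The routing scheme described above (6 of 128 experts per token, plus a shared expert that always fires) can be illustrated with a toy top-k gate; the random gate logits here are stand-ins for a real gating network:

```python
import numpy as np

NUM_EXPERTS = 128   # routed experts per MoE layer (from the text)
ACTIVE = 6          # experts selected per token (the shared expert always runs)

def route(gate_logits: np.ndarray):
    """Top-k gating: per token, pick the ACTIVE highest-scoring experts and
    softmax-normalize their gate scores into mixing weights."""
    top = np.argsort(gate_logits, axis=-1)[..., -ACTIVE:]
    scores = np.take_along_axis(gate_logits, top, axis=-1)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top, weights

rng = np.random.default_rng(0)
tokens = 4
experts, weights = route(rng.normal(size=(tokens, NUM_EXPERTS)))
assert experts.shape == (tokens, ACTIVE)        # 6 of 128 experts fire per token
assert np.allclose(weights.sum(axis=-1), 1.0)   # mixing weights sum to 1
```

This is why total and active parameter counts diverge: each token touches only its 6 routed experts plus the shared weights, so compute scales with the 3.5B active parameters rather than the full 30B.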

What context lengths do Nemotron models support?

Nemotron-3-8B models support 4,096 tokens (4K context). The newer Nemotron-3-Nano-30B supports a 1 million token context window, making it ideal for long-document processing, extensive codebase analysis, and complex reasoning tasks.
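To make the difference concrete, here is a small sketch; the document size and reserved budget are illustrative, and real token counts depend on the tokenizer:

```python
CONTEXT_WINDOWS = {
    "Nemotron-3-8B-Base-4K": 4_096,
    "Nemotron-3-Nano-30B": 1_000_000,  # 1M tokens, per this page
}

def chunks_needed(doc_tokens: int, model: str, reserve: int = 512) -> int:
    """How many pieces a document must be split into to fit the model's
    context, reserving `reserve` tokens for the prompt and response."""
    usable = CONTEXT_WINDOWS[model] - reserve
    return -(-doc_tokens // usable)  # ceiling division

# A ~200k-token codebase: dozens of chunks at 4K, a single pass at 1M.
print(chunks_needed(200_000, "Nemotron-3-8B-Base-4K"))  # 56
print(chunks_needed(200_000, "Nemotron-3-Nano-30B"))    # 1
```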

How are Nemotron models licensed?

Nemotron-3-8B models are governed by the NVIDIA AI Foundation Models Community License Agreement. Nemotron-3-Nano-30B is licensed under Apache 2.0, which is more permissive for commercial use. Both are ready for production deployment.

Which frameworks work with Nemotron?

Nemotron models are compatible with the NVIDIA NeMo Framework for training and fine-tuning. For inference, you can use TensorRT-LLM for optimized performance on NVIDIA GPUs. The models also support standard frameworks such as Hugging Face Transformers and vLLM.

How does Nemotron-3-Nano-30B perform on benchmarks?

Nemotron-3-Nano-30B achieves best-in-class performance on SWE-Bench (software engineering), reasoning benchmarks, and chat tasks. Community testing shows it performs comparably to much larger models while being significantly more efficient, with only 3.5B active parameters per token.

Which languages does Nemotron support?

Nemotron models support English as the primary language. The Nano 30B variant additionally supports German, Spanish, French, Italian, and Japanese, along with various coding languages. All variants excel at code generation and understanding.

What hardware do I need to run these models?

Nemotron-3-8B can run on a single GPU with 16GB+ VRAM. Nemotron-3-Nano-30B requires more resources: community users report best results with dual-GPU setups using RPC to avoid CPU offloading. FP8-quantized versions (NVIDIA-Nemotron-3-Nano-30B-A3B-FP8) reduce VRAM requirements.
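The rough weight-memory arithmetic behind those hardware guidelines can be sketched as follows. This counts model weights only, ignoring the KV cache, activations, and runtime overhead, so treat the numbers as lower bounds:

```python
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billions * 1e9 * bytes_per_param / 2**30

# Nemotron-3-8B in BF16 (2 bytes/param): weights alone fit a 16GB+ GPU.
print(round(weight_gib(8, 2), 1))   # ~14.9 GiB

# Nemotron-3-Nano-30B: BF16 vs FP8 (1 byte/param) weights.
print(round(weight_gib(30, 2), 1))  # ~55.9 GiB -> dual-GPU territory
print(round(weight_gib(30, 1), 1))  # ~27.9 GiB with FP8 quantization
```

The halving from BF16 to FP8 is why the FP8 checkpoint is the practical choice for single high-memory GPUs.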