Inference models

nvidia/nemotron-3-ultra

Lambda — Sat, 06 Jun 2026 15:42:26 GMT

TL;DR: token throughput

All benchmarks use a single NVFP4 checkpoint (nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-NVFP4) on a decode-heavy workload of 8K input and 64K output tokens at 256 concurrent requests. Per-user generation throughput is aggregate generation throughput (tok/s) divided by the concurrency level, that is, the rate at which each individual user sees their response stream back.
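The per-user definition above is a single division. Here is a minimal Python sketch, assuming the aggregate rate is measured over a window in which all concurrent requests stay active. The function name `per_user_throughput` is hypothetical, and the input is an illustrative figure, not a measured Nemotron 3 Ultra result:

```python
def per_user_throughput(aggregate_gen_tok_s: float, concurrency: int) -> float:
    """Per-user generation rate: aggregate generation tok/s split evenly across concurrent requests."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    return aggregate_gen_tok_s / concurrency


# Illustrative input only (not a measured result): 10,240 tok/s aggregate at 256 concurrent requests.
print(per_user_throughput(10_240, 256))  # 40.0
```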

LiquidAI/LFM2.5-8B-A1B

Lambda — Wed, 03 Jun 2026 14:44:31 GMT

TL;DR: token throughput

SGLang

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
1× NVIDIA B200 GPU	6,098 tok/s	206 tok/s	30,489 tok/s	792 ms	4.9 ms
1× NVIDIA H100 GPU	3,714 tok/s	125 tok/s	18,572 tok/s	1,248 ms	8.0 ms
1× NVIDIA A100 GPU	1,950 tok/s	68 tok/s	9,751 tok/s	3,594 ms	14.7 ms

vLLM

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
1× NVIDIA B200 GPU	7,253 tok/s	238 tok/s	36,267 tok/s	433 ms	4.5 ms
1× NVIDIA H100 GPU	3,787 tok/s	123 tok/s	18,937 tok/s	568 ms	8.2 ms
1× NVIDIA A100 GPU	1,971 tok/s	64 tok/s	9,853 tok/s	962 ms	15.7 ms
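As a rough cross-check on the LFM2.5 rows, a single stream's rate should sit near 1,000 / ITL (ms) when each decode step emits one token. Here is a minimal Python sketch, assuming steady-state streaming with no speculative decoding. The helper name `rate_from_itl` is hypothetical:

```python
def rate_from_itl(itl_ms: float) -> float:
    """Tokens per second one stream sees if tokens arrive every `itl_ms` milliseconds."""
    return 1000.0 / itl_ms


# (mean ITL in ms, reported per-user tok/s) copied from the first LFM2.5 table above.
rows = {"B200": (4.9, 206), "H100": (8.0, 125), "A100": (14.7, 68)}
for gpu, (itl_ms, reported) in rows.items():
    print(f"{gpu}: {rate_from_itl(itl_ms):.0f} tok/s implied vs {reported} tok/s reported")
```

The relation does not hold for every model on this page. The Kimi-K2.6 row, for instance, reports a per-user rate roughly twice 1,000 / ITL, so treat this as a sanity check rather than a definition.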

HuggingFaceBio/Carbon-3B

Lambda — Wed, 03 Jun 2026 14:44:28 GMT

TL;DR: nucleotide throughput

Both Carbon-500M and Carbon-3B run on a single NVIDIA A10 GPU, served with SGLang or vLLM. Because Carbon tokenizes DNA as non-overlapping 6-mers, each token carries roughly 6 base pairs, so the token rates below correspond to roughly 6× as many base pairs per second.
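The 6-mer arithmetic can be sketched in Python. This sketch assumes a plain left-to-right split with no overlap. How Carbon's actual tokenizer handles a trailing fragment shorter than 6 bases, or any special tokens, is not stated here, so the sketch simply keeps the remainder as its own short chunk. Both function names are hypothetical:

```python
K = 6  # bases per token for non-overlapping 6-mers


def split_kmers(seq: str, k: int = K) -> list[str]:
    """Split a DNA string into consecutive, non-overlapping k-mers (the last chunk may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def bp_per_second(tok_per_s: float, k: int = K) -> float:
    """Approximate base pairs per second when each token covers k bases."""
    return tok_per_s * k


print(split_kmers("ACGTACGGATCCTTAG"))  # ['ACGTAC', 'GGATCC', 'TTAG']
print(bp_per_second(1_000))  # illustrative rate, not a measured Carbon result -> 6000
```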

stepfun-ai/Step-3.7-Flash

Lambda — Wed, 03 Jun 2026 14:44:24 GMT

TL;DR: token throughput

Served with vLLM using MTP (multi-token prediction) speculative decoding, on a workload of 8,192 input / 2,048 output tokens at 32 concurrent requests.
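If total throughput counts input and output tokens together, then for a fixed workload shape it scales with generation throughput by (input + output) / output. The 5:1 total-to-generation ratios in several tables on this page are consistent with this for an 8,192 / 2,048 split. Here is a minimal Python sketch, assuming every request processes its full input and output within the same measurement window. The function name is hypothetical:

```python
def total_from_gen(gen_tok_s: float, input_tokens: int, output_tokens: int) -> float:
    """Total (input + output) tok/s implied by a generation rate, assuming full-length requests."""
    return gen_tok_s * (input_tokens + output_tokens) / output_tokens


# Step-3.7-Flash workload shape: 8192 in / 2048 out gives a factor of 5.
print(total_from_gen(1.0, 8192, 2048))  # 5.0
```

For example, the GLM-5.1 row reporting 1,345 tok/s generation gives 6,725 tok/s, close to its reported 6,727 tok/s total. That page does not state its workload shape, so the match is only suggestive.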

moonshotai/Kimi-K2.6

Lambda — Thu, 30 Apr 2026 14:02:36 GMT

TL;DR: token throughput

vLLM

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
NVIDIA HGX B200	1,408 tok/s	44 tok/s	7,046 tok/s	2,264 ms	44.5 ms

deepseek-ai/DeepSeek-V4-Pro

Lambda — Tue, 28 Apr 2026 16:51:18 GMT

TL;DR: token throughput

vLLM

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
NVIDIA HGX B200	911.92 tok/s	28.50 tok/s	4,561.38 tok/s	1,186.15 ms	55.79 ms

deepseek-ai/DeepSeek-V4-Flash

Lambda — Mon, 27 Apr 2026 21:16:57 GMT

TL;DR: token throughput

SGLang

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
NVIDIA HGX B200 (native FP4+FP8 build)	1,222 tok/s	38 tok/s	11,000 tok/s	1,701 ms	66 ms
NVIDIA HGX H100 (FP8-quantized build)	1,262 tok/s	39 tok/s	11,361 tok/s	2,463 ms	60 ms

vLLM

Hardware	Gen. throughput	Per-user gen	Total throughput	TTFT (mean)	ITL (mean)
NVIDIA HGX B200 (native FP4+FP8 build)	1,469 tok/s	46 tok/s	13,217 tok/s	1,452 ms	20 ms

zai-org/GLM-5.1

Lambda — Tue, 07 Apr 2026 16:15:31 GMT

TL;DR: token throughput

SGLang

Hardware	Gen. throughput	Per-user gen.	Total throughput	TTFT	ITL
1× NVIDIA HGX B200	1,345 tok/s	42.0 tok/s	6,727 tok/s	1,073 ms	59 ms

vLLM

Hardware	Gen. throughput	Per-user gen.	Total throughput	TTFT	ITL
1× NVIDIA HGX B200	1,265 tok/s	39.5 tok/s	6,327 tok/s	1,317 ms	58 ms

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B

Lambda — Thu, 12 Mar 2026 16:48:32 GMT

TL;DR: token throughput

vLLM

Hardware	Gen. throughput	TTFT	ITL
2× NVIDIA B200 GPUs (NVFP4)	2,057 tok/s	4,040 ms	12 ms
1× NVIDIA B200 GPU (NVFP4)	1,517 tok/s	4,455 ms	16 ms
2× NVIDIA B200 GPUs (FP8)	1,847 tok/s	3,948 ms	13 ms
2× NVIDIA H100 GPUs (FP8)	1,116 tok/s	4,557 ms	24 ms
4× NVIDIA A100 GPUs (BF16)	553 tok/s	6,694 ms	51 ms

allenai/Olmo-Hybrid-Instruct-DPO-7B

Lambda — Fri, 06 Mar 2026 21:49:02 GMT

TL;DR: token throughput on vLLM

Hardware	Gen. throughput	TTFT	ITL
1× NVIDIA B200 GPU	1,765 tok/s	4,424 ms	14 ms
1× NVIDIA H100 GPU	1,066 tok/s	4,665 ms	25 ms
1× NVIDIA A100 GPU	551 tok/s	7,191 ms	51 ms