Development Roadmap (2026 Q2)

# SGLang Roadmap — 2026 Q2

*Contributions and feedback are welcome*. [Join Slack](https://slack.sglang.ai).

## Focus

- **Feature compatibility & reliability**: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing.
- **Usability**: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
- **Kernel optimization**: For next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
- **Reinforcement learning**: Framework integration and training-inference mismatch mitigation.
- **Multimodal**: Enhance diffusion models for video, image and 3D generation. Omni model support.

## Basic Feature Refactors and Improvements

- **Scheduler refactor**
  PoC: @hnyls2002  
  Slack: [#dev](https://sgl-fru7574.slack.com/archives/C095B2L7UEB), [#spec-decoding](https://sgl-fru7574.slack.com/archives/C09KELDAD8U)  
  Goals: Make forward-mode more general; make the scheduler more stateless; fully support mixed chunked prefill; hide CPU overhead in the scheduler in all cases; move more preparation into the CUDA graph. Tracking issue: https://github.com/sgl-project/sglang/issues/11762

- **KV Cache management**
  PoC: @ispobock @hzh0425 @xiezhq-hermann 
  Slack: [#kv-cache-store](https://sgl-fru7574.slack.com/archives/C095B2L7UEB), [#hybrid-model](https://sgl-fru7574.slack.com/archives/C0AAC18MHQQ)  
  Goals: Make hierarchical cache and hybrid attention native features; support flexible session control for agentic workloads.
    - https://github.com/sgl-project/sglang/issues/21846
    - https://github.com/sgl-project/sglang/issues/20415 
    - https://github.com/sgl-project/sglang/pull/13581 

- **Speculative decoding**
  PoC: @Qiaolin-Yu 
  Slack: [#spec-decoding](https://sgl-fru7574.slack.com/archives/C09KELDAD8U)  
  Goals: General abstraction for more spec algorithms; General abstraction for spec graph preparation & init; Adaptive spec configurations for different requests and batch sizes.
    - #23005

- **PD disaggregation**
  PoC: @ShangmingCai
  Slack: [#pd-disaggregation](https://sgl-fru7574.slack.com/archives/C08AP4WU8P3)
  Goals: https://github.com/sgl-project/sglang/issues/21703 

- **API Server**
  PoC: @alexnails
  Slack: [#sglang-grpc-rfc-22558](https://sgl-fru7574.slack.com/archives/C0ATAE18PS8) [#rust-migration](https://sgl-fru7574.slack.com/archives/C0ATDERV4BS)
  Goals: https://github.com/sgl-project/sglang/issues/22558

- **Rust migration**
  PoC: @ishandhanani @rainj-me 
  Slack: [#rust-migration](https://sgl-fru7574.slack.com/archives/C0ATDERV4BS)
  Goals: Gradually rewrite most components (scheduler, api server, prefix tree) in Rust

- **CUDA graph runner backend**
  PoC: @Oasis-Git
  Slack: [#piecewise-cuda-graph](https://sgl-fru7574.slack.com/archives/C09KZ1MV013)
  Goals: Support flexible CUDA graph backends (decode, prefill) x (full, breakable, torch-compile-based piecewise CUDA graph); enable breakable CUDA graph for prefill by default. Tracking issue: https://github.com/sgl-project/sglang/issues/23004
 
## Parallelism

- **Pipeline parallelism** refactor for long-context prefill and high-throughput decoding  
  PoC: @ShangmingCai
  Slack: [#pipeline-parallel](https://sgl-fru7574.slack.com/archives/C09J7BY42PP)  
  Issue: https://github.com/sgl-project/sglang/issues/11857

- **Expert parallelism**  
  Slack: [#expert-parallel](https://sgl-fru7574.slack.com/archives/C09QRUHFJTE)  
  Issue: https://github.com/sgl-project/sglang/issues/19650, https://github.com/sgl-project/sglang/issues/8715  
  Elastic parallel PRs: https://github.com/sgl-project/sglang/pull/10423, https://github.com/sgl-project/sglang/pull/11837

- **Data parallelism attention** refactor  
  Issue: https://github.com/sgl-project/sglang/issues/16080

- **Context parallelism**
  PoC: @Fridge003 @kpham-sgl  @ShangmingCai @ch-wan 
  Slack: [#context-parallel](https://sgl-fru7574.slack.com/archives/C0ANLRB8QK0)
  Issue: https://github.com/sgl-project/sglang/issues/21788

- **Distributed Weight Data Parallelism**
  PoC: @yuhao318 
  Issue: https://github.com/sgl-project/sglang/issues/22084

- **GB200/GB300 NVL72 optimizations**  
  PoC: @Fridge003 
  Slack: [#deepseek-large-scale-serving](https://sgl-fru7574.slack.com/archives/C08QGMU93GX)
  Issue: https://github.com/sgl-project/sglang/issues/19650


## Multimodal

- **Diffusion and Multimodal Generation**
  Slack: [#diffusion](https://sgl-fru7574.slack.com/archives/C09P0HTKE6A)
  Issue: #23035

- **MLLM, VLM, and Multimodal Perception**
  Slack: [#multi-modal](https://sgl-fru7574.slack.com/archives/C087RGPBC81)
  Issue: #23036

- **SGLang Omni**
  PoC: @zhaochenyang20 @FrankLeeeee 
  Repo: https://github.com/sgl-project/sglang-omni
  Currently supported: Fish Audio S2 Pro (Dual AR), Qwen3-Omni (Thinker-Talker).
  Q2 goals:
    - **[RFC #188 refactor](https://github.com/sgl-project/sglang-omni/issues/188)**: collapse Stage→Worker→Executor→Engine to Stage→Engine; ~33K → ~10K lines; request-path depth 8–10 → 6 with no accuracy/perf regression.
    - **Day-zero serving for new Omni/audio-gen models**: 3+ models at production quality; integration cost ~2 weeks → <1 week post-refactor.
    - **Benchmark CI**: extend the `task × model` framework (PR #223) with audio quality metrics, Video MMU, calibrated regression thresholds.
    - **Production observability**: per-stage latency breakdown, token-level tracing, audio quality monitoring.
    - **Performance**: generalize S2 Pro's CUDA Graph + torch.compile path (55.8 → 120 TPS) into a reusable abstraction; close the Qwen3-Omni Talker performance gap.
    - **Omni RL**: expose rollout interface to Miles (joint with RL workstream).

## Hardware
- **General multi-hardware abstraction**
  PoC: @alexnails 
  PR: https://github.com/sgl-project/sglang/pull/21388

- **NVIDIA collaboration**
  Issue: https://github.com/sgl-project/sglang/issues/22960

- **AMD extension & Specification** on top of Q2 above  
  PoC: @HaiShaw 
  Issue: https://github.com/sgl-project/sglang/issues/23494

- **TPU**
  TorchTPU-based solution
  JAX-based solution: https://github.com/sgl-project/sglang-jax/issues/909
  Slack: [#dev-jax-tpu](https://sgl-fru7574.slack.com/archives/C09EBE5HT5X)

- **NPU**
  PoC: @iforgetmyname
  Issue: https://github.com/sgl-project/sglang/issues/25598

- **Intel CPU/XPU**: 
  - https://github.com/sgl-project/sglang/issues/24921
  - https://github.com/sgl-project/sglang/issues/24922

## Kernels
  - Experiment with MegaKernel integration
  - Move more kernels to JIT style
  - Integrate more communication-compute overlap kernels
  - Integrate more quantization kernels (nvfp4, mxfp8)

## Reliability and Observability
 - Dumping tools for diagnosing CUDA illegal memory access errors
 - Better per-request tracing
 - Runtime memory pool check, PD transfer checksum, weight checksum
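
As an illustration of the weight checksum item above, here is a minimal sketch of an order-independent digest over named weight buffers. It is not SGLang's implementation: the function name `weights_checksum`, the use of raw `bytes` in place of tensors, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Mapping


def weights_checksum(weights: Mapping[str, bytes]) -> str:
    """Return a SHA-256 hex digest over named weight buffers (hypothetical helper).

    Names are visited in sorted order so the digest does not depend on
    insertion order. Each name and buffer is length-prefixed so that two
    different (name, buffer) splits cannot yield the same byte stream.
    """
    digest = hashlib.sha256()
    for name in sorted(weights):
        encoded_name = name.encode("utf-8")
        buffer = weights[name]
        digest.update(len(encoded_name).to_bytes(8, "little"))
        digest.update(encoded_name)
        digest.update(len(buffer).to_bytes(8, "little"))
        digest.update(buffer)
    return digest.hexdigest()


# Hypothetical data: the receiver holds the same buffers in a different order,
# and a third copy has one flipped byte.
sent = {"layer0.weight": b"\x00\x01\x02\x03", "layer0.bias": b"\x04\x05"}
received = {"layer0.bias": b"\x04\x05", "layer0.weight": b"\x00\x01\x02\x03"}
corrupted = {"layer0.bias": b"\x04\x05", "layer0.weight": b"\x00\x01\x02\xff"}

print(weights_checksum(sent) == weights_checksum(received))   # True
print(weights_checksum(sent) == weights_checksum(corrupted))  # False
```

Comparing one digest per model (or per shard) keeps the check cheap to transmit; a real system would hash tensor memory directly and might prefer a faster non-cryptographic hash.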

## RL Framework Integration

- **Miles** 
  PoC: @yueming-yuan  @fzyzcjy
  Repo: https://github.com/radixark/miles
  Landed: Unified FP8 E2E ([blog](https://lmsys.org/blog/2025-11-25-fp8-rl/)); R3 routing replay for MoE ([paper](https://arxiv.org/pdf/2510.11370)); INT4 QAT closed loop ([blog](https://lmsys.org/blog/2026-01-26-int4-qat/)); speculative RL with online SFT draft; zero-copy CUDA IPC weight sync; TIS/MIS off-policy correction; VLM multi-turn; MrlX multi-agent.
  Q2 goals: Zero mismatch for MoE RL; SGLang↔Megatron parity for MoE (TP/EP/PP); Diffusion / Omni / dLLM RL via shared rollout interface; elastic rollout-vs-training scheduling.

- **slime, verl, AReaL**
  PoC: @zhaochenyang20
  Maintain SGLang as a first-class rollout backend across the major external RL frameworks. [slime](https://github.com/THUDM/slime) is the upstream that Miles tracks and the reference for SGLang-native recipes. [verl](https://github.com/volcengine/verl) is the industry-adopted Volcano Engine framework. [AReaL](https://github.com/inclusionAI/AReaL) is the async RL framework from Ant / Tsinghua.
  Goals: Converge on one stable SGLang rollout-engine API to cut per-framework drift on weight sync, sampling, and logprob semantics; upstream shared primitives (R3, FP8, deterministic inference, TIS/MIS) so that all four frameworks (Miles, slime, verl, AReaL) benefit together.
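
To make the training-inference mismatch theme concrete, the sketch below shows one common form of off-policy correction, assuming that TIS above refers to truncated importance sampling: each token's loss is reweighted by the ratio of training-engine probability to rollout-engine probability, capped at a constant. The function name `truncated_is_weights` and the default `clip_max` are hypothetical, and this is not taken from the Miles codebase.

```python
import math
from typing import List, Sequence


def truncated_is_weights(
    train_logprobs: Sequence[float],
    rollout_logprobs: Sequence[float],
    clip_max: float = 2.0,
) -> List[float]:
    """Per-token weights min(exp(logp_train - logp_rollout), clip_max).

    A weight of 1.0 means the two engines agree on that token; the cap keeps
    a single badly mismatched token from dominating the gradient.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [
        min(math.exp(train - rollout), clip_max)
        for train, rollout in zip(train_logprobs, rollout_logprobs)
    ]


# Hypothetical log-probs for three sampled tokens.
weights = truncated_is_weights(
    train_logprobs=[-0.5, -1.0, -2.0],
    rollout_logprobs=[-0.5, -3.0, -1.0],
)
print([round(w, 3) for w in weights])  # [1.0, 2.0, 0.368]
```

If the rollout and training engines produced identical log-probs, every weight would be exactly 1.0 and the correction would be a no-op.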



## Multi-LoRA Serving
- Support more models, improve performance, and stay compatible with LoRA RL training in Miles  
  PoC: @yushengsu-thu  
  Issue: https://github.com/sgl-project/sglang/issues/25095

## Model Coverage
- Day-0 model support for all major models  
  PoC: @wisclmy0611 @JustinTong0323  
  Slack: [#dev](https://sgl-fru7574.slack.com/archives/C07PEP77X6F)

## CI / Release / Maintenance
- Improve stability, increase coverage, and reduce flakiness. 
  PoC: @alisonshao @Kangyan-Zhou 
  Slack: [#ci-cd-build-release](https://sgl-fru7574.slack.com/archives/C09HCG2HM1T)
  Issue: https://github.com/sgl-project/sglang/issues/21157, https://github.com/sgl-project/sglang/issues/20865, https://github.com/sgl-project/sglang/issues/20847

- Mock model for correctness tests

------

TODO: This is still WIP. More sections will be added.
