Development Roadmap (2026 Q2) #22949

@merrymercy

SGLang Roadmap — 2026 Q2

Contributions and feedback are welcome. Join Slack.

Focus

  • Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, hierarchical cache, and load balancing.
  • Usability: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
  • Kernel optimization: For next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
  • Reinforcement learning: Framework integration and training-inference mismatch mitigation.
  • Multimodal: Enhance diffusion models for video, image and 3D generation. Omni model support.

Basic Feature Refactors and Improvements

Parallelism

Multimodal

  • Diffusion and Multimodal Generation
    Slack: #diffusion
    Issue: [Roadmap] SGLang-Diffusion (26 Q2) #23035

  • MLLM, VLM, and Multimodal Perception
    Slack: #multi-modal
    Issue: [Roadmap] Multimodal LLM (26 Q2) #23036

  • SGLang Omni
    PoC: @zhaochenyang20 @FrankLeeeee
    Repo: https://github.com/sgl-project/sglang-omni
    Currently supported: Fish Audio S2 Pro (Dual AR), Qwen3-Omni (Thinker-Talker).
    Q2 goals:

    • RFC #188 refactor: collapse Stage→Worker→Executor→Engine to Stage→Engine; ~33K → ~10K lines; request-path depth 8–10 → 6 with no accuracy/perf regression.
    • Day-zero serving for new Omni/audio-gen models: 3+ models at production quality; integration cost ~2 weeks → <1 week post-refactor.
    • Benchmark CI: extend the task × model framework (PR add cert functionality #223) with audio quality metrics, Video MMU, calibrated regression thresholds.
    • Production observability: per-stage latency breakdown, token-level tracing, audio quality monitoring.
    • Performance: generalize S2 Pro's CUDA Graph + torch.compile path (55.8 → 120 TPS) into a reusable abstraction; close the Qwen3-Omni Talker gap.
    • Omni RL: expose rollout interface to Miles (joint with RL workstream).
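
    The per-stage latency breakdown named in the observability goal could be collected along these lines. This is only a minimal sketch: `StageTimer`, its methods, and the stage names are hypothetical and do not come from sglang-omni.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            self.totals[name] += self._clock() - start

    def breakdown(self):
        """Return {stage: (seconds, fraction_of_total)}."""
        total = sum(self.totals.values())
        return {
            name: (secs, secs / total if total else 0.0)
            for name, secs in self.totals.items()
        }
```

    A request handler would wrap each stage in `with timer.stage("thinker"): ...` and export `timer.breakdown()` alongside the request ID.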

Hardware

Kernels

  • Experiment with MegaKernel integration
  • Move more kernels to JIT style
  • Integrate more communication-compute overlap kernels
  • Integrate more quantization kernels (nvfp4, mxfp8)

Reliability and Observability

  • Dump tools for debugging CUDA illegal memory access errors
  • Better per-request tracing
  • Runtime memory pool check, PD transfer checksum, weight checksum
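
  As an illustration of the weight-checksum idea, the sketch below hashes named weight buffers in a deterministic order so that two processes can compare digests after a load or sync. The function name and the dict-of-bytes input are assumptions for illustration; a real integration would obtain the raw bytes from tensors, and nothing here reflects SGLang's actual implementation.

```python
import hashlib


def weights_checksum(named_buffers):
    """Hex SHA-256 over (name, raw bytes) pairs, visited in sorted-name order.

    `named_buffers` is a hypothetical dict mapping parameter name -> bytes.
    """
    h = hashlib.sha256()
    for name in sorted(named_buffers):
        key = name.encode("utf-8")
        data = named_buffers[name]
        # Length-prefix both fields so adjacent entries cannot be confused.
        h.update(len(key).to_bytes(8, "little"))
        h.update(key)
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()
```

  Sorting by name makes the digest independent of iteration order, so sender and receiver only need to agree on parameter names and byte layout.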

RL Framework Integration

  • Miles
    PoC: @yueming-yuan @fzyzcjy
    Repo: https://github.com/radixark/miles
    Landed: Unified FP8 E2E (blog); R3 routing replay for MoE (paper); INT4 QAT closed loop (blog); speculative RL with online SFT draft; zero-copy CUDA IPC weight sync; TIS/MIS off-policy correction; VLM multi-turn; MrlX multi-agent.
    Q2 goals: Zero mismatch for MoE RL; SGLang↔Megatron parity for MoE (TP/EP/PP); Diffusion / Omni / dLLM RL via shared rollout interface; elastic rollout-vs-training scheduling.

  • slime, verl, AReaL
    PoC: @zhaochenyang20
    Maintain SGLang as a first-class rollout backend across the major external RL frameworks. slime is the upstream that Miles tracks and the reference for SGLang-native recipes. verl is the industry-adopted framework from Volcano Engine. AReaL is the async RL framework from Ant / Tsinghua.
    Goals: converge on one stable SGLang rollout-engine API to cut per-framework drift on weight sync, sampling, and logprob semantics; upstream shared primitives (R3, FP8, deterministic inference, TIS/MIS) so all four frameworks benefit together.
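
    To make the logprob-semantics point concrete, here is a minimal sketch of a per-token off-policy correction weight, assuming TIS refers to truncated importance sampling (the source gives only the acronym). The function name, argument names, and the default cap are hypothetical and are not taken from any of the frameworks above.

```python
import math


def truncated_is_weights(train_logprobs, rollout_logprobs, cap=2.0):
    """Per-token weights min(exp(logp_train - logp_rollout), cap).

    Both inputs are per-token log-probabilities of the sampled tokens,
    one list from the training engine and one from the rollout engine.
    """
    if len(train_logprobs) != len(rollout_logprobs):
        raise ValueError("logprob sequences must have the same length")
    return [
        min(math.exp(t - r), cap)
        for t, r in zip(train_logprobs, rollout_logprobs)
    ]
```

    When the two engines agree exactly, every weight is 1.0; the cap bounds the influence of tokens where the training engine assigns a much higher probability than the rollout engine did.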

Multi-LoRA Serving

Model Coverage

CI / Release / Maintenance



TODO: This is still WIP. More sections will be added.
