
[RFC]: Deprecate Legacy Quantization Formats #30136

@robertgshaw2-redhat

Description

Motivation.

  • vLLM supports a large variety of quantization formats. This is hard to maintain and makes the codebase complex.
  • Many mature frameworks (llm-compressor, modelopt, quark, torchao) have emerged that provide general-purpose implementations of the common quantization schemes.
  • Usage statistics show limited adoption of the older formats.

Proposed Change.

  • Deprecate many of the legacy quantization formats, as listed below.

Kept:

  • compressed-tensors
  • quark
  • awq.py (to be deprecated later; too many AWQ models exist for now, even though autoawq is no longer maintained)
  • bitsandbytes.py
  • fp8.py
  • mxfp4.py
  • modelopt.py
  • gguf.py
  • gptq.py (to be deprecated later; too many GPTQ models exist for now, even though autogptq is no longer maintained)
  • torchao.py

Proposed to be removed (per usage stats):

  • auto_round
  • awq_marlin (consolidate to awq.py; see the sketch after this list for the user-facing impact)
  • awq_triton (consolidate to awq.py)
  • bitblas.py
  • cpu_wna16.py
  • deepspeedfp.py
  • experts_int8.py
  • fbgemm_fp8.py
  • fp_quant.py
  • gptq_bitblas.py
  • gptq_marlin.py (consolidate to gptq.py)
  • gptq_marlin_24.py
  • hqq_marlin.py
  • inc.py
  • input_quant_fp8.py
  • ipex_quant.py
  • moe_wna16.py
  • petit.py
  • ptpc_fp8.py
  • rtn.py
  • tpu_int8.py
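
For the consolidation items above, the user-facing entry point stays the same: models are loaded either with auto-detected quantization or with an explicit `quantization` argument. Below is a minimal sketch of that flow, assuming the existing `LLM(..., quantization=...)` engine argument; the checkpoint name is purely illustrative, and the idea that marlin/triton kernel selection would move inside awq.py is an assumption based on the consolidation bullets, not a committed design.

```python
# Minimal sketch: loading an AWQ checkpoint via the existing `quantization`
# engine argument. After the proposed consolidation, kernel selection
# (e.g. marlin vs. triton) would be handled inside awq.py instead of via
# separate awq_marlin / awq_triton config classes (assumption, not confirmed).
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",               # explicit override; usually auto-detected
)

print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```

The intent of the consolidation appears to be that checkpoints currently served through awq_marlin or gptq_marlin keep working through awq.py / gptq.py; only the separate config classes would be removed.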

Feedback Period.

2 Weeks

CC List.

@mgoin @pavanimajety

Any Other Things.

The goal is to clean up the codebase:

  • Reduce mental load.
  • Reduce the complexity of implementing new features (e.g. the FusedMoE refactor).

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
