
Conversation

danbev commented Dec 5, 2025

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that combine
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check is not
strictly necessary and can be removed to allow for more flexible model
architectures.
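
For context, the sanity check asserts that per-category layer counts add up to the model's total layer count. The sketch below is a rough reconstruction of the patched form of the check, based only on the assert message quoted in the log in the next comment; the counter names follow that log, but the function name check_layer_accounting and the surrounding code are illustrative, not the actual llama-quant.cpp implementation.

    // Illustrative sketch (not the actual llama-quant.cpp code) of the
    // layer-count sanity check being removed, reconstructed from the assert
    // message in the log quoted in the next comment. The counters would be
    // filled while the quantizer walks the model tensors.
    static void check_layer_accounting(int n_layer_all,        // total layers in the model
                                       int n_attention_wv,     // layers with an attn_v weight
                                       int n_layer_recr,       // recurrent layers
                                       int n_expert_layers,    // layers with expert (MoE) tensors
                                       int pruned_attention_w) // attention layers dropped by pruning
    {
        const int n_layer_accounted =
            n_attention_wv + n_layer_recr + n_expert_layers + pruned_attention_w;

        // assumes the four categories partition the model's layers
        GGML_ASSERT((n_layer_accounted == n_layer_all) &&
                    "layer count mismatch: attention + expert + recurrent + pruned != total layers");
    }

The assert assumes the four categories partition the layers, and it is that assumption that breaks down for hybrid MoE models, as the next comment shows.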

ggerganov commented Dec 5, 2025

With this patch, Qwen3 Next fails:

llama_model_quantize_impl: n_layer_all = 48, n_attention_wv = 12, n_layer_recr = 36, n_expert_layers = 48, pruned_attention_w = 0
llama.cpp/src/llama-quant.cpp:746: GGML_ASSERT((n_layer_accounted == n_layer_all) && "layer count mismatch: attention + expert + recurrent + pruned != total layers") failed

I think we should just remove this sanity check - it's not really useful anyway.
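
Working through the numbers in the log (an editorial reading of the counts, not something spelled out in the thread): the 48 layers split into 12 attention and 36 recurrent layers, and n_expert_layers = 48 means every layer also carries expert tensors, so the categories overlap rather than partition the layers:

    n_attention_wv + n_layer_recr + n_expert_layers + pruned_attention_w
      = 12 + 36 + 48 + 0
      = 96 != 48 = n_layer_all

The GGML_ASSERT therefore fires even though the model itself is fine, which is why removing the check is the simpler fix.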

danbev force-pushed the llama-quants-sanity-check-experts branch from e4c5646 to 0ecbb8c on December 5, 2025 at 13:21
danbev changed the title from "llama : include n_experts in quantization sanity check" to "llama : remove quantization sanity check" on Dec 5, 2025
danbev merged commit 444f00b into ggml-org:master on Dec 6, 2025
78 checks passed
JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request on Dec 7, 2025
* llama : remove quantization sanity check

* llama : remove unused pruned_attention_w and is_clip_model vars
danbev deleted the llama-quants-sanity-check-experts branch on December 7, 2025 at 07:08