
Conversation

danbev commented Dec 5, 2025

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that combine
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check is not
strictly necessary and can be removed to allow for more flexible model
architectures.
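
For context, the sanity check asserts that per-category layer counts add up to the model's total layer count. The sketch below is a rough reconstruction of the patched form of the check, based only on the assert message quoted in the log in the next comment; the counter names follow that log, but the function name check_layer_accounting and the surrounding code are illustrative, not the actual llama-quant.cpp implementation.

    // Illustrative sketch (not the actual llama-quant.cpp code) of the
    // layer-count sanity check being removed, reconstructed from the assert
    // message in the log quoted in the next comment. The counters would be
    // filled while the quantizer walks the model tensors.
    static void check_layer_accounting(int n_layer_all,        // total layers in the model
                                       int n_attention_wv,     // layers with an attn_v weight
                                       int n_layer_recr,       // recurrent layers
                                       int n_expert_layers,    // layers with expert (MoE) tensors
                                       int pruned_attention_w) // attention layers dropped by pruning
    {
        const int n_layer_accounted =
            n_attention_wv + n_layer_recr + n_expert_layers + pruned_attention_w;

        // assumes the four categories partition the model's layers
        GGML_ASSERT((n_layer_accounted == n_layer_all) &&
                    "layer count mismatch: attention + expert + recurrent + pruned != total layers");
    }

The assert assumes the four categories partition the layers, and it is that assumption that breaks down for hybrid MoE models, as the next comment shows.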

ggerganov commented Dec 5, 2025

With this patch, Qwen3 Next fails:

llama_model_quantize_impl: n_layer_all = 48, n_attention_wv = 12, n_layer_recr = 36, n_expert_layers = 48, pruned_attention_w = 0
llama.cpp/src/llama-quant.cpp:746: GGML_ASSERT((n_layer_accounted == n_layer_all) && "layer count mismatch: attention + expert + recurrent + pruned != total layers") failed

I think we should just remove this sanity check - it's not really useful anyway.
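
Working through the numbers in the log (an editorial reading of the counts, not something spelled out in the thread): the 48 layers split into 12 attention and 36 recurrent layers, and n_expert_layers = 48 means every layer also carries expert tensors, so the categories overlap rather than partition the layers:

    n_attention_wv + n_layer_recr + n_expert_layers + pruned_attention_w
      = 12 + 36 + 48 + 0
      = 96 != 48 = n_layer_all

The GGML_ASSERT therefore fires even though the model itself is fine, which is why removing the check is the simpler fix.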

danbev force-pushed the llama-quants-sanity-check-experts branch from e4c5646 to 0ecbb8c on December 5, 2025 at 13:21
danbev changed the title from "llama : include n_experts in quantization sanity check" to "llama : remove quantization sanity check" on Dec 5, 2025
danbev merged commit 444f00b into ggml-org:master on Dec 6, 2025
78 checks passed
JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request on Dec 7, 2025
* llama : remove quantization sanity check

* llama : remove unused pruned_attention_w and is_clip_model vars
danbev deleted the llama-quants-sanity-check-experts branch on December 7, 2025 at 07:08