TL;DR: See #5055
Before the recent two-bit quantization and importance matrix related changes, there were two low-bit quantization types available in `llama.cpp`: `Q2_K` and `Q3_K_S`. `Q2_K` was basically a 3-bit quantization, with just the `attn_k` and `attn_q` tensors quantized with 2 bits. The table below shows their model sizes and perplexities (`wiki.test.raw`, `n_ctx = 512`) for LLaMA-v2-70B:
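
As context for these numbers, here is a minimal sketch of how such perplexities are typically produced with the stock `quantize` and `perplexity` tools; the GGUF file names are placeholders, and the binaries are assumed to be built in the repository root:

```sh
# Quantize an f16 GGUF model to one of the low-bit types (placeholder file names)
./quantize llama-2-70b-f16.gguf llama-2-70b-q2_k.gguf Q2_K
./quantize llama-2-70b-f16.gguf llama-2-70b-q3_k_s.gguf Q3_K_S

# Measure perplexity on wiki.test.raw with a context of 512 tokens
./perplexity -m llama-2-70b-q2_k.gguf -f wiki.test.raw -c 512
```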

After the recent changes, `Q2_K` has become an actual 2-bit quantization at less than 3 bits per weight (23.71 GiB for LLaMA-v2-70B over the model's roughly 70 billion parameters works out to about 2.9 bpw), with a perplexity of 4.0039 (using an importance matrix derived from `wiki.train.raw`). `Q3_K_S` has increased very slightly to 27.86 GiB, but has a better perplexity of 3.6603. Based on #5005, there is a need for an intermediate step in terms of model size between the new `Q2_K` and `Q3_K_S`. This PR adds such a quantization type as `Q3_K_XS` (a usage sketch is given at the end). The following table summarizes the new situation for LLaMA-v2-70B:

The table, plotted as a graph:

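For completeness, a hedged sketch of how a `Q3_K_XS` quantization guided by an importance matrix would be produced; the file names are placeholders, and the flags follow the stock `imatrix` and `quantize` tools:

```sh
# 1. Compute an importance matrix from wiki.train.raw using the full-precision model
./imatrix -m llama-2-70b-f16.gguf -f wiki.train.raw -o imatrix-wiki-train.dat

# 2. Quantize to the new Q3_K_XS type, guided by the importance matrix
./quantize --imatrix imatrix-wiki-train.dat \
    llama-2-70b-f16.gguf llama-2-70b-q3_k_xs.gguf Q3_K_XS
```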