
@JohannesGaessler
Collaborator

This PR aims to implement CUDA kernels for matrix-vector multiplication that utilize dot products with quantized data instead of dequantizing the data on the fly. So far this is only implemented for q4_0. In order to get good performance, integer intrinsics are used. Unfortunately these have very poor performance on Pascal cards, so the current implementation with dequantization should be kept. For my RTX 3090 I found:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 7b q4_0 | tg128 | 91.06 | 101.39 | 1.11 |
| RTX 3090 | 13b q4_0 | tg128 | 51.88 | 57.95 | 1.12 |
| RTX 3090 | 33b q4_0 | tg128 | 22.83 | 25.71 | 1.13 |

For master I used the option LLAMA_CUDA_DMMV_F16, which uses f16 intrinsics for the calculation. Since this option is also only beneficial on relatively new cards and is seemingly inferior to integer intrinsics, I would suggest removing the f16 option in favor of this implementation.
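
For illustration, a minimal sketch of the integer-intrinsic dot product described above (not the exact kernel from this PR; the function name and per-call work split are hypothetical, and ggml's q4_0 layout is assumed: 16 quant bytes per 32-weight block, low nibbles holding weights 0-15 and high nibbles holding weights 16-31):

```cuda
// One q4_0 block (scale d4, 32 4-bit weights stored as 16 bytes) dotted against
// 32 already-quantized q8 activations (scale d8) using integer intrinsics.
__device__ float vec_dot_q4_0_q8_sketch(const int * vq4, const int * vq8,
                                        const float d4, const float d8) {
    int sumi = 0;
#pragma unroll
    for (int i = 0; i < 4; ++i) {                // 4 ints = 16 bytes = 32 nibbles
        const int vi = vq4[i];
        const int lo = __vsub4((vi >> 0) & 0x0F0F0F0F, 0x08080808); // weights 4*i .. 4*i+3
        const int hi = __vsub4((vi >> 4) & 0x0F0F0F0F, 0x08080808); // weights 16+4*i .. 16+4*i+3
        sumi = __dp4a(lo, vq8[i],     sumi);     // 4 signed byte products per call
        sumi = __dp4a(hi, vq8[i + 4], sumi);
    }
    return d4 * d8 * sumi;                       // rescale the integer sum once per block
}
```

The inner sum stays entirely in integer registers and the two float scales are applied only once per block, which is the gist of the speedup over dequantizing on the fly.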

@JohannesGaessler
Collaborator Author

I forgot: because I'm changing the way quantization is used in this PR, I would like to prioritize it over #2043 and then think again about how to approach the dequantization for that PR.

@slaren
Member

slaren commented Jul 1, 2023

This is probably not going to make much of a difference in practice, but the __syncthreads or __syncwarp before the warp shuffles shouldn't be necessary, since these already imply a sync (at least since they gained the _sync suffix; it wasn't always the case). See this for more details: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

> All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.

You may also gain some additional performance if instead of quantizing the vector to DRAM, you do it to shared memory at the beginning of vec_dot_q. It should be small enough to fit, in most cases at least.
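
As a concrete illustration of the first point, a minimal warp-sum helper (illustrative, not code from this PR): each *_sync shuffle synchronizes the lanes named in its mask, so no explicit __syncwarp() is needed beforehand.

```cuda
// Sum a value across the 32 lanes of a warp using only synchronized shuffles.
__device__ float warp_reduce_sum(float x) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xFFFFFFFF, x, offset, 32);  // already implies the needed sync
    }
    return x;  // every lane ends up holding the full warp sum
}
```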

@JohannesGaessler
Collaborator Author

> You may also gain some additional performance if instead of quantizing the vector to DRAM, you do it to shared memory at the beginning of vec_dot_q. It should be small enough to fit, in most cases at least.

The problem is that the vector is loaded thousands of times by different blocks. So I think that quantizing it once and writing the quantized version to DRAM is faster than quantizing it thousands of times.

@slaren
Member

slaren commented Jul 1, 2023

You could use one block only and a lot more threads, and compute each row in a different warp. That worked for me in some tests I have been doing with the attention, but these matrices/vectors are very small, and it may not work so well for other (larger) matrix-vector multiplications.

@JohannesGaessler
Collaborator Author

If I remember correctly, the maximum block size is 1024 threads / 32 warps. So for 7b, where the smallest matrix has 4096 rows, that would still mean quantizing the vector 128 times (4096 / 32). Currently the quantization to q8_0 takes up 2.7% of the runtime, so I don't think doing it several times is going to be viable.

@slaren
Member

slaren commented Jul 1, 2023

If you adjust the block and grid size so that all blocks can be executed simultaneously, it may not matter that you have to quantize in each block, since it will be done simultaneously anyway. That may not work if the number of blocks is higher than the capacity of the GPU, but in that case you can still compute multiple rows in each warp and adjust the number of blocks accordingly. The 3090 can execute 1536 threads per SM, so a block size of 768 to fit two blocks in each SM may work best.
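
For reference, a small host-side sketch of the occupancy arithmetic behind this suggestion (it deliberately ignores register and shared-memory limits, which can also cap residency):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int block_size = 768;  // 24 warps; two such blocks fill the 1536-thread SM limit on CC 8.6
    const int resident_blocks = prop.maxThreadsPerMultiProcessor / block_size   // 2 on a 3090
                              * prop.multiProcessorCount;                       // * 82 SMs = 164
    printf("up to %d blocks of %d threads resident at once\n", resident_blocks, block_size);
    return 0;
}
```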

@Midaychi

Midaychi commented Jul 2, 2023

> Unfortunately these have very poor performance on Pascal cards, so the current implementation with dequantization should be kept.

If you use fp32-based operations on Pascal cards instead of fp16, they should have much better performance.

@JohannesGaessler
Collaborator Author

I'm not using any f16 intrinsics. The option for that is already on master. I'm using __vsub4 and __dp4a to do byte-wise subtractions and dot products on integers.
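
A worked example of what those two intrinsics compute on four packed bytes (the values are chosen purely for illustration):

```cuda
// __vsub4(a, b): byte-wise a - b; __dp4a(a, b, c): c + sum of the four signed byte products.
__global__ void dp4a_demo(int * out) {
    const int nibbles  = 0x0F010008;                    // bytes, low to high: 8, 0, 1, 15
    const int centered = __vsub4(nibbles, 0x08080808);  // byte-wise minus 8: 0, -8, -7, 7
    const int q8vals   = 0x01FF02FD;                    // signed bytes: -3, 2, -1, 1
    *out = __dp4a(centered, q8vals, 0);                 // 0*(-3) + (-8)*2 + (-7)*(-1) + 7*1 = -2
}
```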

@JohannesGaessler
Collaborator Author

I have implemented a kernel for q4_1 and to my surprise I've found that the performance is ~10% better than for q4_0. The reason seems to be that, due to q4_1 having a block size of 20 bytes vs. the 18 bytes of q4_0, it is possible to directly cast the pointer for the quants to int instead of having to resort to memcpy. Since I'm currently still using memcpy for q8_0, this implies that performance could be significantly improved by padding or reordering the q8_0 vector; I'll investigate.

More generally this may also mean that reordering the weights in some way may be of benefit after all.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Jul 3, 2023

I pushed a version in which the vector is quantized to q8_1 (36-byte blocks) instead of q8_0 (34-byte blocks). This allows directly casting the int8 quant pointers to int32 pointers, which is significantly faster. With this I get 123 t/s for q4_0 using an RTX 3090. Reordering the data so that the scales and quants are in two separate blocks seems to have slightly worse performance, presumably due to cache locality.
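
A sketch of the alignment argument (the helper names are hypothetical; the block sizes follow ggml: 34 bytes for q8_0, 36 bytes for q8_1):

```cuda
#include <cstdint>
#include <cstring>

// 34-byte blocks: the quant bytes of consecutive blocks are only guaranteed 2-byte
// alignment, so four consecutive int8 values have to be gathered with memcpy.
__device__ __forceinline__ int load_q8_unaligned(const int8_t * qs, const int i) {
    int v;
    memcpy(&v, qs + 4*i, sizeof(int));   // compiles to several narrow loads
    return v;
}

// 36-byte blocks: the quant bytes stay 4-byte aligned across blocks,
// so the pointer can be reinterpreted and read with a single 32-bit load.
__device__ __forceinline__ int load_q8_aligned(const int8_t * qs, const int i) {
    return ((const int *) qs)[i];
}
```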

@casper-hansen

> I have implemented a kernel for q4_1 and to my surprise I've found that the performance is ~10% better than for q4_0. The reason seems to be that, due to q4_1 having a block size of 20 bytes vs. the 18 bytes of q4_0, it is possible to directly cast the pointer for the quants to int instead of having to resort to memcpy. Since I'm currently still using memcpy for q8_0, this implies that performance could be significantly improved by padding or reordering the q8_0 vector; I'll investigate.
>
> More generally this may also mean that reordering the weights in some way may be of benefit after all.

GPTQ implements a reordering approach based on quantization error: weights with the smallest error come first and weights with the largest error last.

Not sure if it's possible to achieve in llama.cpp; a side effect in GPTQ seemed to be performance issues.

@JohannesGaessler
Collaborator Author

I don't mean changing the order of the weights itself; I mean changing the way the data is laid out for better memory alignment.

@JohannesGaessler
Collaborator Author

I pushed implementations for q5_0, q5_1, and q8_0. I think I've picked the low-hanging fruit in terms of performance, so I'll focus on making the new features usable now. Since the integer intrinsics seem to rely on hardware support, I think I'll enable them based on compute capability. Ideally I can just set two compute capabilities in cmake and it will automatically use the highest one that a particular GPU supports.
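
One way such gating could look at compile time (a sketch; the macro name and the fallback are illustrative, not necessarily what the PR does; __dp4a exists in hardware from compute capability 6.1 onwards):

```cuda
#include <cstdint>

#define MIN_CC_DP4A 610  // lowest compute capability with a hardware dp4a instruction

static __device__ __forceinline__ int dot_bytes(const int a, const int b, int acc) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= MIN_CC_DP4A
    return __dp4a(a, b, acc);                            // single hardware instruction
#else
    // Scalar fallback for older architectures (or when the fast path is disabled).
#pragma unroll
    for (int k = 0; k < 4; ++k) {
        acc += (int)(int8_t)(a >> 8*k) * (int)(int8_t)(b >> 8*k);
    }
    return acc;
#endif
}
```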

@JohannesGaessler
Collaborator Author

@slaren do you think we should keep the dequantize_mul_mat_vec implementations using f16 intrinsics? They were slightly faster on recent NVIDIA cards but the integer intrinsics seem to be superior for those cases.

@slaren
Member

slaren commented Jul 4, 2023

I think that can still be useful for f16 models, so I would say keep it.

@JohannesGaessler marked this pull request as ready for review July 4, 2023 16:20
@JohannesGaessler
Collaborator Author

Alright, I now consider this ready to be merged. By default the new kernels are used (if the compute capability is high enough); the old DMMV kernels can still be used by setting LLAMA_CUDA_FORCE_DMMV. These are the final performance numbers on my system:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
| --- | --- | --- | --- | --- | --- |
| RTX 3090 | 7b q4_0 | tg128 | 90.40 | 121.52 | 1.34 |
| RTX 3090 | 13b q4_0 | tg128 | 51.32 | 69.23 | 1.35 |
| RTX 3090 | 33b q4_0 | tg128 | 22.65 | 31.91 | 1.41 |
| RTX 3090 | 7b q4_1 | tg128 | 84.30 | 115.00 | 1.36 |
| RTX 3090 | 7b q5_0 | tg128 | 60.75 | 103.35 | 1.70 |
| RTX 3090 | 7b q5_1 | tg128 | 60.69 | 99.08 | 1.63 |
| RTX 3090 | 7b q8_0 | tg128 | 72.89 | 77.55 | 1.06 |

Review comment on CMakeLists.txt (outdated)
Member


Since the lowest compute capability for the integer intrinsics is 70 in practice, I think this could be changed too, if only for clarity.

Collaborator Author


For a single GPU I would agree, but for multi-GPU setups that would be an issue. If you were to combine e.g. a Pascal and an Ampere card, you would want to use the integer intrinsics on the 8.6 Ampere card (but not on the 6.1 Pascal card). The decision of which implementation to use can be made at runtime by checking the compute capability per card, but only if the integer intrinsics are available at compile time.
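
A sketch of that per-device runtime decision (the helper name and the force flag are illustrative, not the PR's actual symbols):

```cuda
#include <cuda_runtime.h>

static bool device_supports_dp4a(int device) {
    int major = 0, minor = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
    return 10*major + minor >= 61;   // __dp4a is available in hardware from CC 6.1
}

// At launch time each GPU then picks its own path, e.g.:
//   const bool use_mul_mat_vec_q = device_supports_dp4a(dev) && !force_dmmv;
```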

@JohannesGaessler merged commit 924dd22 into ggml-org:master Jul 5, 2023
Member

@ggerganov left a comment


Nice speed-up! 🦙

My guess is that a similar approach for qmat x qmat should result in better performance than the existing mat x mat using cuBLAS.

@mirek190

mirek190 commented Jul 5, 2023

Is it possible to improve the q_K models like that, too?

@JohannesGaessler
Collaborator Author

It is very likely possible to apply the same techniques to q_K models. The reason I didn't do it is merely that the CUDA implementation for those was done very differently compared to the older quantization methods, which use a template. So I would rather work out all of the details on the older quantization methods before I touch half a dozen different k-quant implementations.

@mirek190

mirek190 commented Jul 5, 2023

I am asking because q4_K_M has very similar perplexity to q5_1 ... BUT a 33B model (63 layers) in q5_1 cannot be put entirely on a consumer GPU (RTX 3090 / 4090 with 24 GB); on the other hand q4_K_M fits perfectly, and there I get 18.5 t/s ... thinking I COULD get something close to 30 t/s with 33B and q4_K_M .... just OMG

@JohannesGaessler
Collaborator Author

Sorry, but you'll just need to be patient.

@LostRuins
Collaborator

Ever since this was merged, I am getting rubbish outputs when using CUDA (ref #2136).

The outputs are normal if GGML_CUDA_FORCE_DMMV is set to true, or if 0 layers are offloaded. Otherwise, the output ranges from a mix of garbled tokens to just a single repeated token.
