
Conversation

@JohannesGaessler
Collaborator

Fixes #17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason, as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators , please test changes to the CUDA/HIP backend for correctness using test-backend-ops.
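
For anyone reproducing this, a minimal sketch of such a correctness run, assuming the default CMake build layout (the MUL_MAT filter is only an example, not a requirement of this PR):

    # run the full backend correctness suite on the HIP/ROCm build
    ./build/bin/test-backend-ops test
    # optionally restrict it to the matrix multiplication ops touched by this change
    ./build/bin/test-backend-ops test -o MUL_MAT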

Contributor

@Beinsezii left a comment


Full test-backend-ops green now on gfx1100

@Beinsezii
Contributor

Though interestingly, according to llama-bench I'm down 40% on pp8192 now compared to before #17576. Is it not using the rocWMMA path anymore?

@Beinsezii
Contributor

Looking at the build command in #17576 (comment), they have GGML_HIP_ROCWMMA_FATTN=OFF, so I assume it was never tested with it ON, and it's no longer functional.

@Beinsezii
Contributor

Beinsezii commented Dec 6, 2025

Yeah.

Using a realistic workload:

    bin/llama-bench -m ~/.cache/llama.cpp/Beinsezii_Mistral-Small-3.2-24B-Instruct-2506-Q6F-Q8A-GGUF_mistral-small-3.2-24b-instruct-2506-q6f-q8a.gguf -fa 1 -p 8192 -n 512 -pg 16384,1024 -r 1

All MMQ commits + this one: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf_mmq
GGML_HIP_ROCWMMA_FATTN=OFF

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        708.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        308.02 ± 0.00 |

GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        726.87 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.74 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        314.46 ± 0.00 |

All recent MMQ commits reverted:
https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf
GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |       1042.45 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.68 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        357.65 ± 0.00 |

I still think this should get merged first to stop the failures, but maybe I should open a new issue for the perf regression? I assume it'll be fixed by #17495 eventually, since that will replace the rocWMMA path.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 6, 2025
@JohannesGaessler
Collaborator Author

Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.
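
For reference, a sketch of what that configure step could look like on top of the HIP build commands already posted in this thread (the GPU target and job count are just examples):

    # force the rocBLAS path for all matrix multiplications, bypassing MMQ/MMF
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_CUBLAS=ON \
        -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16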

@Beinsezii
Contributor

> Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.

Rebuilt against this PR and confirmed perf is good with cuBLAS. It might be worth making that the default again until the other PR is ready; people will really notice 1/3 of the throughput being gone.

@JohannesGaessler JohannesGaessler merged commit f334b79 into ggml-org:master Dec 6, 2025
71 of 75 checks passed
@arch-btw
Contributor

arch-btw commented Dec 6, 2025

I don't know if it was within the scope of this PR, but building with GGML_HIP_ROCWMMA_FATTN=ON is still broken.

    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

build.log
