
Conversation

@JohannesGaessler
Collaborator

Fixes #17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason, as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators , please test changes to the CUDA/HIP backend for correctness using test-backend-ops.
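
For anyone reproducing this, a minimal sketch of such a correctness run, assuming the default CMake build layout (the MUL_MAT filter is only an example, not a requirement of this PR):

    # run the full backend correctness suite on the HIP/ROCm build
    ./build/bin/test-backend-ops test
    # optionally restrict it to the matrix multiplication ops touched by this change
    ./build/bin/test-backend-ops test -o MUL_MAT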

Contributor

@Beinsezii left a comment


Full test-backend-ops green now on gfx1100

@Beinsezii
Contributor

Though interestingly, according to llama-bench I'm down 40% on pp8192 now compared to before #17576. Is it not using the rocWMMA path anymore?

@Beinsezii
Contributor

Looking at the build command in #17576 (comment), they have GGML_HIP_ROCWMMA_FATTN=OFF, so I assume it was never tested with it ON, and it's no longer functional.

@Beinsezii
Contributor

Beinsezii commented Dec 6, 2025

Yeah.

Using a realistic workload:

    bin/llama-bench -m ~/.cache/llama.cpp/Beinsezii_Mistral-Small-3.2-24B-Instruct-2506-Q6F-Q8A-GGUF_mistral-small-3.2-24b-instruct-2506-q6f-q8a.gguf -fa 1 -p 8192 -n 512 -pg 16384,1024 -r 1

All MMQ commits + this one: https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf_mmq
GGML_HIP_ROCWMMA_FATTN=OFF

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        708.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.72 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        308.02 ± 0.00 |

GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |        726.87 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.74 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        314.46 ± 0.00 |

All recent MMQ commits reverted:
https://github.com/Beinsezii/llama.cpp/tree/rdna3_perf
GGML_HIP_ROCWMMA_FATTN=ON

| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |          pp8192 |       1042.45 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |           tg512 |         36.68 ± 0.00 |
| llama 13B Q8_0                 |  18.79 GiB |    23.57 B | ROCm       |  99 |  1 |  pp16384+tg1024 |        357.65 ± 0.00 |

I still think this should get merged first to stop the failures, but maybe I should open a new issue for the perf regression? I assume it'll be fixed by #17495 eventually, since that will replace the rocWMMA path.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Dec 6, 2025
@JohannesGaessler
Collaborator Author

Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.
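
For reference, a sketch of what that configure step could look like on top of the HIP build commands already posted in this thread (the GPU target and job count are just examples):

    # force the rocBLAS path for all matrix multiplications, bypassing MMQ/MMF
    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_CUBLAS=ON \
        -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16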

@Beinsezii
Contributor

> Any differences you see should be from MMQ vs. rocBLAS. If you compile with GGML_CUDA_FORCE_CUBLAS=ON you can use rocBLAS unconditionally.

Rebuilt against this PR and confirmed perf is good with cuBLAS. It might be worth making that the default again until the other PR is ready; people will really notice 1/3 of the throughput being gone.

@JohannesGaessler JohannesGaessler merged commit f334b79 into ggml-org:master Dec 6, 2025
71 of 75 checks passed
@arch-btw
Contributor

arch-btw commented Dec 6, 2025

I don't know if it was within the scope of this PR, but building with GGML_HIP_ROCWMMA_FATTN=ON is still broken.

    HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16

build.log
