feat: perf opt gemv #54
Conversation
- …d improved GLU implementations
- …ple source tensors
- … indexing in thread pool" This reverts commit 40e3f09.
Pull Request Overview
This PR implements performance optimizations for Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp's QNN NPU backend. The optimization targets the critical n=1 case for efficient token-by-token inference in large language models, showing performance improvements of 2-7% across different precision formats.
- Adds GLU operation support with SWIGLU activation implementation
- Improves GEMV-specific matrix multiplication path with optimized memory access patterns
- Reorganizes headers and adds vectorized math operations for better code structure
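For readers unfamiliar with why the n=1 case is singled out, here is a minimal sketch (not the PR's code; `gemv_f32` and `row_dot_f32` are illustrative names): with a single input vector, the matrix multiplication collapses into one dot product per matrix row, so each thread can stream a contiguous block of rows sequentially.

```cpp
#include <algorithm>
#include <cstddef>

static float row_dot_f32(const float * row, const float * x, size_t k) {
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        acc += row[i] * x[i];  // the real backend would use HVX intrinsics here
    }
    return acc;
}

// out[r] = dot(A_row[r], x); rows are split into contiguous blocks per thread so the
// per-thread memory access pattern stays sequential.
void gemv_f32(const float * a, const float * x, float * out,
              size_t rows, size_t k, size_t tid, size_t nthreads) {
    const size_t per_thread = (rows + nthreads - 1) / nthreads;
    const size_t begin      = std::min(tid * per_thread, rows);
    const size_t end        = std::min(begin + per_thread, rows);
    for (size_t r = begin; r < end; ++r) {
        out[r] = row_dot_f32(a + r * k, x, k);
    }
}
```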
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds GLU operation enum and tensor operation spec structure |
| ggml/src/ggml-qnn/npu/host/util.cpp | Adds GLU operation mapping to NPU operations |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Updates device support checking to use new operation spec structure |
| ggml/src/ggml-qnn/npu/device/vec_* | Implements vectorized math operations and reorganizes template definitions |
| ggml/src/ggml-qnn/npu/device/op_* | Updates operation implementations to support new GLU operations and GEMV optimization |
| ggml/src/ggml-qnn/npu/device/type_traits.* | Updates type definitions and improves source caching logic |
```diff
@@ -0,0 +1,1129 @@
+#pragma once
```
Copilot AI (Aug 7, 2025)
This new 1100+ line file lacks a header comment explaining its purpose, scope, and relationship to other vector math components. Consider adding documentation explaining that this contains Hexagon-specific vectorized math implementations.
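A header comment along these lines would address the point; the wording, and the assumption about which helpers the file contains, is illustrative only:

```cpp
#pragma once

// Hexagon-specific vectorized math helpers (dot products, exp/sigmoid approximations,
// reduction utilities) shared by the NPU operator implementations in device/op_*.
// Scalar reference behavior lives in the generic ggml code; this header holds only the
// HVX-optimized variants and the templates they share.
```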
```cpp
bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
                                            hexagon::tensor * src1,
                                            bool is_src0_cached,
                                            bool is_src1_cached) {
```
Copilot AI (Aug 7, 2025)
The function signature has been changed to include is_src0_cached and is_src1_cached parameters, but the function name still suggests it only checks alignment. This could be misleading to callers who expect alignment checking only. Consider renaming to reflect the broader purpose or splitting into separate functions.
Suggested change:

```diff
-bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
-                                            hexagon::tensor * src1,
-                                            bool is_src0_cached,
-                                            bool is_src1_cached) {
+bool is_mul_mat_f16_f32_src_tensors_accessible_and_aligned(hexagon::tensor * src0,
+                                                           hexagon::tensor * src1,
+                                                           bool is_src0_cached,
+                                                           bool is_src1_cached) {
```
```cpp
        return false;
    }

    return false;  // TODO: fix: for some input hexagon intrinsics will generate nan instead of inf.
```
Copilot AI (Aug 7, 2025)
The GLU operation is implemented but always returns false due to a TODO comment about NaN/Inf handling. This makes the implementation non-functional. Either fix the underlying issue or remove the GLU support until it can be properly implemented.
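For context, here is a scalar sketch of the gating math and of one common mitigation (clamping the exponent argument so the vectorized exp never reaches the problematic range). The clamp threshold and function names are illustrative and not taken from this PR:

```cpp
#include <cmath>

// Reference SiLU/SWiGLU gating: silu(x) = x * sigmoid(x) = x / (1 + exp(-x)).
// For very negative x, exp(-x) should overflow to +inf so the result tends to -0;
// the TODO above says the Hexagon intrinsic yields NaN in that range instead.
static inline float silu_ref(float x) {
    return x / (1.0f + std::exp(-x));
}

static inline float swiglu_ref(float x, float gate) {
    return silu_ref(x) * gate;
}

// One possible mitigation: clamp the argument so exp() stays finite (threshold illustrative).
static inline float silu_clamped(float x) {
    const float t = x < -87.0f ? -87.0f : x;  // expf overflows near x ≈ 88.7
    return x / (1.0f + std::exp(-t));
}
```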
```cpp
constexpr const size_t kMaxThreadCount = 4;
constexpr const size_t kDefaultStackSize = 1024 * 32;  // 32KB
constexpr const unsigned long long kThreadTaskPendingBit = 1;
constexpr const size_t kMaxThreadCount = 4;
```
Copilot AI (Aug 7, 2025)
[nitpick] The thread stack size has been doubled from 32KB to 64KB. While this may be necessary for the new math operations, it increases memory usage per thread. Consider documenting why this increase was necessary or making it configurable based on the operations being performed.
Suggested change:

```diff
 constexpr const size_t kMaxThreadCount = 4;
+// Default thread stack size set to 64KB to support new math operations that require larger stack frames.
+// If your workload does not require such large stack frames, you may reduce this value to save memory.
```
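If configurability is preferred over a fixed constant, one low-cost approach is a build-time override; the macro name here is hypothetical, not something the backend currently defines:

```cpp
#include <cstddef>

// Allow the worker stack size to be overridden at build time; the default stays at 64KB
// for the heavier vectorized math kernels.
#ifndef NPU_DEVICE_THREAD_STACK_SIZE
#    define NPU_DEVICE_THREAD_STACK_SIZE (1024 * 64)
#endif

constexpr const size_t kDefaultStackSize = NPU_DEVICE_THREAD_STACK_SIZE;
```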
Related to #51
Related to #34
Overview
This PR implements performance optimizations for the Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp. GEMV is a critical operation in the inference pipeline of large language models, especially during context processing and token generation.
Changes
Performance Impact
Test setup: 8gen2
Test suite: test-backend-ops
Baseline: 6260c3166
Optimized: 471ab1d87
Key Findings - GEMV Operations (n=1)
Summary
The GEMV optimization demonstrates meaningful performance improvements for the n=1 case, which is the specific target of this optimization:
- F32 format shows the best improvement: with a 6.8% speedup for matrix-vector operations, the optimization has the most significant impact on full-precision calculations.
- F16 format also benefits: the half-precision format shows a solid 4.6% performance improvement.
- Quantized formats see modest gains: the q4_0 format shows a 2.2% improvement, while q4_K sees a minimal but still positive impact.
This optimization is particularly beneficial for pure GEMV operations with n=1, which are critical for efficient token-by-token inference in large language models. The performance gains are consistent across different precision formats, with the most substantial improvements observed in full-precision calculations.
Log:
test-backend-ops-perf-all.release.hexagon.471ab1d87.log
test-backend-ops-perf-all.release.hexagon.6260c3166.log
Unit tests
Test setup: 8gen2
Test suite: test-backend-ops
Full log:
test-backend-ops-all.release.hexagon.471ab1d87.7z