Conversation

@chraac chraac commented Aug 5, 2025

Related to #51
Related to #34

Overview

This PR implements performance optimizations for the Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp. GEMV is a critical operation in the inference pipeline of large language models, especially during context processing and token generation.
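
For context, GEMV here is the matrix-vector product y = A·x, where A is a weight matrix and x is a single activation vector; the n=1 benchmark cases below correspond to exactly this shape, i.e. the second MUL_MAT operand has a single column (typically one token's activations during decoding).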

Changes

  • Improved cache utilization through better memory access patterns
  • Implemented vectorized operations using SIMD instructions where applicable
  • Reduced unnecessary memory allocations and copies
  • Optimized loop structures for better compiler auto-vectorization (see the illustrative sketch below)
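
As a rough illustration of the loop-structure and memory-access changes listed above (generic sketch code, not the PR's Hexagon HVX kernels; all names are made up for the example), an f32 GEMV can be arranged so the inner loop is a contiguous, unit-stride dot product that a compiler can auto-vectorize:

```cpp
#include <cstddef>

// Illustrative row-major f32 GEMV (y = A * x). The inner loop reads both
// operands with unit stride, which keeps accesses cache-friendly and lets
// the compiler auto-vectorize; the PR applies the same ideas to the
// Hexagon NPU kernels (including the f16 and quantized paths).
static void gemv_f32_ref(const float * a,     // rows x cols weight matrix, row-major
                         const float * x,     // input vector of length cols
                         float *       y,     // output vector of length rows
                         size_t        rows,
                         size_t        cols) {
    for (size_t r = 0; r < rows; ++r) {
        const float * row = a + r * cols;     // one contiguous matrix row
        float         acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += row[c] * x[c];             // unit-stride loads on both operands
        }
        y[r] = acc;
    }
}
```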

Performance Impact

Test setup: 8gen2
Test suite: test-backend-ops
Baseline: 6260c3166
Optimized: 471ab1d87

Key Findings - GEMV Operations (n=1)

| Matrix Type | Baseline Performance | Optimized Performance | Improvement |
| --- | --- | --- | --- |
| f32 | 8.23 GFLOPS | 8.79 GFLOPS | +6.8% |
| f16 | 12.42 GFLOPS | 12.99 GFLOPS | +4.6% |
| q4_0 | 16.33 GFLOPS | 16.69 GFLOPS | +2.2% |
| q4_K | 195.01 MFLOPS | 195.65 MFLOPS | +0.3% |

Summary

The GEMV optimization delivers meaningful performance improvements for the n=1 case, which is the specific target of this work:

  1. F32 format shows the best improvement: With a 6.8% speedup for matrix-vector operations, the optimization has the most significant impact on full-precision calculations

  2. F16 format also benefits: The half-precision format shows a solid 4.6% performance improvement

  3. Quantized formats see modest gains: The q4_0 format shows a 2.2% improvement, while q4_K sees minimal but still positive impact

This optimization is particularly beneficial for pure GEMV operations with n=1, which are critical for efficient token-by-token inference in large language models. The performance gains are consistent across different precision formats, with the most substantial improvements observed in full-precision calculations.

Log:

test-backend-ops-perf-all.release.hexagon.471ab1d87.log
test-backend-ops-perf-all.release.hexagon.6260c3166.log

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

  8133/8133 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops-all.release.hexagon.471ab1d87.7z

chraac added 30 commits July 23, 2025 00:46
@chraac chraac requested a review from Copilot August 5, 2025 17:02
@chraac chraac self-assigned this Aug 5, 2025
@chraac chraac added the enhancement (New feature or request) label Aug 5, 2025

@chraac chraac requested a review from Copilot August 6, 2025 02:09

@chraac chraac requested a review from Copilot August 7, 2025 16:48

Copilot AI left a comment

Pull Request Overview

This PR implements performance optimizations for Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp's QNN NPU backend. The optimization targets the critical n=1 case for efficient token-by-token inference in large language models, showing performance improvements of 2-7% across different precision formats.

  • Adds GLU operation support with SWIGLU activation implementation (a scalar sketch of SWIGLU follows after this list)
  • Improves GEMV-specific matrix multiplication path with optimized memory access patterns
  • Reorganizes headers and adds vectorized math operations for better code structure
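
For readers unfamiliar with the activation, SWIGLU gates one input with SiLU and multiplies it elementwise by the other. A minimal scalar sketch of that math (not the vectorized Hexagon implementation from this PR; the function name is illustrative) is:

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference for the SWIGLU gated activation:
//   out[i] = silu(gate[i]) * up[i],  where silu(v) = v / (1 + exp(-v)).
// This only shows the math; the PR computes it with vectorized Hexagon ops.
static void swiglu_ref(const float * gate, const float * up, float * out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float g = gate[i];
        out[i] = (g / (1.0f + std::exp(-g))) * up[i];
    }
}
```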

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds GLU operation enum and tensor operation spec structure |
| ggml/src/ggml-qnn/npu/host/util.cpp | Adds GLU operation mapping to NPU operations |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Updates device support checking to use new operation spec structure |
| ggml/src/ggml-qnn/npu/device/vec_*.* | Implements vectorized math operations and reorganizes template definitions |
| ggml/src/ggml-qnn/npu/device/op_*.* | Updates operation implementations to support new GLU operations and GEMV optimization |
| ggml/src/ggml-qnn/npu/device/type_traits.* | Updates type definitions and improves source caching logic |

@@ -0,0 +1,1129 @@
#pragma once

Copilot AI Aug 7, 2025

This new 1100+ line file lacks a header comment explaining its purpose, scope, and relationship to other vector math components. Consider adding documentation explaining that this contains Hexagon-specific vectorized math implementations.

Comment on lines +448 to +451
bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
                                            hexagon::tensor * src1,
                                            bool is_src0_cached,
                                            bool is_src1_cached) {

Copilot AI Aug 7, 2025

The function signature has been changed to include is_src0_cached and is_src1_cached parameters, but the function name still suggests it only checks alignment. This could be misleading to callers who expect alignment checking only. Consider renaming to reflect the broader purpose or splitting into separate functions.

Suggested change:

-bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
-                                            hexagon::tensor * src1,
-                                            bool is_src0_cached,
-                                            bool is_src1_cached) {
+bool is_mul_mat_f16_f32_src_tensors_accessible_and_aligned(hexagon::tensor * src0,
+                                                           hexagon::tensor * src1,
+                                                           bool is_src0_cached,
+                                                           bool is_src1_cached) {

return false;
}

return false; // TODO: fix: for some input hexagon intrinsics will generate nan instead of inf.

Copilot AI Aug 7, 2025

The GLU operation is implemented but always returns false due to a TODO comment about NaN/Inf handling. This makes the implementation non-functional. Either fix the underlying issue or remove the GLU support until it can be properly implemented.

constexpr const size_t kMaxThreadCount = 4;
constexpr const size_t kDefaultStackSize = 1024 * 32; // 32KB
constexpr const unsigned long long kThreadTaskPendingBit = 1;
constexpr const size_t kMaxThreadCount = 4;

Copilot AI Aug 7, 2025

[nitpick] The thread stack size has been doubled from 32KB to 64KB. While this may be necessary for the new math operations, it increases memory usage per thread. Consider documenting why this increase was necessary or making it configurable based on the operations being performed.

Suggested change:

 constexpr const size_t kMaxThreadCount = 4;
+// Default thread stack size set to 64KB to support new math operations that require larger stack frames.
+// If your workload does not require such large stack frames, you may reduce this value to save memory.

@chraac chraac merged commit 379bdeb into dev-refactoring Aug 8, 2025
@chraac chraac deleted the dev-perf-opt-gemv branch September 18, 2025 04:34