Conversation

@chraac chraac commented Aug 5, 2025

Related to #51
Related to #34

Overview

This PR implements performance optimizations for the Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp. GEMV is a critical operation in the inference pipeline of large language models, especially during context processing and token generation.
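
For context, GEMV here is the matrix-vector product y = A·x, where A is a weight matrix and x is a single activation vector; the n=1 benchmark cases below correspond to exactly this shape, i.e. the second MUL_MAT operand has a single column (typically one token's activations during decoding).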

Changes

  • Improved cache utilization through better memory access patterns
  • Implemented vectorized operations using SIMD instructions where applicable
  • Reduced unnecessary memory allocations and copies
  • Optimized loop structures for better compiler auto-vectorization (see the illustrative sketch below)
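
As a rough illustration of the loop-structure and memory-access changes listed above (generic sketch code, not the PR's Hexagon HVX kernels; all names are made up for the example), an f32 GEMV can be arranged so the inner loop is a contiguous, unit-stride dot product that a compiler can auto-vectorize:

```cpp
#include <cstddef>

// Illustrative row-major f32 GEMV (y = A * x). The inner loop reads both
// operands with unit stride, which keeps accesses cache-friendly and lets
// the compiler auto-vectorize; the PR applies the same ideas to the
// Hexagon NPU kernels (including the f16 and quantized paths).
static void gemv_f32_ref(const float * a,     // rows x cols weight matrix, row-major
                         const float * x,     // input vector of length cols
                         float *       y,     // output vector of length rows
                         size_t        rows,
                         size_t        cols) {
    for (size_t r = 0; r < rows; ++r) {
        const float * row = a + r * cols;     // one contiguous matrix row
        float         acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += row[c] * x[c];             // unit-stride loads on both operands
        }
        y[r] = acc;
    }
}
```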

Performance Impact

Test setup: 8gen2
Test suite: test-backend-ops
Baseline: 6260c3166
Optimized: 471ab1d87

Key Findings - GEMV Operations (n=1)

| Matrix Type | Baseline Performance | Optimized Performance | Improvement |
| --- | --- | --- | --- |
| f32 | 8.23 GFLOPS | 8.79 GFLOPS | +6.8% |
| f16 | 12.42 GFLOPS | 12.99 GFLOPS | +4.6% |
| q4_0 | 16.33 GFLOPS | 16.69 GFLOPS | +2.2% |
| q4_K | 195.01 MFLOPS | 195.65 MFLOPS | +0.3% |

Summary

The GEMV optimization delivers meaningful performance improvements for the n=1 case, which is the specific target of this work:

  1. F32 format shows the best improvement: With a 6.8% speedup for matrix-vector operations, the optimization has the most significant impact on full-precision calculations

  2. F16 format also benefits: The half-precision format shows a solid 4.6% performance improvement

  3. Quantized formats see modest gains: The q4_0 format shows a 2.2% improvement, while q4_K sees minimal but still positive impact

This optimization is particularly beneficial for pure GEMV operations with n=1, which are critical for efficient token-by-token inference in large language models. The performance gains are consistent across different precision formats, with the most substantial improvements observed in full-precision calculations.

Log:

test-backend-ops-perf-all.release.hexagon.471ab1d87.log
test-backend-ops-perf-all.release.hexagon.6260c3166.log

Unit tests

Test setup: 8gen2
Test suite: test-backend-ops

  8133/8133 tests passed
  Backend hexagon-npu: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops-all.release.hexagon.471ab1d87.7z

chraac added 30 commits July 23, 2025 00:46
@chraac chraac requested a review from Copilot August 5, 2025 17:02
@chraac chraac self-assigned this Aug 5, 2025
@chraac chraac added the enhancement (New feature or request) label Aug 5, 2025

@chraac chraac requested a review from Copilot August 6, 2025 02:09

@chraac chraac requested a review from Copilot August 7, 2025 16:48

Copilot AI left a comment

Pull Request Overview

This PR implements performance optimizations for Generalized Matrix-Vector Multiplication (GEMV) operations in llama.cpp's QNN NPU backend. The optimization targets the critical n=1 case for efficient token-by-token inference in large language models, showing performance improvements of 2-7% across different precision formats.

  • Adds GLU operation support with SWIGLU activation implementation (a scalar sketch of SWIGLU follows after this list)
  • Improves GEMV-specific matrix multiplication path with optimized memory access patterns
  • Reorganizes headers and adds vectorized math operations for better code structure
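
For readers unfamiliar with the activation, SWIGLU gates one input with SiLU and multiplies it elementwise by the other. A minimal scalar sketch of that math (not the vectorized Hexagon implementation from this PR; the function name is illustrative) is:

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference for the SWIGLU gated activation:
//   out[i] = silu(gate[i]) * up[i],  where silu(v) = v / (1 + exp(-v)).
// This only shows the math; the PR computes it with vectorized Hexagon ops.
static void swiglu_ref(const float * gate, const float * up, float * out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const float g = gate[i];
        out[i] = (g / (1.0f + std::exp(-g))) * up[i];
    }
}
```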

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-qnn/npu/idl/hexagon_npu.idl | Adds GLU operation enum and tensor operation spec structure |
| ggml/src/ggml-qnn/npu/host/util.cpp | Adds GLU operation mapping to NPU operations |
| ggml/src/ggml-qnn/npu/host/host_device.cpp | Updates device support checking to use new operation spec structure |
| ggml/src/ggml-qnn/npu/device/vec_*.* | Implements vectorized math operations and reorganizes template definitions |
| ggml/src/ggml-qnn/npu/device/op_*.* | Updates operation implementations to support new GLU operations and GEMV optimization |
| ggml/src/ggml-qnn/npu/device/type_traits.* | Updates type definitions and improves source caching logic |

@@ -0,0 +1,1129 @@
#pragma once

Copilot AI Aug 7, 2025

This new 1100+ line file lacks a header comment explaining its purpose, scope, and relationship to other vector math components. Consider adding documentation explaining that this contains Hexagon-specific vectorized math implementations.

Comment on lines +448 to +451
bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
                                            hexagon::tensor * src1,
                                            bool is_src0_cached,
                                            bool is_src1_cached) {

Copilot AI Aug 7, 2025

The function signature has been changed to include is_src0_cached and is_src1_cached parameters, but the function name still suggests it only checks alignment. This could be misleading to callers who expect alignment checking only. Consider renaming to reflect the broader purpose or splitting into separate functions.

Suggested change:

-bool is_mul_mat_f16_f32_src_tensors_aligned(hexagon::tensor * src0,
-                                            hexagon::tensor * src1,
-                                            bool is_src0_cached,
-                                            bool is_src1_cached) {
+bool is_mul_mat_f16_f32_src_tensors_accessible_and_aligned(hexagon::tensor * src0,
+                                                           hexagon::tensor * src1,
+                                                           bool is_src0_cached,
+                                                           bool is_src1_cached) {

return false;
}

return false; // TODO: fix: for some input hexagon intrinsics will generate nan instead of inf.

Copilot AI Aug 7, 2025

The GLU operation is implemented but always returns false due to a TODO comment about NaN/Inf handling. This makes the implementation non-functional. Either fix the underlying issue or remove the GLU support until it can be properly implemented.

constexpr const size_t kMaxThreadCount = 4;
constexpr const size_t kDefaultStackSize = 1024 * 32; // 32KB
constexpr const unsigned long long kThreadTaskPendingBit = 1;
constexpr const size_t kMaxThreadCount = 4;

Copilot AI Aug 7, 2025

[nitpick] The thread stack size has been doubled from 32KB to 64KB. While this may be necessary for the new math operations, it increases memory usage per thread. Consider documenting why this increase was necessary or making it configurable based on the operations being performed.

Suggested change:

 constexpr const size_t kMaxThreadCount = 4;
+// Default thread stack size set to 64KB to support new math operations that require larger stack frames.
+// If your workload does not require such large stack frames, you may reduce this value to save memory.

@chraac chraac merged commit 379bdeb into dev-refactoring Aug 8, 2025
@chraac chraac deleted the dev-perf-opt-gemv branch September 18, 2025 04:34