Releases: ggml-org/llama.cpp

b7315

07 Dec 22:07
4d37262

Warning

Release Format Update: Linux releases will soon use .tar.gz archives instead of .zip. Please make the necessary changes to your deployment scripts.

model: add llama 4 scaling for mistral-large (deepseek arch) (#17744)

macOS/iOS:

Linux:

Windows:

b7314

07 Dec 18:01
08f9d3c

Vulkan: improve mul_mat_vec_iq1_m (#16907)

  • Optimize Vulkan shader for matrix-vector multiplication

  • Revert changes on compute_outputs and main

Refactor compute_outputs to handle remaining rows correctly.

  • Fix trailing whitespace
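
The compute_outputs change above concerns how a matrix-vector multiply splits its output rows into fixed-size blocks and still produces the rows left over when the row count is not a multiple of the block size. A minimal CPU-side C++ sketch of that remainder-handling pattern (illustrative only, not the Vulkan shader touched by this PR; all names are made up):

    #include <cstddef>
    #include <vector>

    // Illustrative: compute y = A * x by processing output rows in blocks of
    // NUM_ROWS, then letting a final call handle whatever rows remain.
    constexpr std::size_t NUM_ROWS = 4; // rows per "workgroup" in this sketch

    static void mat_vec_rows(const std::vector<std::vector<float>> & A,
                             const std::vector<float> & x,
                             std::vector<float> & y,
                             std::size_t first_row, std::size_t nrows) {
        for (std::size_t r = first_row; r < first_row + nrows; ++r) {
            float acc = 0.0f;
            for (std::size_t c = 0; c < x.size(); ++c) {
                acc += A[r][c] * x[c];
            }
            y[r] = acc;
        }
    }

    static void mat_vec(const std::vector<std::vector<float>> & A,
                        const std::vector<float> & x,
                        std::vector<float> & y) {
        const std::size_t total = A.size();
        std::size_t row = 0;
        for (; row + NUM_ROWS <= total; row += NUM_ROWS) {
            mat_vec_rows(A, x, y, row, NUM_ROWS);    // full blocks
        }
        if (row < total) {
            mat_vec_rows(A, x, y, row, total - row); // remaining rows
        }
    }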

macOS/iOS:

Linux:

Windows:

b7313

07 Dec 14:44
0a540f9

ci : add windows-cuda 13.1 release (#17839)

macOS/iOS:

Linux:

Windows:

b7312

07 Dec 03:18
2257758

common : change --color to accept on/off/auto, default to auto (#17827)
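
With --color now tri-state, the usual handling is: on forces colour, off disables it, and auto enables colour only when output goes to a terminal. A hedged C++ sketch of that on/off/auto pattern (generic illustration, not the actual common-library implementation; the auto-means-isatty behaviour is an assumption):

    #include <cstdio>
    #include <cstring>
    #include <unistd.h> // isatty (POSIX)

    enum class color_mode { on, off, automatic };

    // Parse an on/off/auto value; anything else falls back to auto (the default).
    static color_mode parse_color(const char * v) {
        if (std::strcmp(v, "on")  == 0) return color_mode::on;
        if (std::strcmp(v, "off") == 0) return color_mode::off;
        return color_mode::automatic;
    }

    // Assumption: under "auto", colour is used only when stdout is a terminal.
    static bool use_color(color_mode m) {
        switch (m) {
            case color_mode::on:  return true;
            case color_mode::off: return false;
            default:              return isatty(fileno(stdout)) != 0;
        }
    }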

macOS/iOS:

Linux:

Windows:

b7311

07 Dec 02:36
d9e03db

sycl: add missing BF16 conversion support for Intel oneAPI (#17780)

  • sycl: add missing BF16 conversion support for Intel oneAPI

  • Fix Line 645: Trailing whitespace
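
BF16 stores the upper 16 bits of an IEEE-754 binary32 value, so widening BF16 to FP32 amounts to shifting the bits into the high half of a float's bit pattern, and narrowing is the reverse with rounding. A generic C++ illustration of the conversion, not the SYCL/oneAPI code path touched here:

    #include <cstdint>
    #include <cstring>

    // BF16 -> FP32: the 16 BF16 bits become the high 16 bits of the float.
    static float bf16_to_f32(std::uint16_t h) {
        std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    // FP32 -> BF16 with round-to-nearest-even (NaN payloads are not special-cased).
    static std::uint16_t f32_to_bf16(float f) {
        std::uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));
        bits += 0x7FFFu + ((bits >> 16) & 1u);
        return static_cast<std::uint16_t>(bits >> 16);
    }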

macOS/iOS:

Linux:

Windows:

b7310

06 Dec 21:38
db97837

vulkan: perf_logger improvements (#17672)

  • vulkan: perf_logger improvements
  • Move perf_logger from device to ctx.
  • Add an env var to control how often the stats are dumped. If you set a very
    large value, it just dumps when the ctx is destroyed (see the sketch after this list).
  • Add a fusion info string to the tracking, only log one item per fused op.
  • Fix MUL_MAT_ID flops calculation.
  • fix vector sizes
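
The dump-interval control mentioned above follows a common pattern: read an interval from an environment variable, flush every N recorded ops, and otherwise flush only when the owning context is destroyed. A sketch under assumed names (the env var and struct below are hypothetical, not the ones used by the Vulkan backend):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    // Hypothetical sketch of an env-var-controlled stats dump interval.
    struct perf_logger_sketch {
        std::uint64_t dump_every = UINT64_MAX; // huge default: dump only on destroy
        std::uint64_t n_records  = 0;

        perf_logger_sketch() {
            if (const char * s = std::getenv("EXAMPLE_PERF_DUMP_INTERVAL")) {
                dump_every = std::strtoull(s, nullptr, 10);
            }
        }
        void record() {
            n_records++;
            if (dump_every > 0 && n_records % dump_every == 0) {
                dump();
            }
        }
        void dump() const {
            std::printf("perf: %llu ops recorded\n", (unsigned long long) n_records);
        }
        ~perf_logger_sketch() { dump(); } // the ctx owning the logger dumps on teardown
    };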

macOS/iOS:

Linux:

Windows:

b7307

06 Dec 19:30
09c7c50

ggml : add circular tiling support to pad, for Vulkan, CUDA, and CPU (used for making seamless textures) (#16985)

  • Feat: Added vulkan circular tiling support

  • Feat: Added cpu circular tiling support

  • Feat: Added cuda kernels

  • Added tests

  • Added tests

  • Removed non-pad operations

  • Removed unneeded changes

  • removed backend non pad tests

  • Update test-backend-ops.cpp

  • Fixed comment on pad test

  • removed trailing whitespace

  • Removed unneeded test in test-backend-ops

  • Removed the removed test from its call sites

  • Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp

Co-authored-by: Ruben Ortlam [email protected]

  • Fixed alignment

  • Formatting

Co-authored-by: Aman Gupta [email protected]

  • Format pad

  • Format

  • Clang format

  • format

  • format

  • don't change so much stuff

  • clang format and update to bool

  • fix duplicates

  • don't need to fix the padding

  • make circular bool

  • duplicate again

  • rename vulkan to wrap around

  • Don't need indent

  • moved to const expr

  • removed unneeded extra line break

  • More readable method calls

  • Minor wording changes

  • Added final newline

  • Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov [email protected]

  • Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov [email protected]

  • Added circular pad ext tests

  • Gate non-circular pad devices

  • Cleaned gating of non-circular pad devices


Co-authored-by: Phylliida [email protected]
Co-authored-by: Ruben Ortlam [email protected]
Co-authored-by: Aman Gupta [email protected]
Co-authored-by: Georgi Gerganov [email protected]
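
Circular (wrap-around) padding maps out-of-range destination coordinates back into the source modulo its extent, which is what makes the padded result tileable for seamless textures. A standalone 1-D C++ illustration of that indexing idea, not the actual ggml/Vulkan/CUDA kernels added by this PR:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Wrap an index into [0, n) so reads past either edge come from the opposite side.
    static std::int64_t wrap_index(std::int64_t i, std::int64_t n) {
        std::int64_t m = i % n;
        return m < 0 ? m + n : m;
    }

    // Pad a 1-D source to length n + pad_left + pad_right with circular tiling.
    static std::vector<float> pad_circular_1d(const std::vector<float> & src,
                                              std::int64_t pad_left,
                                              std::int64_t pad_right) {
        const std::int64_t n = (std::int64_t) src.size();
        std::vector<float> dst((std::size_t)(n + pad_left + pad_right));
        for (std::int64_t i = 0; i < (std::int64_t) dst.size(); ++i) {
            dst[(std::size_t) i] = src[(std::size_t) wrap_index(i - pad_left, n)];
        }
        return dst;
    }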

macOS/iOS:

Linux:

Windows:

b7306

06 Dec 15:08
f334b79

HIP: fix RDNA3 FP16/BF16 matrix multiplication (#17817)

macOS/iOS:

Linux:

Windows:

b7302

06 Dec 13:50
7b43f55

ggml : improve error handling for search path existence checks (#17653)

  • Improve error handling for search path existence checks

Refactor existence checks for search paths using std::error_code to handle potential errors.

  • Improve cache file existence check with error code

Update fs::exists to use std::error_code for error handling.

  • Simplify existence check for search paths

  • Fix logging path in error message for posix_stat

  • Update ggml/src/ggml-backend-reg.cpp

Co-authored-by: Aman Gupta [email protected]

  • Adapt to the coding standard

Co-authored-by: Aman Gupta [email protected]
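
std::filesystem::exists has a non-throwing overload that reports failures through a std::error_code, which is the mechanism the refactor above uses for the search-path and cache-file checks. A minimal generic C++ sketch of that usage (not the exact ggml-backend-reg.cpp code):

    #include <cstdio>
    #include <filesystem>
    #include <system_error>
    #include <vector>

    namespace fs = std::filesystem;

    // Keep only paths whose existence can be confirmed; a failed check is logged
    // and skipped instead of throwing.
    static std::vector<fs::path> filter_existing(const std::vector<fs::path> & search_paths) {
        std::vector<fs::path> out;
        for (const auto & p : search_paths) {
            std::error_code ec;
            if (fs::exists(p, ec)) {
                out.push_back(p);
            } else if (ec) {
                std::fprintf(stderr, "skipping %s: %s\n",
                             p.string().c_str(), ec.message().c_str());
            }
        }
        return out;
    }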

macOS/iOS:

Linux:

Windows:

b7301

06 Dec 13:16
444f00b

llama : remove quantization sanity check (#17788)

  • llama : remove quantization sanity check

This commit removes the quantization sanity check for attention layers.

The motivation for this is that there are hybrid models that have recurrent layers,
expert layers, and attention layers. For these models the current check fails,
because the expert layers are not taken into account. After consideration, it was
decided that this check is not strictly necessary and can be removed to allow for
more flexible model architectures.

  • llama : remove unused pruned_attention_w and is_clip_model vars

macOS/iOS:

Linux:

Windows: