metal: enable ggml-alloc #2627
Conversation
Make ggml-alloc work with concurrent dispatch.
9167c84 to 9b9905f (Compare)
slaren left a comment
I cannot test it, but looks good.
Co-authored-by: slaren <[email protected]>
Based on what I'm seeing, it fails when BLAS / Accelerate is enabled.
NVM, without Metal it fails because of this. With Metal enabled, it should work, I think. However, I get this error:
$ LLAMA_METAL=1 make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -ngl 1 -gqa 8
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
make: Nothing to be done for `default'.
main: build = 992 (0919a0f)
main: seed = 1692218180
llama.cpp: loading model from models/70B-v2/ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 4096
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 8
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 8
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 28672
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 37070.96 MB (+ 160.00 MB per state)
llama_new_context_with_model: kv self size = 160.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x130006dd0
ggml_metal_init: loaded kernel_add_row 0x130007510
ggml_metal_init: loaded kernel_mul 0x130007a50
ggml_metal_init: loaded kernel_mul_row 0x1300080a0
ggml_metal_init: loaded kernel_scale 0x1300085e0
ggml_metal_init: loaded kernel_silu 0x130008b20
ggml_metal_init: loaded kernel_relu 0x130009060
ggml_metal_init: loaded kernel_gelu 0x1300095a0
ggml_metal_init: loaded kernel_soft_max 0x130009c70
ggml_metal_init: loaded kernel_diag_mask_inf 0x13000a2f0
ggml_metal_init: loaded kernel_get_rows_f16 0x13000a9c0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x13000b200
ggml_metal_init: loaded kernel_get_rows_q4_1 0x13000b8d0
ggml_metal_init: loaded kernel_get_rows_q2_K 0x13000bfa0
ggml_metal_init: loaded kernel_get_rows_q3_K 0x13000c670
ggml_metal_init: loaded kernel_get_rows_q4_K 0x13000cd40
ggml_metal_init: loaded kernel_get_rows_q5_K 0x13000d410
ggml_metal_init: loaded kernel_get_rows_q6_K 0x13000dae0
ggml_metal_init: loaded kernel_rms_norm 0x13000e1c0
ggml_metal_init: loaded kernel_norm 0x13000ea00
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x13000f2d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x13000fa50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x1300101d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32 0x130010ad0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32 0x130011250
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32 0x1300119d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32 0x130012150
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x130012ad0
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x1300134f0
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32 0x130013cb0
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32 0x130014470
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32 0x130014c30
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32 0x130015170
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32 0x130015930
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32 0x1300160f0
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32 0x1300168b0
ggml_metal_init: loaded kernel_rope 0x130016df0
ggml_metal_init: loaded kernel_alibi_f32 0x1300176d0
ggml_metal_init: loaded kernel_cpy_f32_f16 0x130017f80
ggml_metal_init: loaded kernel_cpy_f32_f32 0x130018830
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1300190e0
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: compute buffer total size = 145.35 MB
llama_new_context_with_model: max tensor size = 205.08 MB
ggml_metal_add_buffer: allocated 'data ' buffer, size = 37071.20 MB, (37071.64 / 147456.00)
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 1.36 MB, (37073.00 / 147456.00)
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 162.00 MB, (37235.00 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc ' buffer, size = 144.02 MB, (37379.02 / 147456.00)
system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
perplexity: calculating perplexity over 655 chunks, batch_size=512
ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 41943072)
GGML_ASSERT: ggml-alloc.c:133: !"not enough space in the buffer"
Abort trap: 6
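For context, here is a minimal sketch of the kind of first-fit free-block search that produces this assertion. It is an illustration of the failure mode only, not the actual ggml-alloc code; the structure and function names are made up for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* illustrative free-block record; ggml-alloc tracks something similar */
struct free_block { void * addr; size_t size; };

static void * alloc_tensor_data(struct free_block * blocks, int n_blocks, size_t size) {
    size_t max_avail = 0;
    for (int i = 0; i < n_blocks; i++) {
        if (blocks[i].size > max_avail) {
            max_avail = blocks[i].size;
        }
        if (blocks[i].size >= size) {
            /* carve the allocation out of the first block that fits */
            void * addr    = blocks[i].addr;
            blocks[i].addr = (char *) blocks[i].addr + size;
            blocks[i].size -= size;
            return addr;
        }
    }
    /* no free block can hold the tensor -> the message seen in the log above */
    fprintf(stderr, "not enough space in the buffer (needed %zu, largest block available %zu)\n",
            size, max_avail);
    assert(!"not enough space in the buffer");
    return NULL;
}
```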
This error is very surprising to me because the graph that the allocator uses to measure the buffer size should be identical to the graph used by the perplexity tool.
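To make that expectation concrete, here is a hedged sketch of the measure-then-allocate pattern under discussion, assuming the ggml-alloc API of this period (ggml_allocr_new_measure, ggml_allocr_alloc_graph, ggml_allocr_new); build_graph is a hypothetical stand-in for the graph builder in llama.cpp:

```c
#include "ggml.h"
#include "ggml-alloc.h"

// measurement pass: a dummy allocator walks the graph and returns the
// worst-case compute buffer size without touching real memory
static size_t measure_compute_buffer(struct ggml_cgraph * (*build_graph)(void)) {
    struct ggml_allocr * meas = ggml_allocr_new_measure(/*alignment =*/ 32);
    struct ggml_cgraph * gf   = build_graph();
    size_t size = ggml_allocr_alloc_graph(meas, gf);
    ggml_allocr_free(meas);
    return size;
}

// evaluation pass: the same builder is used against a real buffer of the
// measured size, e.g.
//   struct ggml_allocr * alloc = ggml_allocr_new(buf_data, buf_size, 32);
//   ggml_allocr_reset(alloc);
//   ggml_allocr_alloc_graph(alloc, build_graph());
// if the two graphs differ (shapes, batch size, extra nodes), the measured
// size can be too small and the assertion above fires
```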
On second thought, I don't think this could cause the issue. This could be diagnosed by defining the allocator's debug macro.
There is a possibility that this could happen if the addresses of the weights are within this range:
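A hedged illustration of that hypothesis follows (the names and the range test are my own, not the actual ggml-alloc internals): if tensor ownership is decided purely by an address-range check, a weight allocated elsewhere whose address happens to fall inside the allocator's range would be misclassified and its space could be reused:

```c
#include <stdbool.h>
#include <stddef.h>

// simplified view of an allocator's managed region
struct allocr_view { char * data; size_t size; };

// ownership decided purely by address range
static bool ptr_owned_by_allocr(const struct allocr_view * a, const void * p) {
    return (const char *) p >= a->data && (const char *) p < a->data + a->size;
}

// If a->data is a synthetic "measure" base address, an unrelated weight tensor
// can pass this test by coincidence and be treated as allocator-owned, which
// appears to be the scenario addressed in #2639.
```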
@ggerganov @slaren Fixed in #2639. Strange that it didn't trigger before. Anyway, our allocator is now more robust.
Nice. Unrelated to that, I think that the
Removing this line has no effect on token generation. Btw, may I ask what kinds of structs are stored there?
This is the buffer used by the
Yes, it should be removed. I was initially thinking about using it for storing temporary results (e.g. quantized versions of F32 intermediate tensors), but it seems this will be made obsolete by ggml-org/ggml#455.
Ahh, thank you for your explanation!
Make ggml-alloc compatible with concurrent dispatch.
Tested on M1 Max with n_ctx=512, n_batch=512.
Memory usage of compute buffer:
Token generation:
The reduced memory usage comes from #2411, and the inference speedup comes from #2358.
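As a rough sketch of how the two pieces interact (assuming ggml_allocr_set_parse_seq, the hook this change adds to ggml-alloc; concur_list and concur_list_len stand in for the output of the Metal backend's concurrency analysis from #2358, and the exact signatures are best-effort from this period):

```c
#include "ggml.h"
#include "ggml-alloc.h"

// Tell the allocator to plan tensor memory in the order the Metal backend will
// actually dispatch the nodes, rather than in graph build order; otherwise two
// nodes placed in the same concurrency group could be handed overlapping memory.
static void set_alloc_order(struct ggml_allocr * alloc,
                            const int * concur_list, int concur_list_len) {
    if (concur_list_len > 0) {
        ggml_allocr_set_parse_seq(alloc, concur_list, concur_list_len);
    }
}
```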