Conversation

@lshzh-ww
Contributor

@lshzh-ww commented Aug 16, 2023

Make ggml-alloc compatible with concurrent dispatch.

Tested on M1 Max with n_ctx=512, n_batch=512.
Memory usage of compute buffer:

| Model | Master | PR |
| --- | --- | --- |
| 33B Q4_0 | 472MB | 117MB |

Token generation:

| Model | Master | PR |
| --- | --- | --- |
| 33B Q4_0 | 65.4ms/tok | 62.5ms/tok |

The reduced memory usage comes from #2411, and the inference speedup comes from #2358.

@lshzh-ww requested a review from slaren August 16, 2023 01:30
Make ggml-alloc work with concurrent dispatch.
@lshzh-ww force-pushed the metal-memory-alloc branch from 9167c84 to 9b9905f on August 16, 2023 01:35
Member

@slaren left a comment


I cannot test it, but it looks good.

Co-authored-by: slaren <[email protected]>
@ggerganov mentioned this pull request Aug 16, 2023
@ggerganov merged commit fc8ef54 into ggml-org:master Aug 16, 2023
@ggerganov
Member

ggerganov commented Aug 16, 2023

The perplexity tool fails for the 70B model, both with and without Metal:

make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -gqa 8

Based on what I'm seeing, it fails when BLAS / Accelerate is enabled.
It seems to be related to ggml-alloc, but I'm not 100% sure. I will try to debug tomorrow and trace at which point it started failing.

@ggerganov
Member

NVM, without Metal it fails because of this:

https://github.com/ggerganov/llama.cpp/blob/0919a0f73d95cfb93a1646a1d1741a0615fe2c5e/ggml.c#L10660-L10662

With Metal enabled, however, I think it should work. Instead, I get this error:

LLAMA_METAL=1 make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -ngl 1 -gqa 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 992 (0919a0f)
main: seed  = 1692218180
llama.cpp: loading model from models/70B-v2/ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 37070.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x130006dd0
ggml_metal_init: loaded kernel_add_row                        0x130007510
ggml_metal_init: loaded kernel_mul                            0x130007a50
ggml_metal_init: loaded kernel_mul_row                        0x1300080a0
ggml_metal_init: loaded kernel_scale                          0x1300085e0
ggml_metal_init: loaded kernel_silu                           0x130008b20
ggml_metal_init: loaded kernel_relu                           0x130009060
ggml_metal_init: loaded kernel_gelu                           0x1300095a0
ggml_metal_init: loaded kernel_soft_max                       0x130009c70
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13000a2f0
ggml_metal_init: loaded kernel_get_rows_f16                   0x13000a9c0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x13000b200
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13000b8d0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13000bfa0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13000c670
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13000cd40
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13000d410
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13000dae0
ggml_metal_init: loaded kernel_rms_norm                       0x13000e1c0
ggml_metal_init: loaded kernel_norm                           0x13000ea00
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13000f2d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13000fa50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1300101d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x130010ad0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x130011250
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1300119d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x130012150
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x130012ad0
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x1300134f0
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x130013cb0
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x130014470
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x130014c30
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x130015170
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x130015930
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x1300160f0
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x1300168b0
ggml_metal_init: loaded kernel_rope                           0x130016df0
ggml_metal_init: loaded kernel_alibi_f32                      0x1300176d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x130017f80
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x130018830
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x1300190e0
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: compute buffer total size =  145.35 MB
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 37071.20 MB, (37071.64 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.36 MB, (37073.00 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (37235.00 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   144.02 MB, (37379.02 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity: calculating perplexity over 655 chunks, batch_size=512
ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 41943072)
GGML_ASSERT: ggml-alloc.c:133: !"not enough space in the buffer"
Abort trap: 6

@slaren
Member

slaren commented Aug 16, 2023

This error is very surprising to me because the graph that the allocator uses to measure the buffer size should be identical to the graph used by the perplexity tool (n_past=0, n_tokens=512), so the allocations should be identical in both cases.
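
For context, a rough sketch of the measure-then-allocate pattern being discussed, under the assumed shape of the ggml-alloc API at this revision (simplified, not the exact llama.cpp code):

```c
// Measure pass: build the same graph shape the perplexity tool will use
// (n_tokens = 512, n_past = 0) against a "measure" allocator that only
// records the peak amount of memory the graph would need.
struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment);
struct ggml_cgraph * gf   = llama_build_graph(/* n_tokens = 512, n_past = 0 */);
size_t buf_size = ggml_allocr_alloc_graph(alloc, gf);
ggml_allocr_free(alloc);

// Real pass: create the allocator over a buffer of exactly buf_size and
// allocate the same graph shape again. Since both passes see an identical
// graph, the second pass should never run out of space.
alloc = ggml_allocr_new(buf_alloc_base, buf_size, tensor_alignment);
ggml_allocr_alloc_graph(alloc, llama_build_graph(/* same shape */));
```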

@slaren
Member

slaren commented Aug 16, 2023

Ok, there is one possible difference. llama_build_graph has some calls to ggml_allocr_alloc that should use the Metal parse_seq. However, by the time the measure graph is built, parse_seq is still not set, so the measure graph may need to be rebuilt after calling ggml_allocr_set_parse_seq.

On second thought, I don't think this could cause this issue.

This could be diagnosed by defining GGML_ALLOCATOR_DEBUG and AT_PRINTF to see the trace of allocations, and comparing the trace of the measure graph and the graph where this error happens.
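
For reference, enabling that trace is roughly a matter of flipping two defines near the top of ggml-alloc.c (the exact default definitions may differ in this revision):

```c
// In ggml-alloc.c: turn on the extra bookkeeping checks and make the
// per-allocation trace print, then rebuild and diff the output of the
// measure graph against the output of the graph that hits the assert.
#define GGML_ALLOCATOR_DEBUG
#define AT_PRINTF printf   // instead of the no-op default
```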

@slaren
Member

slaren commented Aug 16, 2023

There is a possibility that this could happen if the addresses of the weights are within this range:
https://github.com/ggerganov/llama.cpp/blob/13a746c6b963ab13818f4d401d5de9966f5af1f5/ggml-alloc.c#L278-L279
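
A hypothetical illustration (names are approximate, not the actual ggml-alloc code) of why that range matters: ownership of a tensor is decided by a pointer range test, and during a measure pass the allocator hands out fake addresses from a fixed base.

```c
// If a real weight's address happens to fall inside the fake measure range,
// a check like this would wrongly classify it as an allocator-owned tensor,
// and its space could be freed and reused, producing a spurious
// "not enough space in the buffer" error later in the real pass.
static bool allocr_owns(const struct ggml_allocr * alloc, const void * ptr) {
    return ptr >= alloc->data &&
           (const char *) ptr < (const char *) alloc->data + alloc->size;
}
```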

@lshzh-ww
Contributor Author

lshzh-ww commented Aug 17, 2023

@ggerganov @slaren fixed in #2639

Strange that it didn't trigger before. Anyway, now our allocator is more robust.

AT_PRINTF helped to locate this.

@slaren
Member

slaren commented Aug 17, 2023

Nice. Unrelated to that, I think that the buf_compute could be removed from Metal, since it is not used to store tensor data, only the structs.
https://github.com/ggerganov/llama.cpp/blob/13a746c6b963ab13818f4d401d5de9966f5af1f5/llama.cpp#L3359

@lshzh-ww
Contributor Author

lshzh-ww commented Aug 17, 2023

Removing this line has no effect on token generation. Btw, may I ask what kinds of structs are stored there?

@slaren
Member

slaren commented Aug 17, 2023

This is the buffer used by the ggml_context to build the graph, and it is used to allocate the ggml_tensor (minus the data) and the ggml_cgraph structs. These structs are only accessed from the CPU, so it shouldn't be necessary to add it to Metal.
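
A rough sketch (not the exact llama.cpp code) of how buf_compute is used, illustrating why the GPU never needs to see it:

```c
// The graph-building context is created over buf_compute with no_alloc set,
// so ggml only writes tensor/graph *structs* (metadata) into this buffer;
// the tensors' data pointers are filled in later by ggml-alloc or point at
// the model weights, which live in their own Metal buffers.
struct ggml_init_params params = {
    /*.mem_size   =*/ buf_compute.size,
    /*.mem_buffer =*/ buf_compute.addr,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx0 = ggml_init(params);
// ... build the graph with ggml_* ops; only the CPU reads these structs,
// so buf_compute does not need to be registered with ggml_metal_add_buffer.
```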

@ggerganov
Member

Yes, it should be removed. I was initially thinking about using it for storing temporary results (e.g. quantized versions of F32 intermediate tensors), but it seems this will be obsoleted by ggml-org/ggml#455.
So we can safely remove this.

@lshzh-ww
Contributor Author

Ahh, thank you for your explanation!
