Conversation

@lshzh-ww
Contributor

@lshzh-ww commented Aug 16, 2023

Make ggml-alloc compatible with concurrent dispatch.

Tested on M1 Max with n_ctx=512, n_batch=512.
Memory usage of compute buffer:

| Model | Master | PR |
| --- | --- | --- |
| 33B Q4_0 | 472MB | 117MB |

Token generation:

| Model | Master | PR |
| --- | --- | --- |
| 33B Q4_0 | 65.4ms/tok | 62.5ms/tok |

The reduced memory usage comes from #2411, and the inference speedup comes from #2358.

@lshzh-ww requested a review from slaren August 16, 2023 01:30
Make ggml-alloc work with concurrent dispatch.
@lshzh-ww force-pushed the metal-memory-alloc branch from 9167c84 to 9b9905f on August 16, 2023 01:35
Member

@slaren left a comment


I cannot test it, but it looks good.

Co-authored-by: slaren <[email protected]>
@ggerganov mentioned this pull request Aug 16, 2023
@ggerganov merged commit fc8ef54 into ggml-org:master Aug 16, 2023
@ggerganov
Member

ggerganov commented Aug 16, 2023

The perplexity tool fails for the 70B model, both with and without Metal:

make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -gqa 8

Based on what I'm seeing, it fails when BLAS / Accelerate is enabled.
It seems to be related to ggml-alloc, but I'm not 100% sure. I will try to debug tomorrow and trace at which point it started failing.

@ggerganov
Member

NVM, without Metal it fails because of this:

https://github.com/ggerganov/llama.cpp/blob/0919a0f73d95cfb93a1646a1d1741a0615fe2c5e/ggml.c#L10660-L10662

With Metal enabled, however, I think it should work. Instead, I get this error:

LLAMA_METAL=1 make -j && ./perplexity -m models/70B-v2/ggml-model-q4_0.bin -f build/wikitext-2-raw/wiki.test.raw -ngl 1 -gqa 8
I llama.cpp build info: 
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_METAL
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

make: Nothing to be done for `default'.
main: build = 992 (0919a0f)
main: seed  = 1692218180
llama.cpp: loading model from models/70B-v2/ggml-model-q4_0.bin
llama_model_load_internal: warning: assuming 70B model based on GQA == 8
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 4096
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_head_kv  = 8
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 8
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 28672
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 70B
llama_model_load_internal: ggml ctx size =    0.21 MB
llama_model_load_internal: mem required  = 37070.96 MB (+  160.00 MB per state)
llama_new_context_with_model: kv self size  =  160.00 MB
ggml_metal_init: allocating
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                            0x130006dd0
ggml_metal_init: loaded kernel_add_row                        0x130007510
ggml_metal_init: loaded kernel_mul                            0x130007a50
ggml_metal_init: loaded kernel_mul_row                        0x1300080a0
ggml_metal_init: loaded kernel_scale                          0x1300085e0
ggml_metal_init: loaded kernel_silu                           0x130008b20
ggml_metal_init: loaded kernel_relu                           0x130009060
ggml_metal_init: loaded kernel_gelu                           0x1300095a0
ggml_metal_init: loaded kernel_soft_max                       0x130009c70
ggml_metal_init: loaded kernel_diag_mask_inf                  0x13000a2f0
ggml_metal_init: loaded kernel_get_rows_f16                   0x13000a9c0
ggml_metal_init: loaded kernel_get_rows_q4_0                  0x13000b200
ggml_metal_init: loaded kernel_get_rows_q4_1                  0x13000b8d0
ggml_metal_init: loaded kernel_get_rows_q2_K                  0x13000bfa0
ggml_metal_init: loaded kernel_get_rows_q3_K                  0x13000c670
ggml_metal_init: loaded kernel_get_rows_q4_K                  0x13000cd40
ggml_metal_init: loaded kernel_get_rows_q5_K                  0x13000d410
ggml_metal_init: loaded kernel_get_rows_q6_K                  0x13000dae0
ggml_metal_init: loaded kernel_rms_norm                       0x13000e1c0
ggml_metal_init: loaded kernel_norm                           0x13000ea00
ggml_metal_init: loaded kernel_mul_mat_f16_f32                0x13000f2d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32               0x13000fa50
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32               0x1300101d0
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32               0x130010ad0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32               0x130011250
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32               0x1300119d0
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32               0x130012150
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32               0x130012ad0
ggml_metal_init: loaded kernel_mul_mm_f16_f32                 0x1300134f0
ggml_metal_init: loaded kernel_mul_mm_q4_0_f32                0x130013cb0
ggml_metal_init: loaded kernel_mul_mm_q4_1_f32                0x130014470
ggml_metal_init: loaded kernel_mul_mm_q2_K_f32                0x130014c30
ggml_metal_init: loaded kernel_mul_mm_q3_K_f32                0x130015170
ggml_metal_init: loaded kernel_mul_mm_q4_K_f32                0x130015930
ggml_metal_init: loaded kernel_mul_mm_q5_K_f32                0x1300160f0
ggml_metal_init: loaded kernel_mul_mm_q6_K_f32                0x1300168b0
ggml_metal_init: loaded kernel_rope                           0x130016df0
ggml_metal_init: loaded kernel_alibi_f32                      0x1300176d0
ggml_metal_init: loaded kernel_cpy_f32_f16                    0x130017f80
ggml_metal_init: loaded kernel_cpy_f32_f32                    0x130018830
ggml_metal_init: loaded kernel_cpy_f16_f16                    0x1300190e0
ggml_metal_init: recommendedMaxWorkingSetSize = 147456.00 MB
ggml_metal_init: hasUnifiedMemory             = true
ggml_metal_init: maxTransferRate              = built-in GPU
llama_new_context_with_model: compute buffer total size =  145.35 MB
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size = 37071.20 MB, (37071.64 / 147456.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =     1.36 MB, (37073.00 / 147456.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   162.00 MB, (37235.00 / 147456.00)
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =   144.02 MB, (37379.02 / 147456.00)

system_info: n_threads = 16 / 24 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | 
perplexity: calculating perplexity over 655 chunks, batch_size=512
ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 41943072)
GGML_ASSERT: ggml-alloc.c:133: !"not enough space in the buffer"
Abort trap: 6

@slaren
Member

slaren commented Aug 16, 2023

This error is very surprising to me because the graph that the allocator uses to measure the buffer size should be identical to the graph used by the perplexity tool (n_past=0, n_tokens=512), so the allocations should be identical in both cases.
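
For context, a rough sketch of the measure-then-allocate pattern being discussed, under the assumed shape of the ggml-alloc API at this revision (simplified, not the exact llama.cpp code):

```c
// Measure pass: build the same graph shape the perplexity tool will use
// (n_tokens = 512, n_past = 0) against a "measure" allocator that only
// records the peak amount of memory the graph would need.
struct ggml_allocr * alloc = ggml_allocr_new_measure(tensor_alignment);
struct ggml_cgraph * gf   = llama_build_graph(/* n_tokens = 512, n_past = 0 */);
size_t buf_size = ggml_allocr_alloc_graph(alloc, gf);
ggml_allocr_free(alloc);

// Real pass: create the allocator over a buffer of exactly buf_size and
// allocate the same graph shape again. Since both passes see an identical
// graph, the second pass should never run out of space.
alloc = ggml_allocr_new(buf_alloc_base, buf_size, tensor_alignment);
ggml_allocr_alloc_graph(alloc, llama_build_graph(/* same shape */));
```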

@slaren
Member

slaren commented Aug 16, 2023

Ok, there is one possible difference. llama_build_graph has some calls to ggml_allocr_alloc that should use the Metal parse_seq. However, by the time the measure graph is built, parse_seq is still not set, so the measure graph may need to be rebuilt after calling ggml_allocr_set_parse_seq.

On second thought, I don't think this could cause this issue.

This could be diagnosed by defining GGML_ALLOCATOR_DEBUG and AT_PRINTF to see the trace of allocations, and comparing the trace of the measure graph and the graph where this error happens.
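
For reference, enabling that trace is roughly a matter of flipping two defines near the top of ggml-alloc.c (the exact default definitions may differ in this revision):

```c
// In ggml-alloc.c: turn on the extra bookkeeping checks and make the
// per-allocation trace print, then rebuild and diff the output of the
// measure graph against the output of the graph that hits the assert.
#define GGML_ALLOCATOR_DEBUG
#define AT_PRINTF printf   // instead of the no-op default
```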

@slaren
Member

slaren commented Aug 16, 2023

There is a possibility that this could happen if the addresses of the weights are within this range:
https://github.com/ggerganov/llama.cpp/blob/13a746c6b963ab13818f4d401d5de9966f5af1f5/ggml-alloc.c#L278-L279
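
A hypothetical illustration (names are approximate, not the actual ggml-alloc code) of why that range matters: ownership of a tensor is decided by a pointer range test, and during a measure pass the allocator hands out fake addresses from a fixed base.

```c
// If a real weight's address happens to fall inside the fake measure range,
// a check like this would wrongly classify it as an allocator-owned tensor,
// and its space could be freed and reused, producing a spurious
// "not enough space in the buffer" error later in the real pass.
static bool allocr_owns(const struct ggml_allocr * alloc, const void * ptr) {
    return ptr >= alloc->data &&
           (const char *) ptr < (const char *) alloc->data + alloc->size;
}
```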

@lshzh-ww
Contributor Author

lshzh-ww commented Aug 17, 2023

@ggerganov @slaren fixed in #2639

Strange that it didn't trigger before. Anyway, now our allocator is more robust.

AT_PRINTF helped to locate this.

@slaren
Member

slaren commented Aug 17, 2023

Nice. Unrelated to that, I think that the buf_compute could be removed from Metal, since it is not used to store tensor data, only the structs.
https://github.com/ggerganov/llama.cpp/blob/13a746c6b963ab13818f4d401d5de9966f5af1f5/llama.cpp#L3359

@lshzh-ww
Contributor Author

lshzh-ww commented Aug 17, 2023

Removing this line has no effect on token generation. Btw, may I ask what kinds of structs are stored there?

@slaren
Member

slaren commented Aug 17, 2023

This is the buffer used by the ggml_context to build the graph, and it is used to allocate the ggml_tensor (minus the data) and the ggml_cgraph structs. These structs are only accessed from the CPU, so it shouldn't be necessary to add it to Metal.
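
A rough sketch (not the exact llama.cpp code) of how buf_compute is used, illustrating why the GPU never needs to see it:

```c
// The graph-building context is created over buf_compute with no_alloc set,
// so ggml only writes tensor/graph *structs* (metadata) into this buffer;
// the tensors' data pointers are filled in later by ggml-alloc or point at
// the model weights, which live in their own Metal buffers.
struct ggml_init_params params = {
    /*.mem_size   =*/ buf_compute.size,
    /*.mem_buffer =*/ buf_compute.addr,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx0 = ggml_init(params);
// ... build the graph with ggml_* ops; only the CPU reads these structs,
// so buf_compute does not need to be registered with ggml_metal_add_buffer.
```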

@ggerganov
Member

Yes, it should be removed. I was initially thinking about using it for storing temporary results (e.g. quantized versions of F32 intermediate tensors), but it seems this will be obsoleted by ggml-org/ggml#455.
So we can safely remove this.

@lshzh-ww
Contributor Author

Ahh, thank you for your explanation!
