
Conversation

@ggerganov (Member) commented Jul 24, 2023

ref: ggml-org/ggml#293

See if #2355 can be resolved, and also make sure we broadcast correctly across dims 2 and 3.

Also, I made a first attempt at block-tiled matrix multiplication, which seems to improve performance slightly.
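For readers unfamiliar with the idea, here is a minimal sketch of block-tiled matrix multiplication in plain C. This is not the ggml implementation; the tile sizes `BLK_0`/`BLK_1` are illustrative placeholders, and B is stored column-major to mirror how a `vec_dot`-style kernel consumes one contiguous src1 column at a time:

```c
#include <stddef.h>

#define BLK_0 16  // rows of A per tile (illustrative)
#define BLK_1 16  // columns of B per tile (illustrative)

// A: M x K, row-major. B: K x N, column-major (each column contiguous).
// C: M x N, row-major. Each C[i][j] is a full dot product of one A row
// with one B column; tiling keeps the BLK_0 rows and BLK_1 columns of the
// current tile hot in cache while all of their dot products are computed.
static void gemm_tiled(int M, int N, int K,
                       const float *A, const float *B, float *C) {
    for (int j0 = 0; j0 < N; j0 += BLK_1) {
        for (int i0 = 0; i0 < M; i0 += BLK_0) {
            for (int j = j0; j < j0 + BLK_1 && j < N; ++j) {
                const float *b_col = B + (size_t) j*K;
                for (int i = i0; i < i0 + BLK_0 && i < M; ++i) {
                    const float *a_row = A + (size_t) i*K;
                    float sum = 0.0f;
                    for (int k = 0; k < K; ++k) {
                        sum += a_row[k] * b_col[k]; // "vec_dot" of row i with column j
                    }
                    C[(size_t) i*N + j] = sum;
                }
            }
        }
    }
}
```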

Currently, my tests on x86 show that OpenBLAS offers no advantage over the non-BLAS path, for either prompt processing or text generation.

Looking for reports where this is not the case. Make sure to use a thread count equal to the number of performance cores on your system.

TODO:

@ggerganov (Member, Author)

@coadmonky

Can you check if this branch makes any performance difference on your system?

@coadmonky commented Jul 24, 2023 via email

@yukiteruamano (Contributor)

Hi, I have been following this issue too, and these are my results:

master branch

llama_print_timings:        load time =  1227,29 ms
llama_print_timings:      sample time =    83,16 ms /   136 runs   (    0,61 ms per token,  1635,36 tokens per second)
llama_print_timings: prompt eval time = 11299,55 ms /    53 tokens (  213,20 ms per token,     4,69 tokens per second)
llama_print_timings:        eval time = 30410,48 ms /   136 runs   (  223,61 ms per token,     4,47 tokens per second)
llama_print_timings:       total time = 162198,49 ms

mul-mat-tweaks branch

llama_print_timings:        load time =  1230,44 ms
llama_print_timings:      sample time =   141,66 ms /   250 runs   (    0,57 ms per token,  1764,84 tokens per second)
llama_print_timings: prompt eval time =  9965,68 ms /    53 tokens (  188,03 ms per token,     5,32 tokens per second)
llama_print_timings:        eval time = 56549,68 ms /   250 runs   (  226,20 ms per token,     4,42 tokens per second)
llama_print_timings:       total time = 128116,71 ms

The values are consistent across multiple runs.

@ggerganov (Member, Author)

@coadmonky

I believe the latest commit on this branch (2076a9b) should restore the old performance. Please let me know if you get a chance to give it a try.

@ggerganov marked this pull request as ready for review on July 25, 2023 at 10:33
//}

for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir011; ++ir0) {
vec_dot(ne00, &tmp[ir0 - iir0], src0_row + ir0*nb01, src1_col);
Contributor:

As far as I know, computing GEMM as a series of vec_dot calls is cache-unfriendly. IIRC you need a different loop order to reach high FLOPS.
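To illustrate the point, here is a minimal sketch of the classic loop-interchange trick in plain C (not ggml code; all matrices row-major, C assumed zero-initialized). In the naive i-j-k (dot-product) order, every output element walks a strided column of B; reordering to i-k-j keeps every access unit-stride and reuses each loaded element of A across a whole row of C:

```c
// i-k-j order: A[i][k] is loaded once and reused across an entire row of C,
// while B and C are both traversed with unit stride in the inner loop.
static void gemm_ikj(int M, int N, int K,
                     const float *A, const float *B, float *C) {
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            const float a_ik = A[i*K + k];
            for (int j = 0; j < N; ++j) {
                C[i*N + j] += a_ik * B[k*N + j]; // contiguous loads and stores
            }
        }
    }
}
```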

@ggerganov (Member, Author):

Open to suggestions for an alternative implementation.

I think what I achieve with these changes is to put a few rows from both tensors in the L2 cache and do the full dot products. Maybe there is something smarter that utilizes the L1 cache better?
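For a rough sense of scale, here is a back-of-the-envelope footprint calculation with illustrative numbers (ne00 = 4096 as in a 7B model, a hypothetical 16x16 tile, Q4_0 weights at 18 bytes per 32 values, f32 src1 columns); these are assumptions for illustration, not measurements from this PR:

```c
#include <stdio.h>

int main(void) {
    const int ne00   = 4096; // elements per src0 row / src1 column (illustrative)
    const int blck_0 = 16;   // src0 rows in the tile (hypothetical)
    const int blck_1 = 16;   // src1 columns in the tile (hypothetical)

    const double src0_bytes = blck_0 * (ne00 / 32.0) * 18.0; // Q4_0: 18 bytes per 32 values
    const double src1_bytes = blck_1 *  ne00         *  4.0; // f32 columns

    printf("src0 tile: %6.1f KiB\n", src0_bytes / 1024.0); // ~36 KiB
    printf("src1 tile: %6.1f KiB\n", src1_bytes / 1024.0); // ~256 KiB
    return 0;
}
```

With f32 activations the src1 side of such a tile alone is already on the order of a typical L2, so the tile would have to shrink, or the activations would need a more compact format, for the working set to fit closer to the core (if I recall correctly, ggml already quantizes src1 to 8-bit blocks before the quantized dot products, which shrinks that side by roughly 4x).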

Contributor:

Here is the best I came up with for now, without trying to rebuild OpenBLAS.

@netrunnereve (Collaborator) commented Jul 27, 2023

With this change I think we can (almost) get rid of OpenBLAS! The numbers look great 👍.

These tests were done with the 7B model on a 4-core/8-thread Xeon v2, running on 4 threads.

master

llama_print_timings:        load time =   399.63 ms
llama_print_timings:      sample time =   253.82 ms /   100 runs   (    2.54 ms per token,   393.98 tokens per second)
llama_print_timings: prompt eval time = 72117.89 ms /   401 tokens (  179.85 ms per token,     5.56 tokens per second)
llama_print_timings:        eval time = 21412.89 ms /    99 runs   (  216.29 ms per token,     4.62 tokens per second)
llama_print_timings:       total time = 93809.82 ms

master + openblas

llama_print_timings:        load time =   436.86 ms
llama_print_timings:      sample time =   253.78 ms /   100 runs   (    2.54 ms per token,   394.04 tokens per second)
llama_print_timings: prompt eval time = 60198.77 ms /   401 tokens (  150.12 ms per token,     6.66 tokens per second)
llama_print_timings:        eval time = 21452.36 ms /    99 runs   (  216.69 ms per token,     4.61 tokens per second)
llama_print_timings:       total time = 81930.22 ms

master + clblast (only for comparison's sake)

llama_print_timings:        load time =   394.55 ms
llama_print_timings:      sample time =   253.29 ms /   100 runs   (    2.53 ms per token,   394.81 tokens per second)
llama_print_timings: prompt eval time = 11543.12 ms /   401 tokens (   28.79 ms per token,    34.74 tokens per second)
llama_print_timings:        eval time = 21335.05 ms /    99 runs   (  215.51 ms per token,     4.64 tokens per second)
llama_print_timings:       total time = 33155.94 ms

mul-mat-tweaks

llama_print_timings:        load time =   400.97 ms
llama_print_timings:      sample time =   251.99 ms /   100 runs   (    2.52 ms per token,   396.85 tokens per second)
llama_print_timings: prompt eval time = 62045.15 ms /   401 tokens (  154.73 ms per token,     6.46 tokens per second)
llama_print_timings:        eval time = 21038.19 ms /    99 runs   (  212.51 ms per token,     4.71 tokens per second)
llama_print_timings:       total time = 83359.80 ms

Additionally, I get a small but consistent improvement in prompt eval times with 8 threads (keep in mind that OpenBLAS already uses 8 threads, as it ignores the --threads option). I think this is because prompt evaluation is not bottlenecked by memory the way single-token inference is (a rough estimate follows the 8-thread numbers below).

mul-mat-tweaks (8 threads)

llama_print_timings:        load time =   395.60 ms
llama_print_timings:      sample time =   251.41 ms /   100 runs   (    2.51 ms per token,   397.76 tokens per second)
llama_print_timings: prompt eval time = 56967.28 ms /   401 tokens (  142.06 ms per token,     7.04 tokens per second)
llama_print_timings:        eval time = 21124.33 ms /    99 runs   (  213.38 ms per token,     4.69 tokens per second)
llama_print_timings:       total time = 78367.45 ms
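A back-of-the-envelope sketch of that argument, with illustrative numbers (7B parameters, Q4_0 at 18 bytes per 32 weights, the 401-token prompt from the runs above); this is an assumption-laden estimate, not a measurement:

```c
#include <stdio.h>

int main(void) {
    // Illustrative arithmetic-intensity estimate for a Q4_0 7B model.
    const double n_params     = 7e9;
    const double bytes_per_w  = 18.0 / 32.0; // Q4_0: 18 bytes per 32 weights
    const double weight_bytes = n_params * bytes_per_w;

    // Generating one token reads every weight once for ~2 FLOPs (mul+add),
    // so the intensity stays tiny and memory bandwidth is the limit.
    printf("generation: %7.1f FLOPs/byte\n", 2.0 * n_params / weight_bytes);

    // A 401-token prompt processed as a batch reuses each loaded weight 401 times,
    // so the work per byte of weights is ~400x higher and the cores become the limit.
    printf("prompt:     %7.1f FLOPs/byte\n", 401.0 * 2.0 * n_params / weight_bytes);
    return 0;
}
```

At a handful of FLOPs per byte, the matrix-vector products of generation saturate memory bandwidth long before the cores do, whereas the batched prompt matmuls reuse each loaded weight hundreds of times, which is why extra threads pay off for prompt processing.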

@ggerganov (Member, Author)

I will complete the TODOs later in separate PRs.
