
Conversation

@ggerganov (Member) commented Jul 24, 2023

ref: ggml-org/ggml#293

See if #2355 can be resolved, and also make sure we broadcast correctly across dims 2 and 3.

Also, I made a first attempt at block-tiled matrix multiplication, which seems to improve performance slightly.
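For readers unfamiliar with the idea, here is a minimal sketch of block-tiled matrix multiplication in plain C. This is not the ggml implementation; the tile sizes `BLK_0`/`BLK_1` are illustrative placeholders, and B is stored column-major to mirror how a `vec_dot`-style kernel consumes one contiguous src1 column at a time:

```c
#include <stddef.h>

#define BLK_0 16  // rows of A per tile (illustrative)
#define BLK_1 16  // columns of B per tile (illustrative)

// A: M x K, row-major. B: K x N, column-major (each column contiguous).
// C: M x N, row-major. Each C[i][j] is a full dot product of one A row
// with one B column; tiling keeps the BLK_0 rows and BLK_1 columns of the
// current tile hot in cache while all of their dot products are computed.
static void gemm_tiled(int M, int N, int K,
                       const float *A, const float *B, float *C) {
    for (int j0 = 0; j0 < N; j0 += BLK_1) {
        for (int i0 = 0; i0 < M; i0 += BLK_0) {
            for (int j = j0; j < j0 + BLK_1 && j < N; ++j) {
                const float *b_col = B + (size_t) j*K;
                for (int i = i0; i < i0 + BLK_0 && i < M; ++i) {
                    const float *a_row = A + (size_t) i*K;
                    float sum = 0.0f;
                    for (int k = 0; k < K; ++k) {
                        sum += a_row[k] * b_col[k]; // "vec_dot" of row i with column j
                    }
                    C[(size_t) i*N + j] = sum;
                }
            }
        }
    }
}
```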

Currently, my tests on x86 show that OpenBLAS offers no advantage over the non-BLAS path, for either prompt processing or text generation.

Looking for reports where this is not the case. Make sure to use a thread count equal to the number of performance cores on your system.

TODO:

@ggerganov (Member, Author)

@coadmonky

Can you check if this branch makes any performance difference on your system?

@coadmonky commented Jul 24, 2023 via email

@yukiteruamano (Contributor)

Hi, I have been following this issue too, and these are my results:

master branch

llama_print_timings:        load time =  1227,29 ms
llama_print_timings:      sample time =    83,16 ms /   136 runs   (    0,61 ms per token,  1635,36 tokens per second)
llama_print_timings: prompt eval time = 11299,55 ms /    53 tokens (  213,20 ms per token,     4,69 tokens per second)
llama_print_timings:        eval time = 30410,48 ms /   136 runs   (  223,61 ms per token,     4,47 tokens per second)
llama_print_timings:       total time = 162198,49 ms

mul-mat-tweaks branch

llama_print_timings:        load time =  1230,44 ms
llama_print_timings:      sample time =   141,66 ms /   250 runs   (    0,57 ms per token,  1764,84 tokens per second)
llama_print_timings: prompt eval time =  9965,68 ms /    53 tokens (  188,03 ms per token,     5,32 tokens per second)
llama_print_timings:        eval time = 56549,68 ms /   250 runs   (  226,20 ms per token,     4,42 tokens per second)
llama_print_timings:       total time = 128116,71 ms

The values are consistent across multiple runs.

@ggerganov (Member, Author)

@coadmonky

I believe the latest commit on this branch (2076a9b) should restore the old performance. Please let me know if you get a chance to give it a try.

@ggerganov marked this pull request as ready for review on July 25, 2023 at 10:33
//}

for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir011; ++ir0) {
vec_dot(ne00, &tmp[ir0 - iir0], src0_row + ir0*nb01, src1_col);
Contributor:

As far as I know, computing GEMM as a series of vec_dot calls is cache-unfriendly. IIRC you need a different loop order to reach high FLOPS.
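To illustrate the point, here is a minimal sketch of the classic loop-interchange trick in plain C (not ggml code; all matrices row-major, C assumed zero-initialized). In the naive i-j-k (dot-product) order, every output element walks a strided column of B; reordering to i-k-j keeps every access unit-stride and reuses each loaded element of A across a whole row of C:

```c
// i-k-j order: A[i][k] is loaded once and reused across an entire row of C,
// while B and C are both traversed with unit stride in the inner loop.
static void gemm_ikj(int M, int N, int K,
                     const float *A, const float *B, float *C) {
    for (int i = 0; i < M; ++i) {
        for (int k = 0; k < K; ++k) {
            const float a_ik = A[i*K + k];
            for (int j = 0; j < N; ++j) {
                C[i*N + j] += a_ik * B[k*N + j]; // contiguous loads and stores
            }
        }
    }
}
```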

@ggerganov (Member, Author):

Open to suggestions for an alternative implementation.

I think what I achieve with these changes is to put a few rows from both tensors in the L2 cache and do the full dot products. Maybe there is something smarter that utilizes the L1 cache better?
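For a rough sense of scale, here is a back-of-the-envelope footprint calculation with illustrative numbers (ne00 = 4096 as in a 7B model, a hypothetical 16x16 tile, Q4_0 weights at 18 bytes per 32 values, f32 src1 columns); these are assumptions for illustration, not measurements from this PR:

```c
#include <stdio.h>

int main(void) {
    const int ne00   = 4096; // elements per src0 row / src1 column (illustrative)
    const int blck_0 = 16;   // src0 rows in the tile (hypothetical)
    const int blck_1 = 16;   // src1 columns in the tile (hypothetical)

    const double src0_bytes = blck_0 * (ne00 / 32.0) * 18.0; // Q4_0: 18 bytes per 32 values
    const double src1_bytes = blck_1 *  ne00         *  4.0; // f32 columns

    printf("src0 tile: %6.1f KiB\n", src0_bytes / 1024.0); // ~36 KiB
    printf("src1 tile: %6.1f KiB\n", src1_bytes / 1024.0); // ~256 KiB
    return 0;
}
```

With f32 activations the src1 side of such a tile alone is already on the order of a typical L2, so the tile would have to shrink, or the activations would need a more compact format, for the working set to fit closer to the core (if I recall correctly, ggml already quantizes src1 to 8-bit blocks before the quantized dot products, which shrinks that side by roughly 4x).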

Contributor:

Here is the best I came up with for now, without trying to rebuild OpenBLAS.

@netrunnereve (Collaborator) commented Jul 27, 2023

With this change I think we can (almost) get rid of OpenBLAS! The numbers look great 👍.

These tests were done with the 7B model on a 4-core/8-thread Xeon v2, running on 4 threads.

master

llama_print_timings:        load time =   399.63 ms
llama_print_timings:      sample time =   253.82 ms /   100 runs   (    2.54 ms per token,   393.98 tokens per second)
llama_print_timings: prompt eval time = 72117.89 ms /   401 tokens (  179.85 ms per token,     5.56 tokens per second)
llama_print_timings:        eval time = 21412.89 ms /    99 runs   (  216.29 ms per token,     4.62 tokens per second)
llama_print_timings:       total time = 93809.82 ms

master + openblas

llama_print_timings:        load time =   436.86 ms
llama_print_timings:      sample time =   253.78 ms /   100 runs   (    2.54 ms per token,   394.04 tokens per second)
llama_print_timings: prompt eval time = 60198.77 ms /   401 tokens (  150.12 ms per token,     6.66 tokens per second)
llama_print_timings:        eval time = 21452.36 ms /    99 runs   (  216.69 ms per token,     4.61 tokens per second)
llama_print_timings:       total time = 81930.22 ms

master + clblast (only for comparison's sake)

llama_print_timings:        load time =   394.55 ms
llama_print_timings:      sample time =   253.29 ms /   100 runs   (    2.53 ms per token,   394.81 tokens per second)
llama_print_timings: prompt eval time = 11543.12 ms /   401 tokens (   28.79 ms per token,    34.74 tokens per second)
llama_print_timings:        eval time = 21335.05 ms /    99 runs   (  215.51 ms per token,     4.64 tokens per second)
llama_print_timings:       total time = 33155.94 ms

mul-mat-tweaks

llama_print_timings:        load time =   400.97 ms
llama_print_timings:      sample time =   251.99 ms /   100 runs   (    2.52 ms per token,   396.85 tokens per second)
llama_print_timings: prompt eval time = 62045.15 ms /   401 tokens (  154.73 ms per token,     6.46 tokens per second)
llama_print_timings:        eval time = 21038.19 ms /    99 runs   (  212.51 ms per token,     4.71 tokens per second)
llama_print_timings:       total time = 83359.80 ms

Additionally, I get a small but consistent improvement in prompt eval times with 8 threads (keep in mind that OpenBLAS already uses 8 threads, as it ignores the --threads option). I think this is because prompt evaluation is not bottlenecked by memory the way single-token inference is (a rough estimate follows the 8-thread numbers below).

mul-mat-tweaks (8 threads)

llama_print_timings:        load time =   395.60 ms
llama_print_timings:      sample time =   251.41 ms /   100 runs   (    2.51 ms per token,   397.76 tokens per second)
llama_print_timings: prompt eval time = 56967.28 ms /   401 tokens (  142.06 ms per token,     7.04 tokens per second)
llama_print_timings:        eval time = 21124.33 ms /    99 runs   (  213.38 ms per token,     4.69 tokens per second)
llama_print_timings:       total time = 78367.45 ms
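A back-of-the-envelope sketch of that argument, with illustrative numbers (7B parameters, Q4_0 at 18 bytes per 32 weights, the 401-token prompt from the runs above); this is an assumption-laden estimate, not a measurement:

```c
#include <stdio.h>

int main(void) {
    // Illustrative arithmetic-intensity estimate for a Q4_0 7B model.
    const double n_params     = 7e9;
    const double bytes_per_w  = 18.0 / 32.0; // Q4_0: 18 bytes per 32 weights
    const double weight_bytes = n_params * bytes_per_w;

    // Generating one token reads every weight once for ~2 FLOPs (mul+add),
    // so the intensity stays tiny and memory bandwidth is the limit.
    printf("generation: %7.1f FLOPs/byte\n", 2.0 * n_params / weight_bytes);

    // A 401-token prompt processed as a batch reuses each loaded weight 401 times,
    // so the work per byte of weights is ~400x higher and the cores become the limit.
    printf("prompt:     %7.1f FLOPs/byte\n", 401.0 * 2.0 * n_params / weight_bytes);
    return 0;
}
```

At a handful of FLOPs per byte, the matrix-vector products of generation saturate memory bandwidth long before the cores do, whereas the batched prompt matmuls reuse each loaded weight hundreds of times, which is why extra threads pay off for prompt processing.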

@ggerganov (Member, Author)

I will complete the TODOs later in separate PRs.
