ggml : mul mat tweaks #2372
Conversation
Force-pushed from eab05a7 to 0822d27 (ggml-ci)
Force-pushed from b2e3e34 to a2eb57e
Can you check if this branch makes any performance difference on your system?
I'm at work right now... will check when I get home this evening.
Hi, I follow this issue too and these are the results in my case: [benchmark tables for the master and mul-mat-tweaks branches omitted]. The values are consistent across multiple runs.
I believe that the latest commit in this branch (2076a9b) should restore the old performance. Please let me know if you get the chance to give it a try.
Force-pushed from b324b04 to 450a7c7 (ggml-ci)
```c
for (int64_t ir0 = iir0; ir0 < iir0 + blck_0 && ir0 < ir011; ++ir0) {
    vec_dot(ne00, &tmp[ir0 - iir0], src0_row + ir0*nb01, src1_col);
}
```
As far as I know, vec_dot is cache-unfriendly for GEMM. IIRC you need a different loop order to reach high FLOPs.
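To illustrate the point, here is a generic sketch of the kind of reordering the comment alludes to (not ggml's code; the function name and square row-major shapes are assumptions for illustration). The i-k-j order keeps the innermost accesses contiguous, unlike the textbook dot-product order that strides through the second matrix column by column:

```c
// Generic GEMM loop-order sketch (illustrative, not ggml code):
// C += A*B for square row-major float matrices of size n.
// The i-k-j order makes the inner loop stream rows of B and C
// contiguously; the naive i-j-k (dot product) order instead walks
// B with stride n and misses the cache on nearly every element.
void gemm_ikj(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            const float a = A[i*n + k];        // scalar reused across j
            for (int j = 0; j < n; j++) {
                C[i*n + j] += a * B[k*n + j];  // contiguous loads/stores
            }
        }
    }
}
```

Real kernels go further with register tiling and SIMD, but the access-pattern point is the same.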
Open to suggestions for an alternative implementation.
I think what I achieve with these changes is to put a few rows from both tensors in L2 cache and do the full dot products. Maybe there is something smarter to utilize the L1 cache better?
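For reference, a minimal standalone sketch of that blocking idea as I read it (float-only, contiguous rows, made-up tile sizes; ggml's real code dispatches quant-aware vec_dot callbacks and splits the work across threads):

```c
#include <stddef.h>

// Stand-in for ggml's vec_dot callbacks (float32 case only).
static void vec_dot_f32(int n, float *s, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += x[i]*y[i];
    *s = sum;
}

enum { BLCK_0 = 16, BLCK_1 = 16 }; // tile sizes: illustrative, not tuned

// dst is nr1 x nr0, src0 is nr0 x ne00, src1 is nr1 x ne00 (row-major).
// Each BLCK_0 x BLCK_1 tile reuses the same few src0 rows from cache
// while computing full-length dot products against several src1 rows.
void mul_mat_blocked(int ne00, int nr0, int nr1,
                     const float *src0, const float *src1, float *dst) {
    for (int iir1 = 0; iir1 < nr1; iir1 += BLCK_1) {
        for (int iir0 = 0; iir0 < nr0; iir0 += BLCK_0) {
            for (int ir1 = iir1; ir1 < iir1 + BLCK_1 && ir1 < nr1; ir1++) {
                const float *src1_row = src1 + (size_t)ir1*ne00;
                for (int ir0 = iir0; ir0 < iir0 + BLCK_0 && ir0 < nr0; ir0++) {
                    vec_dot_f32(ne00, &dst[(size_t)ir1*nr0 + ir0],
                                src0 + (size_t)ir0*ne00, src1_row);
                }
            }
        }
    }
}
```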
Here is the best I came up with for now without trying to rebuild OpenBLAS.
With this change I think we can (almost) get rid of OpenBLAS! The numbers look great 👍. These tests were done on 7B on a 4-core/8-thread Xeon v2, running on 4 threads. Additionally, I get a small but consistent improvement in prompt eval times on 8 threads (keep in mind OpenBLAS already uses 8 threads, as it ignores the --threads option). I think this is because prompt evaluation is not bottlenecked by memory the way inference is.
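A back-of-the-envelope way to see this (the numbers below are assumed for illustration, not measurements from this thread): generating one token has to stream essentially all of the weights once, so generation speed is capped by memory bandwidth, while a prompt batch reuses each loaded weight across many tokens and can stay compute-bound.

```latex
% Rough bandwidth ceiling for single-token generation (assumed numbers):
\text{tokens/s} \;\lesssim\; \frac{\text{memory bandwidth}}{\text{weight bytes per token}}
              \;=\; \frac{50\ \text{GB/s}}{3.5\ \text{GB}} \;\approx\; 14
```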
Will complete the TODOs later in separate PRs.
ref: ggml-org/ggml#293
See if #2355 can be resolved, and also make sure we broadcast correctly across dims 2 and 3 (see the sketch below).
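For context, a hedged sketch of what broadcasting across dims 2 and 3 would mean here (my reading of the goal, with assumed names; not code from this PR): src1 may carry more dim-2/3 slices than src0, and each src1 slice pairs with the src0 slice obtained by integer division by the broadcast ratio.

```c
// Map a src1 slice index to the src0 slice it broadcasts against
// (assumed convention: ne1x is a whole multiple of ne0x).
static int broadcast_index(int i1x, int ne1x, int ne0x) {
    const int r = ne1x / ne0x; // broadcast ratio along this dim
    return i1x / r;            // each src0 slice is reused r times
}

// Example: ne02 = 2 src0 slices, ne12 = 8 src1 slices -> ratio 4,
// so src1 slices 0..3 use src0 slice 0 and slices 4..7 use slice 1.
```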
Also, did a first try at block-tiled matrix multiplication, which seems to slightly improve performance.
Currently, my tests on x86 show that OpenBLAS does not offer any advantage over the non-BLAS path for either prompt processing or text generation.
Looking for reports where this is not the case. Make sure to use a thread count equal to the number of performance cores on your system.
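For anyone reproducing this, an invocation along these lines should do (the model path, prompt, token count, and thread count are placeholders; set -t to your number of physical performance cores):

```sh
./main -m models/7B/ggml-model-q4_0.bin -p "Building a website can be done in" -n 128 -t 4
```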
TODO: