Conversation

@NeoZhangJianyu (Collaborator) commented Mar 6, 2024

  1. Add wait() calls to make the code stable.
  2. Use fp32 for the oneMKL gemm_batch calls for better performance (see the sketch below this list).
  3. Add a debug function.
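
For context, here is a minimal, hypothetical sketch (not taken from this PR's diff) of what items 1 and 2 amount to: a batched GEMM issued in fp32 through oneMKL's strided USM gemm_batch API, followed by an explicit queue.wait() so the host does not read results before the device work has finished. The function name, leading dimensions, and strides below are illustrative assumptions, not the backend's actual code.

```cpp
// Hypothetical sketch: fp32 batched GEMM via oneMKL, then wait() for stability.
#include <sycl/sycl.hpp>
#include <oneapi/mkl.hpp>
#include <cstdint>

// Illustrative helper (not from the PR): C_i = A_i * B_i for each batch entry,
// with all matrices stored contiguously in column-major order in USM memory.
static void gemm_batch_fp32_example(sycl::queue &q,
                                    const float *a, const float *b, float *c,
                                    std::int64_t m, std::int64_t n, std::int64_t k,
                                    std::int64_t batch) {
    const std::int64_t lda = m, ldb = k, ldc = m;
    const std::int64_t stride_a = m * k, stride_b = k * n, stride_c = m * n;

    // fp32 batched GEMM: C = 1.0f * A * B + 0.0f * C for each batch entry.
    oneapi::mkl::blas::column_major::gemm_batch(
        q,
        oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
        m, n, k,
        1.0f, a, lda, stride_a,
              b, ldb, stride_b,
        0.0f, c, ldc, stride_c,
        batch);

    // Wait for the submitted device work to finish before the host touches C.
    q.wait();
}
```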

Current performance reference:

GPU: 1 Arc 770
OS: ubuntu 22.04
Param: -mg 0 -sm none
Model: llama-2-7b.Q4_0.gguf

Avg: 30.66 tokens per second

NeoZhangJianyu requested a review from airMeng on Mar 6, 2024 at 03:33
@airMeng (Contributor) left a comment

Better to leave some performance data for future reference.

@NeoZhangJianyu (Collaborator, Author) commented

> Better to leave some performance data for future reference.

Yes, updated it in the first comment.

@airMeng (Contributor) commented Mar 6, 2024

> Better to leave some performance data for future reference.
>
> Yes, updated it in the first comment.

Do you have comparisons before and after?

NeoZhangJianyu merged commit 8ced9f7 into ggml-org:master on Mar 6, 2024
hazelnutcloud pushed a commit to hazelnutcloud/llama.cpp that referenced this pull request Mar 10, 2024
NeoZhangJianyu added a commit to NeoZhangJianyu/llama.cpp that referenced this pull request Mar 12, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024