feat: support speculative decoding for llamacpp #402
// ConfigName represents the recommended configuration name for the backend,
// It will be inferred from the models in the runtime if not specified, e.g. default,
// speculative-decoding.
// +kubebuilder:default=default
Why remove the default value of the ConfigName field?
llmaz infers the recommended configuration name for the backend if not specified. However, the kubebuilder:default=default annotation prevents this inference by always setting ConfigName to "default" instead of leaving it as nil, bypassing the role-based detection logic by mistake.
Setting the default value here doesn't make any difference, right? The inference is just a guardrail, I believe.
If we set the default value here and leave configName undefined in the Playground, configName will always be set to default instead of speculative-decoding, even when both main and draft models are defined.
I recall now why we removed this before. Makes sense to me.
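For readers following this thread, here is a minimal sketch of why the marker matters. It is not the llmaz implementation: `ModelRole`, `inferConfigName`, and the constants are hypothetical names chosen for illustration, and the sketch assumes the field is an optional pointer that stays nil when the user omits it. With `+kubebuilder:default=default`, the API server fills in "default" before the controller ever sees the object, so the nil branch is never reached.

```go
package main

import "fmt"

// ModelRole is a hypothetical stand-in for the role a model plays in a Playground.
type ModelRole string

const (
	RoleMain  ModelRole = "main"
	RoleDraft ModelRole = "draft"
)

// Configuration names taken from the field comment under review.
const (
	DefaultConfig             = "default"
	SpeculativeDecodingConfig = "speculative-decoding"
)

// inferConfigName returns the explicit name when one is set. Otherwise it
// derives the name from the model roles (assumed behavior, for illustration).
func inferConfigName(configName *string, roles []ModelRole) string {
	if configName != nil {
		// An explicit value, or one injected by API defaulting, always wins.
		return *configName
	}
	hasMain, hasDraft := false, false
	for _, r := range roles {
		switch r {
		case RoleMain:
			hasMain = true
		case RoleDraft:
			hasDraft = true
		}
	}
	if hasMain && hasDraft {
		return SpeculativeDecodingConfig
	}
	return DefaultConfig
}

func main() {
	roles := []ModelRole{RoleMain, RoleDraft}

	// Without the default marker the field stays nil and inference runs.
	fmt.Println(inferConfigName(nil, roles))

	// With the default marker the field arrives pre-filled and inference is skipped.
	defaulted := DefaultConfig
	fmt.Println(inferConfigName(&defaulted, roles))
}
```

Under these assumptions, the first call yields speculative-decoding and the second yields default, which matches the behavior described in the comment above.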
/kind feature
I will take a look tonight or tomorrow at the latest.
/lgtm Thanks @cr7258
/triage accepted
What this PR does / why we need it
Support speculative decoding for llama.cpp, which significantly improves response latency by leveraging draft model predictions. From the logs, we can see the main model and draft model are loaded successfully.
Send an inference request:
curl --request POST \
  --url http://localhost:8080/v1/completions \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

{"choices":[{"text":"\n1. Register a domain name (if you don’t have one yet)\n2. Choose a web host\n3. Choose a WordPress theme\n4. Create your website (using WordPress)\n5. Create and upload your website\n6. Optimize your website for search engines\n8. Test your website\n9. Monitor your website\n10. Update your website\nBuilding a website is not difficult if you have the right tools and resources.\nIn this article, we’ll cover everything you need to know to get started, including choosing a domain name, finding a web","index":0,"logprobs":null,"finish_reason":"length"}],"created":1746539608,"model":"gpt-3.5-turbo","system_fingerprint":"b5280-27aa2595","object":"text_completion","usage":{"completion_tokens":128,"prompt_tokens":14,"total_tokens":142},"id":"chatcmpl-mNfNdrU0Kivf873PYZyw0R5Ahhi1KNnX","timings":{"prompt_n":1,"prompt_ms":194.28,"prompt_per_token_ms":194.28,"prompt_per_second":5.147210212065061,"predicted_n":128,"predicted_ms":35572.597,"predicted_per_token_ms":277.9109140625,"predicted_per_second":3.598275380344033,"draft_n":155,"draft_n_accepted":15}}

Which issue(s) this PR fixes
Fixes #240
Special notes for your reviewer
Does this PR introduce a user-facing change?