server: introduce API for serving / loading / unloading multiple models #17470
Conversation
Nice, thanks! Merging this once CI is all green. (Note: I'm running the same CI in my fork to skip the long waiting line.)
allozaur left a comment:
All is looking good on my end, let's ship it 🚀
Self-note: this also needs to be included in the server's changelog.
Thank you, you just annihilated ollama for me.
We will keep adding further enhancements to model management, downloading, configuring, etc. Stay tuned!
I'm also in the process of migrating from ollama, and while searching for how to set up llama.cpp + llama-swap I came across this amazing new feature of multiple models, and I have a few questions:
Sorry if these questions have already been answered somewhere; I'm very new to llama.cpp.
Yes, it's the next important step for optimizing each model to max out GPU usage. We're working hard on this. It also requires attention to security (command injection via parameters). Ideally this config should be remembered by the backend to avoid reconfiguring everything when switching PCs (that's the whole point of a web app), but I'm not sure if that's immediately possible.
Close #16487
Close #16256
Close #17556
For more details on the WebUI changes, please refer to this comment from @allozaur: #17470 (comment)
This PR introduces the ability to use multiple models and to load/unload them on the fly in `llama-server`.

The API was designed to take advantage of the OAI-compat `/v1/models` endpoint, as well as the `"model"` field in the body payload for POST requests like `/v1/chat/completions`. By default, if the requested model is not yet loaded, it will be loaded automatically on demand.

This is the first version of the feature and aims to be experimental. Here is the list of capabilities:

- choosing which model serves a request via the `"model"` field

Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.
Example commands:
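Below is a minimal sketch of what these commands could look like, assuming a router listening on `localhost:8080`; the model name is a placeholder, and the exact identifiers depend on which models the server was started with:

```sh
# List the models known to the server (OAI-compatible endpoint)
curl http://localhost:8080/v1/models

# Target a specific model via the "model" field; if it is not loaded yet,
# it is loaded automatically on demand
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```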
For the full documentation, please refer to the "Using multiple models" section of the new documentation
Note: waiting for further webui changes from @allozaur
Screen.Recording.2025-11-24.at.15.20.05.mp4
Implementation
The feature was implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.
Most of the implementation is confined inside `tools/server/server-models.cpp`. There will be one main "router" server whose job is to create other "child" processes that will actually run the inference.
This system was designed and tested against these unexpected cases:

- the child process crashes or aborts unexpectedly (for example, hitting a `GGML_ASSERT`)

These steps happen when the user requests the router to launch a model instance:

1. The router spawns a child process with the appropriate arguments, including the assigned `[port_number]`
2. The child process loads the model
3. The child process reports its ready status back to the router via a POST request, after which the router starts routing HTTP requests to it

If the child process exits, the router server knows that as soon as its stdout/stderr is closed.

In reverse, from the router server:

- on a shutdown request, the router sends an exit command via the child's stdin; the child cleans up and exits
- if the router dies, the child's stdin is closed; the child then calls `exit(1)`, which causes the child process to exit immediately

```mermaid
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end
```
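To make the stdin-based lifetime coupling above more concrete, here is a toy shell sketch of the child side of the protocol; it is not the llama.cpp code (which lives in `tools/server/server-models.cpp`), just an illustration of treating an explicit `exit` command and a closed stdin as equivalent shutdown signals:

```sh
#!/usr/bin/env sh
# Toy stand-in for a child process: block on stdin and shut down when
# the router either sends "exit" or disappears (which closes the pipe).
while IFS= read -r cmd; do
  if [ "$cmd" = "exit" ]; then
    echo "exit command received, cleaning up" >&2
    break
  fi
done
# `read` returns non-zero once stdin is closed, so a dead router also
# brings us here and the process terminates.
echo "child shutting down" >&2
```

In the actual implementation, the child additionally reports its ready status back to the router over HTTP before requests are routed to it, as shown in the sequence diagram above.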
Other changes included in the PR:

- `DEFAULT_MODEL_PATH` is removed; if `-m, --model` is not specified, `common_params_parse_ex` will return an error (except for the server)

AI usage disclosure: Most of the code here is human-written, except for:

- the `pipe_t` implementation used by `server_http_proxy`
- the `get_free_port()` function