Conversation

@ngxson (Collaborator) commented Nov 24, 2025

Close #16487
Close #16256
Close #17556

For more details on the WebUI changes, please refer to this comment from @allozaur: #17470 (comment)

This PR introduces the ability to use multiple models in llama-server and to load/unload them on the fly.

The API was designed to take advantage of the OAI-compatible /v1/models endpoint, as well as the "model" field in the body payload of POST requests like /v1/chat/completions. By default, if the requested model is not yet loaded, it is loaded automatically on demand.
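
For illustration, here is a minimal client-side sketch of this routing behaviour, assuming the router listens on the default port 8080 and using a hypothetical model name ("my-model.gguf"). It is written against the cpp-httplib client header that llama.cpp already vendors and is not code from this PR.

// Sketch only: list the available models, then chat with a specific one.
#include "httplib.h"   // vendored cpp-httplib header (include path may differ)
#include <cstdio>
#include <string>

int main() {
    httplib::Client cli("http://localhost:8080");

    // OAI-compatible listing of available models
    if (auto res = cli.Get("/v1/models")) {
        printf("models: %s\n", res->body.c_str());
    }

    // The "model" field selects the target instance; if it is not loaded yet,
    // the router loads it on demand before forwarding the request.
    const std::string body = R"({
        "model": "my-model.gguf",
        "messages": [{"role": "user", "content": "Hello"}]
    })";
    if (auto res = cli.Post("/v1/chat/completions", body, "application/json")) {
        printf("response: %s\n", res->body.c_str());
    }
    return 0;
}

Because loading is on-demand, the POST alone is enough to bring the target instance up the first time that model is referenced.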

This is the first version of the feature and is meant to be experimental. Here is the list of capabilities:

  • API for listing, loading, and unloading models
  • Routing requests based on the "model" field
  • Limiting the maximum number of models loaded at the same time
  • Loading models from a local directory
  • (Advanced) specifying a custom per-model config via the API

Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.

Example commands:

# start the server as a router (using models from the cache)
llama-server

# use GGUFs from a local directory - see directory structure in README.md
llama-server --models-dir ./my_models

# specify default arguments to be passed to models
llama-server -n 128 -c 8192 -ngl 4

# allow setting arguments per-model via the API (warning: only use this in a trusted network)
llama-server --models-allow-extra-args

For the full documentation, please refer to the new "Using multiple models" documentation section.

Note: waiting for further webui changes from @allozaur

Screen.Recording.2025-11-24.at.15.20.05.mp4

Implementation

The feature is implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.

Most of the implementation is confined to tools/server/server-models.cpp.

There is one main "router" server whose job is to create "child" processes that actually run the inference.

This system was designed and tested against these unexpected cases:

  • A child process suddenly exits due to an error (for example, a GGML_ASSERT)
  • A child process fails to load (for example, the system cannot launch the process)
  • The router process suddenly exits due to an error. In this case, child processes automatically stop themselves.

These steps happen when the user requests the router to launch a model instance (see the sketch after this list):

  1. Check if the model already has a process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child
  3. Start the child process
  4. Create a thread to read the child's stdout/stderr and forward it to the main process with a [port_number] prefix
  5. Inside the child process, notify the router server of the "ready" status, then spawn a thread to monitor stdin
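
As a rough illustration of steps 2–4 (not the actual code in tools/server/server-models.cpp), here is a POSIX-only sketch built on the vendored subprocess.h API; the helper names, the exact child arguments, and the error handling are assumptions.

// Illustrative sketch: pick a free port, spawn a child llama-server for one
// model, and forward its output with a [port] prefix until it exits.
#include "subprocess.h"   // vendored single-header process library
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <memory>
#include <string>
#include <thread>

// Step 2 helper: ask the kernel for a free TCP port by binding to port 0.
static int get_free_port_sketch() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0; // 0 = let the kernel choose
    bind(fd, (sockaddr *) &addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(fd, (sockaddr *) &addr, &len);
    int port = ntohs(addr.sin_port);
    close(fd);
    return port;
}

// Steps 2-4: spawn the child with the chosen port, then forward its combined
// stdout/stderr to our own stdout until EOF (EOF = the child has exited).
static void launch_child_sketch(const std::string & model_path) {
    const int port = get_free_port_sketch();
    const std::string port_str = std::to_string(port);

    const char * argv[] = { "llama-server", "-m", model_path.c_str(),
                            "--port", port_str.c_str(), nullptr };

    auto proc = std::make_shared<subprocess_s>();
    int rc = subprocess_create(argv,
                               subprocess_option_inherit_environment |
                               subprocess_option_combined_stdout_stderr,
                               proc.get());
    if (rc != 0) {
        // "child failed to load" case: the process could not be launched
        fprintf(stderr, "failed to launch child for %s\n", model_path.c_str());
        return;
    }

    std::thread([proc, port]() {
        FILE * out = subprocess_stdout(proc.get()); // combined stdout/stderr
        char line[4096];
        while (fgets(line, sizeof(line), out) != nullptr) {
            printf("[%d] %s", port, line); // forward with [port_number] prefix
        }
        // reaching here means stdout/stderr closed, i.e. the child exited
    }).detach();
}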

If the child process exits, the router server knows as soon as the child's stdout/stderr is closed.

In the reverse direction, from the router server:

  • If the router server sends a special command via stdin, the child process detects it, calls its cleanup function, and exits gracefully
  • If the router server crashes, stdin is closed. This triggers exit(1), which causes the child process to exit immediately (see the sketch after the diagram below)
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end
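
For clarity, here is a sketch of the child-side stdin monitor described above; the exact command string (here "exit") and the cleanup hook are assumptions, not the PR's actual implementation.

// Illustrative sketch of the child process watching its own stdin.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

static void cleanup_and_exit_sketch() {
    // placeholder: free the model, stop the HTTP server, then leave gracefully
    std::exit(0);
}

static void start_stdin_monitor_sketch() {
    std::thread([]() {
        char line[256];
        while (fgets(line, sizeof(line), stdin) != nullptr) {
            if (strncmp(line, "exit", 4) == 0) {
                // the router asked for a graceful shutdown
                cleanup_and_exit_sketch();
            }
        }
        // fgets returned NULL: stdin was closed, i.e. the router is gone;
        // exit(1) immediately so the child never outlives its router
        std::exit(1);
    }).detach();
}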

Other changes included in the PR:

  • Added subprocess.h as a new vendored dependency
  • Removed DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex will return an error (except for the server)

AI usage disclosure: most of the code here is human-written, except for:

  • pipe_t implementation used by server_http_proxy
  • get_free_port() function

@allozaur (Collaborator) commented Dec 1, 2025

For visibility, there is still a small bug in the WebUI where the model gets loaded automatically even when the user doesn't send any messages.

Waiting for the fix from @allozaur, then I guess we can merge!

Just pushed the last fixes 😄 @ngxson

@ngxson (Collaborator, Author) commented Dec 1, 2025

Nice, thanks! Merging this once all CI checks are green.

(Note: I'm running the same CI in my fork to skip the long waiting line)

l2k36hk previously approved these changes Dec 1, 2025
@allozaur allozaur dismissed l2k36hk’s stale review December 1, 2025 17:07

Not a review from a maintainer

@allozaur (Collaborator) left a comment:


All is looking good on my end, let's ship it 🚀

@allozaur allozaur linked an issue Dec 1, 2025 that may be closed by this pull request
@allozaur (Collaborator) commented Dec 1, 2025

Also, the changes added in this PR should fix #17556. @pwilkin please check and let me know if the problem is resolved on your end :)

@ngxson ngxson merged commit ec18edf into ggml-org:master Dec 1, 2025
66 of 69 checks passed
@ngxson (Collaborator, Author) commented Dec 1, 2025

Self-note: this also needs to be included in the server's changelog.

@aviallon (Contributor) commented Dec 3, 2025

Thank you, you just annihilated ollama for me.

@allozaur (Collaborator) commented Dec 3, 2025

Thank you, you just annihilated ollama for me.

We will keep adding further enhancements to model management, downloading, configuring etc. Stay tuned!

@romaingyh commented

I'm also in the process of migrating from Ollama, and while searching for how to set up llama.cpp + llama-swap I came across this amazing new multi-model feature. I have a few questions:

  • I don't see --models-allow-extra-args in the code, so is it integrated by default?
  • Is there a way to configure default parameters like temperature or top_k for each model?
  • When a model is loaded, is it unloaded after a period of inactivity?

Sorry if these questions have already been answered somewhere; I'm very new to llama.cpp.

@ServeurpersoCom (Collaborator) commented

Replying to @romaingyh's questions above:

Yes, per-model configuration is the next important step for optimizing each model to max out GPU usage. We're working hard on this. It also requires attention to security (command injection via parameters). Ideally this config should be persisted by the backend to avoid reconfiguring everything when switching PCs (that's the whole point of a web app), but I'm not sure if that's immediately possible.
For --models-allow-extra-args: yeah, it needs proper validation, otherwise anyone can pass arbitrary args.
For auto-unload on inactivity: not yet implemented, but it would be a good addition to avoid wasting VRAM.

