Conversation

@ngxson (Collaborator) commented Nov 24, 2025

Close #16487
Close #16256
Close #17556

For more details on the WebUI changes, please refer to this comment from @allozaur: #17470 (comment)

This PR introduces the ability to use multiple models in llama-server and to load/unload them on the fly.

The API was designed to take advantage of the OAI-compatible /v1/models endpoint, as well as the "model" field in the body payload of POST requests like /v1/chat/completions. By default, if the requested model is not yet loaded, it is loaded automatically on demand.
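
For illustration, here is a minimal client-side sketch of this routing behaviour, assuming the router listens on the default port 8080 and using a hypothetical model name ("my-model.gguf"). It is written against the cpp-httplib client header that llama.cpp already vendors and is not code from this PR.

// Sketch only: list the available models, then chat with a specific one.
#include "httplib.h"   // vendored cpp-httplib header (include path may differ)
#include <cstdio>
#include <string>

int main() {
    httplib::Client cli("http://localhost:8080");

    // OAI-compatible listing of available models
    if (auto res = cli.Get("/v1/models")) {
        printf("models: %s\n", res->body.c_str());
    }

    // The "model" field selects the target instance; if it is not loaded yet,
    // the router loads it on demand before forwarding the request.
    const std::string body = R"({
        "model": "my-model.gguf",
        "messages": [{"role": "user", "content": "Hello"}]
    })";
    if (auto res = cli.Post("/v1/chat/completions", body, "application/json")) {
        printf("response: %s\n", res->body.c_str());
    }
    return 0;
}

Because loading is on-demand, the POST alone is enough to bring the target instance up the first time that model is referenced.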

This is the first version of the feature and is meant to be experimental. Here is the list of capabilities:

  • API for listing, loading, and unloading models
  • Routing requests based on the "model" field
  • Limiting the maximum number of models loaded at the same time
  • Loading models from a local directory
  • (Advanced) specifying a custom per-model config via the API

Other features like downloading new models, deleting cached models, real-time events, etc. are planned for the next iteration.

Example commands:

# start the server as a router (using models from the cache)
llama-server

# use GGUFs from a local directory - see directory structure in README.md
llama-server --models-dir ./my_models

# specify default arguments to be passed to models
llama-server -n 128 -c 8192 -ngl 4

# allow setting arguments per-model via the API (warning: only use this in a trusted network)
llama-server --models-allow-extra-args

For the full documentation, please refer to the new "Using multiple models" documentation section.

Note: waiting for further webui changes from @allozaur

Screen.Recording.2025-11-24.at.15.20.05.mp4

Implementation

The feature is implemented using a multi-process approach. The reason for this choice is to be more resilient in case a model crashes.

Most of the implementation is confined to tools/server/server-models.cpp.

There is one main "router" server whose job is to create "child" processes that actually run the inference.

This system was designed and tested against these unexpected cases:

  • A child process suddenly exits due to an error (for example, a GGML_ASSERT)
  • A child process fails to load (for example, the system cannot launch the process)
  • The router process suddenly exits due to an error. In this case, child processes automatically stop themselves.

These steps happen when the user requests the router to launch a model instance (see the sketch after this list):

  1. Check if the model already has a process; if yes, skip
  2. Construct argv and envp to launch the child process; a random HTTP port is selected for each child
  3. Start the child process
  4. Create a thread to read the child's stdout/stderr and forward it to the main process with a [port_number] prefix
  5. Inside the child process, notify the router server of the "ready" status, then spawn a thread to monitor stdin
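
As a rough illustration of steps 2–4 (not the actual code in tools/server/server-models.cpp), here is a POSIX-only sketch built on the vendored subprocess.h API; the helper names, the exact child arguments, and the error handling are assumptions.

// Illustrative sketch: pick a free port, spawn a child llama-server for one
// model, and forward its output with a [port] prefix until it exits.
#include "subprocess.h"   // vendored single-header process library
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <memory>
#include <string>
#include <thread>

// Step 2 helper: ask the kernel for a free TCP port by binding to port 0.
static int get_free_port_sketch() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr = {};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port        = 0; // 0 = let the kernel choose
    bind(fd, (sockaddr *) &addr, sizeof(addr));
    socklen_t len = sizeof(addr);
    getsockname(fd, (sockaddr *) &addr, &len);
    int port = ntohs(addr.sin_port);
    close(fd);
    return port;
}

// Steps 2-4: spawn the child with the chosen port, then forward its combined
// stdout/stderr to our own stdout until EOF (EOF = the child has exited).
static void launch_child_sketch(const std::string & model_path) {
    const int port = get_free_port_sketch();
    const std::string port_str = std::to_string(port);

    const char * argv[] = { "llama-server", "-m", model_path.c_str(),
                            "--port", port_str.c_str(), nullptr };

    auto proc = std::make_shared<subprocess_s>();
    int rc = subprocess_create(argv,
                               subprocess_option_inherit_environment |
                               subprocess_option_combined_stdout_stderr,
                               proc.get());
    if (rc != 0) {
        // "child failed to load" case: the process could not be launched
        fprintf(stderr, "failed to launch child for %s\n", model_path.c_str());
        return;
    }

    std::thread([proc, port]() {
        FILE * out = subprocess_stdout(proc.get()); // combined stdout/stderr
        char line[4096];
        while (fgets(line, sizeof(line), out) != nullptr) {
            printf("[%d] %s", port, line); // forward with [port_number] prefix
        }
        // reaching here means stdout/stderr closed, i.e. the child exited
    }).detach();
}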

If the child process exits, the router server knows as soon as the child's stdout/stderr is closed.

In the reverse direction, from the router server:

  • If the router server sends a special command via stdin, the child process detects it, calls its cleanup function, and exits gracefully
  • If the router server crashes, stdin is closed. This triggers exit(1), which causes the child process to exit immediately (see the sketch after the diagram below)
sequenceDiagram
    router->>child: spawn with args
    child->>child: load model
    child->>router: POST ready status via API
    Note over child,router: Routing HTTP requests
    alt request shutdown
        router->>child: exit command (via stdin)
        child->>child: clean up & exit
    else router dead
        router-->child: stdin close
        child->>child: force exit
    end
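
For clarity, here is a sketch of the child-side stdin monitor described above; the exact command string (here "exit") and the cleanup hook are assumptions, not the PR's actual implementation.

// Illustrative sketch of the child process watching its own stdin.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

static void cleanup_and_exit_sketch() {
    // placeholder: free the model, stop the HTTP server, then leave gracefully
    std::exit(0);
}

static void start_stdin_monitor_sketch() {
    std::thread([]() {
        char line[256];
        while (fgets(line, sizeof(line), stdin) != nullptr) {
            if (strncmp(line, "exit", 4) == 0) {
                // the router asked for a graceful shutdown
                cleanup_and_exit_sketch();
            }
        }
        // fgets returned NULL: stdin was closed, i.e. the router is gone;
        // exit(1) immediately so the child never outlives its router
        std::exit(1);
    }).detach();
}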

Other changes included in the PR:

  • Added subprocess.h as a new vendored dependency
  • Removed DEFAULT_MODEL_PATH
  • If -m, --model is not specified, common_params_parse_ex will return an error (except for the server)

AI usage disclosure: most of the code here is human-written, except for:

  • pipe_t implementation used by server_http_proxy
  • get_free_port() function

@allozaur (Collaborator) commented Dec 1, 2025

For visibility, there is still a small bug in the WebUI where the model gets loaded automatically even when the user doesn't send any messages.

Waiting for the fix from @allozaur, then I guess we can merge!

Just pushed the last fixes 😄 @ngxson

@ngxson (Collaborator, Author) commented Dec 1, 2025

Nice, thanks! Merging this once all CI checks are green.

(Note: I'm running the same CI in my fork to skip the long waiting line)

l2k36hk previously approved these changes Dec 1, 2025
@allozaur allozaur dismissed l2k36hk’s stale review December 1, 2025 17:07

Not a review from a maintainer

@allozaur (Collaborator) left a comment:


All is looking good on my end, let's ship it 🚀

@allozaur allozaur linked an issue Dec 1, 2025 that may be closed by this pull request
@allozaur (Collaborator) commented Dec 1, 2025

Also, the changes added in this PR should fix #17556. @pwilkin please check and let me know if the problem is resolved on your end :)

@ngxson ngxson merged commit ec18edf into ggml-org:master Dec 1, 2025
66 of 69 checks passed
@ngxson (Collaborator, Author) commented Dec 1, 2025

Self-note: this also needs to be included in the server's changelog.

@aviallon (Contributor) commented Dec 3, 2025

Thank you, you just annihilated ollama for me.

@allozaur (Collaborator) commented Dec 3, 2025

Thank you, you just annihilated ollama for me.

We will keep adding further enhancements to model management, downloading, configuring etc. Stay tuned!

@romaingyh commented

I'm also in the process of migrating from Ollama, and while searching for how to set up llama.cpp + llama-swap I came across this amazing new multi-model feature. I have a few questions:

  • I don't see --models-allow-extra-args in the code, so is it integrated by default?
  • Is there a way to configure default parameters like temperature or top_k for each model?
  • When a model is loaded, is it unloaded after a period of inactivity?

Sorry if these questions have already been answered somewhere; I'm very new to llama.cpp.

@ServeurpersoCom (Collaborator) commented

Replying to @romaingyh's questions above:

Yes, per-model configuration is the next important step for optimizing each model to max out GPU usage. We're working hard on this. It also requires attention to security (command injection via parameters). Ideally this config should be persisted by the backend to avoid reconfiguring everything when switching PCs (that's the whole point of a web app), but I'm not sure if that's immediately possible.
For --models-allow-extra-args: yeah, it needs proper validation, otherwise anyone can pass arbitrary args.
For auto-unload on inactivity: not yet implemented, but it would be a good addition to avoid wasting VRAM.

