
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Dec 6, 2025

Ref: #17618

Fix: #11202

We are moving to a new CLI experience with the main code built on top of llama-server. This brings many additional features into llama-cli, making the experience feel mostly like a smaller version of the web UI:

  • Multimodal support
  • Regenerate last message
  • Speculative decoding
  • Full Jinja support (including some edge cases that the old llama-cli doesn't support)
[image]

TODO:

Features planned for next versions:

  • Allow hiding/showing timings and reasoning content (saving user preferences to disk and reusing them later)
  • Allow exporting/importing conversations
  • Support raw completion
  • Support remote media URLs (downloaded from the internet)
  • Show progress (as a percentage) if prompt processing takes too long

TODO for console system:

  • Auto-completion for commands and file paths
  • Support "temporary display", for example clear loading messages when it's done

@github-actions github-actions bot added the script, testing, examples, devops, and server labels Dec 6, 2025
@ngxson ngxson merged commit 6c21317 into ggml-org:master Dec 10, 2025
73 of 75 checks passed
@ggerganov
Member

When we read a file, maybe we can start processing the prompt immediately? Just post a task with n_predict == 0, stream == false.
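
For illustration, here is a minimal sketch of that idea at the HTTP level, assuming a llama-server instance already running at http://localhost:8080. The new llama-cli drives the server core directly, so the real change would post the task internally rather than over HTTP; the helper name preprocess_prompt and the file path below are hypothetical. A completion task with n_predict: 0 and stream: false lets the server ingest the prompt and fill the KV cache without generating any tokens:

    # Hypothetical sketch: warm the prompt cache by posting a completion task that
    # generates zero tokens. Assumes a llama-server instance at localhost:8080.
    import json
    import urllib.request

    def preprocess_prompt(prompt: str, base_url: str = "http://localhost:8080") -> dict:
        payload = {
            "prompt": prompt,
            "n_predict": 0,        # process the prompt, generate nothing
            "stream": False,       # no streaming; we only want the final (empty) result
            "cache_prompt": True,  # keep the processed tokens in the KV cache for reuse
        }
        req = urllib.request.Request(
            f"{base_url}/completion",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    if __name__ == "__main__":
        # Example: kick off prompt processing right after reading a large file, so the
        # first real generation request hits an already-warm cache
        # ("large_context.txt" is a placeholder path).
        with open("large_context.txt", "r", encoding="utf-8") as f:
            preprocess_prompt(f.read())

With cache_prompt enabled, a later generation request that reuses the same prefix can start from the already-processed context instead of re-evaluating the whole file.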

@ngxson

This comment was marked as outdated.

@ngxson
Collaborator Author

ngxson commented Dec 10, 2025

Ah nevermind, I see what you mean. Hmm yeah that could also be a good idea

@bandoti
Collaborator

bandoti commented Dec 10, 2025

@ngxson great work on this. You're lightning fast ⚡️!! 😊

@MB7979

MB7979 commented Dec 11, 2025

I hope that llama-completion will remain available long term. It’s not possible to output to a file with this new CLI experience, or to do raw completions, or to do a bunch of other things. Not everyone wants a chat experience. Would also be super helpful if llama-completion was documented somewhere.

Thank you for your work.

@andrew-aladev
Contributor

Well, we are missing llama-completion in the Docker image; for now, we only have a way to build a full image with the .devops/tools.sh entrypoint. But .devops/tools.sh doesn't have llama-completion included.

Meanwhile, there is no way to disable this chat experience or interactive mode in order to get just raw output.

@andrew-aladev
Contributor

andrew-aladev commented Dec 11, 2025

Hello @ngxson, what does it mean that --no-conversation is not supported by llama-cli? Do you mean that the llama.cpp project itself is dropping regular inference capabilities and only keeping interactive chats (like the web UI)? Why have you moved regular inference to llama-completion and marked it as legacy? Is this the expected behavior of the llama.cpp project, and should users who don't want interactive mode leave llama.cpp? Or is this just a temporary issue, and do you plan to implement a non-interactive mode later?

@ngxson
Collaborator Author

ngxson commented Dec 11, 2025

Friendly reminder that this is an open-source project and missing features can be added by contributors.

I won't comment further on missing features; there are already TODOs in the code for this purpose.

@andrew-aladev
Contributor

I don't want to offend anyone, but I predict that "I won't comment further" won't be possible, because you have just completely dropped the ability for users to capture LLM output (!). So users will create issues referencing this pull request, and you will have to comment anyway.

@bandoti
Collaborator

bandoti commented Dec 11, 2025

@andrew-aladev please see my comments on the referenced issue. It is fine to reference this PR in issues but it is best to keep the conversation in those issues.

It is good that folks speak up if there's a need and file bug reports. These changes are needed to move forward because of technical debt in llama-cli that built up during the evolution of its capabilities.

If there are any further issues regarding this please copy me on there and I will work on cataloging user needs.

@CISC
Collaborator

CISC commented Dec 11, 2025

Well, we are missing llama-completion in the Docker image; for now, we only have a way to build a full image with the .devops/tools.sh entrypoint. But .devops/tools.sh doesn't have llama-completion included.

That is obviously an oversight that needs to be fixed; llama-completion should be included in the Docker images.

@tildebyte

Guys (@andrew-aladev, @MB7979, others who complain first and listen later);

If you have the time to post here (multiple posts, even), you certainly have the time for due diligence, E.G.!! #17618, which starts off with

Important

For people coming here to complain about this breaking your workflow:

    - llama-completion is there and we won't remove it
    - Read https://github.com/ggml-org/llama.cpp/discussions/17618#discussioncomment-15233169 to understand why this move is needed 

This kind of thing is why open source maintainers[1] lose all of their hair, drop out, and decide to go to nursing school.

[1] Yes, I am one.

@MB7979

MB7979 commented Dec 11, 2025

Guys (@andrew-aladev, @MB7979, others who complain first and listen later);

If you have the time to post here (multiple posts, even), you certainly have the time for due diligence, E.G.!! #17618, which starts off with

Important

For people coming here to complain about this breaking your workflow:

    - llama-completion is there and we won't remove it
    - Read https://github.com/ggml-org/llama.cpp/discussions/17618#discussioncomment-15233169 to understand why this move is needed 

This kind of thing is why open source maintainers[1] lose all of their hair, drop out, and decide to go to nursing school.

[1] Yes, I am one.

That message was edited to add that clarification 30 minutes ago. So you are chastising us for not finding something that didn’t exist when I posted and is in fact a direct response to said questions being raised.

For what it’s worth I did check PRs, issues, and discussions yesterday. I missed that discussion (it’s a few weeks old) and as soon as I found it I took my feedback there.

@ngxson
Collaborator Author

ngxson commented Dec 11, 2025

@MB7979 Didn't you ignore this message that was added a whole 2 weeks ago?

https://github.com/ngxson/llama.cpp/blob/3632271eb98541bd6fd726f4cd2a973d89a195ed/tools/main/main.cpp#L524-L528

    LOG_WRN("*****************************\n");
    LOG_WRN("IMPORTANT: The current llama-cli will be moved to llama-completion in the near future\n");
    LOG_WRN("  New llama-cli will have enhanced features and improved user experience\n");
    LOG_WRN("  More info: https://github.com/ggml-org/llama.cpp/discussions/17618\n");
    LOG_WRN("*****************************\n");

If you missed it, you have just implicitly proved that llama-cli UX was not very good.

@MB7979

MB7979 commented Dec 11, 2025

I had not updated llama.cpp for a few weeks. I only did so on reading the new cli commit, as I was concerned the old functionality would be removed, which it was.

I really don’t understand the defensive tone being taken here. I’m not sure about the other poster, but my only intent was to ascertain whether llama-completion would be an ongoing part of the project, and to suggest some documentation to go with the changes to avoid people like me wasting your time with such questions.

I take it feedback is unwelcome here and I will not participate further.

@ngxson
Collaborator Author

ngxson commented Dec 11, 2025

I really don’t understand the defensive tone being taken here

I am just speaking the truth.

and to suggest some documentation to go with the changes to avoid people like me wasting your time with such questions.

Then what is your suggestion? There was already a discussion and a notice inside llama-cli

I had not updated llama.cpp for a few weeks

I missed that discussion (it’s a few weeks old)

If we had had better documentation, what would have prevented you from missing it again?

(Again, I'm just speaking the truth here)

@MB7979

MB7979 commented Dec 11, 2025

Include it with all the other examples, listed on the front page of this repository.

@ngxson
Collaborator Author

ngxson commented Dec 11, 2025

Include it with all the other examples

you said earlier:

I had not updated llama.cpp for a few weeks


listed on the front page of this repository.

Probably fair, I haven't touched the main README.md for a long time - even the last 2 big changes in llama-server and llama-cli weren't on the list

@tildebyte

@MB7979 TBH I was much more addressing @andrew-aladev than you. Apologies if I offended; I knew about these changes at least from the beginning of this week, but I was mistaken in how I knew 😬 (not unusual for me :) )

@bandoti
Collaborator

bandoti commented Dec 11, 2025

I think for the many folks who rely on the old llama-cli features, there is no need to panic for now: llama-completion isn't going anywhere. It is the same old (legacy) llama-cli application with a new name. Just because it is deemed "legacy" does not mean it will be deleted 🙂

Lots of changes happen quickly in this codebase compared to other projects because of the interest in AI. It is sometimes hard to track everything happening, but it is important for everyone to try their best.

I will be creating a discussion in the following week to establish some of the user journeys and see if I can't come up with a roadmap of sorts.

Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Dec 12, 2025
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <[email protected]>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writter

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <[email protected]>

@jsjtxietian jsjtxietian Dec 13, 2025

https://github.com/ggml-org/llama.cpp/blame/master/tools/llama-bench/README.md has an outdated link to main/README.md. IMHO it should be updated too? Happy to help if needed.

Collaborator

Sure that would be great. If you find any broken links like that please feel free to submit a PR. Thank you!

Contributor

@andrew-aladjev andrew-aladjev Dec 13, 2025

@bandoti Hello, I've added PR #17993; it fixes dead links, including the link in llama-bench. Please review, thank you 😊

PS: I am walking to the shop, so I accidentally made a double post. Sorry!

I've added PR #17993; it fixes dead links, including the link in llama-bench

That's great!

wangqi pushed a commit to wangqi/llama.cpp that referenced this pull request Dec 14, 2025
Notable changes:
- Fix race conditions in threadpool (ggml-org#17748)
- New CLI experience (ggml-org#17824)
- Vision model improvements (clip refactor, new models)
- Performance fixes (CUDA MMA, Vulkan improvements)
- tools/main renamed to tools/completion

Conflict resolution:
- ggml-cpu.c: Use new threadpool->n_threads API (replaces n_threads_max),
  keep warning suppressed to reduce log noise

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
