
Releases: LostRuins/koboldcpp

koboldcpp-1.112.1

20 Apr 12:04

koboldcpp-1.112.1

Finally made it edition

KoboldCpp is now the top fork in the original list of linked llama.cpp forks! We have finally crossed 10k stars and overtaken alpaca.cpp (and it only took us 3 years to catch up).

  • NEW: Added support for AceStepXL models.
    • AceStep XL uses the same AceStep LM, Embedder and VAE as AceStep1.5, which you can get here (implementation referenced from @ServeurpersoCom)
    • Also fixed a bug affecting music quality on Vulkan, and further reduced memory footprint in --musiclowvram mode
  • NEW: Added support for reasoning budget/reasoning effort - This is now supported when generating with a thinking model over the API. Pass the field reasoning_effort to set the budget; supported values are high, medium, low, minimal and none. If unspecified, no reasoning effort budget is enforced.
    • In KoboldAI Lite, this control can be found in Settings > Tokens > Thinking > Reasoning Effort
    • If you're using a third party frontend, it should be settable from their settings as reasoning_effort is a known payload field name.
    • Otherwise, you can also set it manually from KoboldCpp by passing it from the launcher as a Default Param, e.g. --gendefaults '{\"reasoning_effort\":\"minimal\"}', or by setting Default Params to '{"reasoning_effort":"minimal"}' in the GUI launcher.
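As a rough illustration of the API usage described above, here is a minimal sketch of building such a payload in Python. The helper name and validation are mine; only the reasoning_effort field and its accepted values come from the notes, and the endpoint path follows the usual OpenAI-style convention.

```python
import json

# Allowed reasoning budget values per the release notes.
VALID_EFFORTS = {"high", "medium", "low", "minimal", "none"}

def build_chat_payload(messages, reasoning_effort=None):
    """Build an OpenAI-style chat completions payload.

    reasoning_effort is the KoboldCpp field described above; if omitted,
    no reasoning budget is enforced by the server.
    """
    payload = {"messages": messages}
    if reasoning_effort is not None:
        if reasoning_effort not in VALID_EFFORTS:
            raise ValueError(f"unsupported reasoning_effort: {reasoning_effort}")
        payload["reasoning_effort"] = reasoning_effort
    return json.dumps(payload)

# Example: request minimal thinking from a reasoning model. The body
# would then be POSTed to the server's chat completions endpoint,
# e.g. http://localhost:5001/v1/chat/completions
body = build_chat_payload(
    [{"role": "user", "content": "Summarize SWA in one sentence."}],
    reasoning_effort="minimal",
)
```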
  • NEW: Added the --swapadding parameter - If you want to use SWA but find the SWA window too small, this lets you extend the SWA context by a specified number of tokens while still keeping a relatively small KV memory footprint.
  • NEW: Added support for q5_1 KV cache (Breaking Change) - Now you should specify --quantkv with the cache type instead, e.g. --quantkv q5_1. Valid values are f16/bf16/q8_0/q5_1/q4_0. The old single digit values are considered deprecated, avoid using them.
  • NEW: Streaming now works along with Jinja tool calling when using --jinjatools.
  • Fixed a potential incoherent state when attempting to rewind too far while SWA is enabled. If you had weird outputs with both FastForward and SWA enabled, this might fix it. If not, disable one of them or increase SWA padding.
  • Added --baseconfig, allowing a base config to be pre-loaded on every model swap. The config will be merged with the config you are attempting to load. This can be overridden by passing a baseconfig parameter over the /api/admin/reload_config API.
  • Added --image-min-tokens and --image-max-tokens flags to allow setting min/max vision tokens for gemma4, similar to llama.cpp, credits @pi6am
  • Gemma4 E4B and E2B now support audio inputs.
  • Added --jinjatemplate / --chat-template-file - This allows you to replace the Jinja template in your model with a custom template.
  • Increased multiuser default from 7 to 10.
  • Autoswap: fixed some edge conditions
  • The post-generate summary now includes the processed token count.
  • Fixed /api/extra/tokencount; it now also accepts input as OpenAI messages instead of a raw prompt, and will return the compiled prompt.
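A small sketch of the two input shapes the token count endpoint now accepts, assuming the conventional prompt/messages field names; this only builds the request bodies and does not contact a server.

```python
import json

# Two request shapes for the token counting endpoint, per the notes above:
# a raw prompt string, or an OpenAI-style messages list (the server
# compiles the messages into a prompt and returns it with the count).
raw_payload = {"prompt": "Once upon a time"}
chat_payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Once upon a time"},
    ]
}

# Either body would be POSTed to http://localhost:5001/api/extra/tokencount
raw_body = json.dumps(raw_payload)
chat_body = json.dumps(chat_payload)
```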
  • Improvements to image handling in chat completions.
  • Fixed a crash with very large --preloadstory
  • Multiple Jinja tool calling fixes and improvements.
    • Jinja tool calling improved for GPT-OSS, Qwen3, Qwen3.5, GLM models and Gemma4 models.
    • If you notice any tool call parsing issues with a model, please report them.
    • Reminder: Use --jinjatools to enable the Jinja template for tool calling (better quality). With that, tool calling should work optimally.
  • Adjusted gemma4 fallback handling and added handling for new gemma4 templates
  • Updated image generation from upstream, with fixes to sampler handling by @wbruna
  • Chunk qwen3tts inputs longer than 1024 frames into multiple batches. This should allow for longer Qwen3TTS generation lengths.
  • Updated Kobold Lite, multiple fixes and improvements, especially with thinking rendering.
  • Merged fixes, new model support, and improvements from upstream

Hotfix 1.112.1 - Fixed vision max/min when one param is missing, fixed wrong processing count, updated lite, updated colab, updated from upstream, fixed router mode

Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Newer rolling experimental builds can be found at https://github.com/LostRuins/koboldcpp/releases/tag/rolling; these are auto-updated and may be unstable.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at http://localhost:5001 (or use the full KoboldAI client).

For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.

koboldcpp-1.111.2

03 Apr 18:13

koboldcpp-1.111.2

  • Gemma 4 models are now supported. Note that Gemma 4 is very format sensitive, using the wrong format will likely cause bad outputs. If you want to ensure the correct format is used (chat completions mode), you can use it with --jinja. Otherwise, the default AutoGuess template will be non-thinking by default. Vision is supported.
    • Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
    • If running inside KoboldAI Lite, the Separate end tags toggle is recommended, or use Jinja. Also, it really only works well in Instruct mode.
    • Upstream llama.cpp forces SWA by default for this model. Here, you can optionally enable it with --useswa. While we give you this flexibility, the model uses significantly less VRAM when SWA is enabled.
  • This release contains the TurboQuant-inspired activation rotations from upstream. There is no option to toggle this; it's automatically used in the relevant scenarios (like upstream).
  • NEW: Qwen3 TTS CustomVoice and VoiceDesign are now supported! This allows for creation of narration with instructions describing the voices.
    • Download Q3TTS VoiceDesign and the wav tokenizer.
    • When creating your TTS prompt, add instructions at the start in square brackets: e.g. [A depressed woman is crying and sobbing] I want to go home!
    • Alternatively, use the included musicui at http://localhost:5001/musicui and select TTS tab.
    • Implementation referenced from @paoletto qwen3tts fork.
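The bracketed-instruction convention above can be wrapped in a tiny helper; the function name is hypothetical, and only the prompt format comes from the notes.

```python
def tts_prompt(text, voice_description=None):
    """Build a Qwen3TTS VoiceDesign prompt.

    Per the notes above, an optional instruction describing the voice
    goes in square brackets at the start of the prompt.
    """
    if voice_description:
        return f"[{voice_description}] {text}"
    return text

prompt = tts_prompt("I want to go home!",
                    "A depressed woman is crying and sobbing")
# prompt == "[A depressed woman is crying and sobbing] I want to go home!"
```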
  • NEW: Added basic /v1/responses and /v1/messages compatibility API support.
  • Added fixes for Jinja based tool calling, which should work for more models now. Enable with --jinjatools. If not enabled, universal tool calling is used instead.
  • Added support for BF16 KV type, select it with --quantkv 3 or in the GUI launcher.
  • Added a non-thinking AutoGuess template
  • Added config overwriting for admin mode, you can now specify 2 config files on admin API reload (one base and one target) and koboldcpp will combine them both before switching.
  • Fixed jinja prefills for chat completions
  • Added support for Jinja chat template Kwargs like in llama.cpp, use --jinja-kwargs or --chat-template-kwargs just like in llama.cpp e.g. --chat-template-kwargs '{"enable_thinking":false}'
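Since hand-quoting JSON on the command line is error-prone, one way to produce the flag's argument is to let json.dumps build it; the launch command shown here is illustrative.

```python
import json

# Kwargs forwarded into the model's Jinja chat template at launch.
kwargs = {"enable_thinking": False}

# json.dumps emits exactly the JSON string the flag expects, which
# avoids hand-written shell quoting mistakes.
arg = json.dumps(kwargs)
cmd = ["koboldcpp", "--chat-template-kwargs", arg]
```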
  • When using pipeline parallel, the logical batch size is doubled (the physical batch size is unchanged); this improves performance on multi-GPU setups.
  • Added a section for popular community models in the help button menu. If you wish to contribute a suggestion, please prepare a .kcppt template and submit it to the Discussions page.
  • Added --autoswap functionality, when running multi-feature configs e.g. Text+Images+Music, this allows swapping the currently loaded feature on and off, for each request type, saving VRAM. Requires router mode enabled. (credits: @esolithe)
  • Credentials can now be optionally supplied by environment variables KCPP_ADMINPASSWORD and KCPP_PASSWORD during launch from command line (thanks @shoaib42)
  • Image Gen: Added --sdmaingpu allowing image models to be independently placed on any gpu
  • Image Gen: ESRGAN passthrough added, upscale-only mode can be done with img2img and denoise 0.0 with 1 step
  • Image Gen: Return metadata and upstream updates by @wbruna
  • Music Gen: Fixed stop tokens by @dysangel
  • Music Gen: Added planner mode that uses your main LLM to generate better lyrics instead, toggle in musicUI advanced settings.
  • Music Gen: Added API key support
  • TTS Gen: Allow embedded music UI to do both music and TTS generations (2 tabs)
  • Fixes for Colab including image gen model templates in the template dropdown
  • Fixed CPU incorrect selection in OldCPU mode.
  • WSL socket timeout fix, thanks @scottf007
  • Router mode can now auto-wake a few other endpoints if put to sleep by auto-unload
  • Increased max vision image limit
  • Increased the GUI launcher's max context size slider limit
  • Breaking Change: Detected thinking content is now sent via reasoning_content instead of content over the chat completions API, to align with most other providers. To disable this behavior, set encapsulate_thinking to false in your request.
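A client-side sketch of adapting to this breaking change, assuming the standard OpenAI-style response shape; the helper and mock response are mine.

```python
def extract_parts(response):
    """Split a chat completions response into (reasoning, answer).

    As of this release, detected thinking arrives in reasoning_content
    rather than inline in content, matching most other providers.
    """
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content"), msg.get("content")

# Mock response in the new format (shape follows the OpenAI convention):
mock = {"choices": [{"message": {
    "reasoning_content": "The user wants a greeting.",
    "content": "Hello!",
}}]}
thinking, answer = extract_parts(mock)
```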
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes, new model support, and improvements from upstream

Hotfix 1.111.1:

  • Fixed -ncmoe aka --moecpu layers for Gemma 4
  • Autoswap fixes from @esolithe
  • Gemma 4 Reasoning content visibility fix for chat completions
  • Updated gemma4 autoguess templates, updated lite, BOS fix for gemma 4.
  • Multiple fixes to Gemma 4 handling. It should work a lot better in most cases now, though note that it's still very format sensitive. I have received many requests on how to get it to work with both thinking and non-thinking in SillyTavern, so here is a simple guide.

Extra clarification since this is frequently asked about: for KoboldCpp, the -np 1 option is not needed. If you have a large KV cache on KoboldCpp versus other solutions, it is likely because you did not enable SWA. We give you the freedom to have it disabled by default so that Context Shift can work, but if you'd like efficiency with Gemma4, you must turn this option on.

Hotfix 1.111.2 - Added fallback formatting from @henk717 to support other instruct formats for gemma 4, make autofit more accurate if SWA is enabled, updated lite to handle gemma4 thinking, fixed a chat completions wrapper bug that failed to match think tokens.



koboldcpp-1.110

19 Mar 08:02

koboldcpp-1.110

KoboldCpp 3 Year Anniversary Edition

  • NEW: OpenAI Compatible Router Mode - Automatic model and config hotswapping is finally available in the OpenAI Compatible API. Note that this functions differently from the llama.cpp version; it's more like llama-swap, allowing you to perform full config reloads similar to the existing admin endpoint, but also within the existing request and response via a reverse proxy. Requires admin mode enabled. Enable it with --routermode. Streaming is supported with a small delay.
    • Model swapping now has an extra option "initial_model" which was the model that was originally loaded.
  • NEW: Auto Unload Timeout - Unloads the existing config (unloading all models) after a specified number of seconds. Works best with router mode to allow for auto reloading. Can also manually reload with admin endpoint.
  • NEW: Qwen3TTS now supports the 1.7B TTS model, with even better voice quality and voice cloning!
  • AceStep 1.5 Music Generation Improvements: Better quality, Reference Audio uploads are now supported, Mp3 outputs (ported from @ServeurpersoCom's MIT mp3 implementation), better LM defaults allowing audio-code generation to work better, and stereo output is now the default. Recommended .kcppt template for 6GB users
  • Qwen3TTS loader improvements, supports --ttsgpu toggle, vulkan speed improvements for Qwen3TTS (cuda is still slow)
  • NEW: Improved Ollama Emulation - Can now handle requests from endpoints that only support streaming (buffers responses). However, OpenAI endpoint is still recommended if supported.
  • New: Multiple dynamic LoRAs: --sdlora now supports specifying directories as well. All the image LoRAs there will be loadable at runtime by using the LoRA syntax in your image generation prompt, in the form <lora:filename:multiplier>. Also merged multiple fixes and updates from upstream, including an optional cache mode. Big thanks to @wbruna for the contributions.
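For illustration, here is one way a client could recognize the <lora:filename:multiplier> syntax; the server's actual parsing rules may differ, so treat this as a sketch of the tag format only.

```python
import re

# Matches tags of the form <lora:filename:multiplier>
LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9.]+)>")

def split_loras(prompt):
    """Extract <lora:filename:multiplier> tags from an image prompt.

    Returns the prompt with tags removed plus a list of (name, weight).
    Illustrative only; KoboldCpp's own parsing may differ in details.
    """
    loras = [(m.group(1), float(m.group(2)))
             for m in LORA_TAG.finditer(prompt)]
    clean = LORA_TAG.sub("", prompt).strip()
    return clean, loras

clean, loras = split_loras("a castle at dusk <lora:watercolor:0.8>")
```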
  • NEW: Revamped Colab Cloud Notebook: The official KoboldCpp colab notebook has been updated and reworked. Music generation is now enabled, and image gen and text gen can now be used separately.
  • MCP improvements: Added notification support, now can handle simultaneous STDIO requests and request with multiple parts.
  • Adjustments to forcing --autofit; it now disables moecpu and overridetensors automatically if used together.
  • Disable smartcache if slots is zero. Improved smartcache snapshot logic to use conserve slots.
  • Added a warning that RNN models currently do not support the anti-slop sampler.
  • Fixed some single token phrase bans not registering
  • OpenAI compatible endpoints now have dynamic IDs and reflect token usage accurately (thanks @gustrd)
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes, new model support, and improvements from upstream, including Nemotron support and Qwen3.5 improvements.


koboldcpp-1.109.2

03 Mar 11:50

koboldcpp-1.109.2

  • SmartCache improvements - SmartCache should work better for RNN/hybrid models like Qwen 3.5 now. Additionally, smartcache is automatically enabled when using such models for a smoother experience, unless fast forwarding is disabled.
  • NEW: Added experimental support for Music Generation via Ace Step - KoboldCpp now optionally supports generating music natively in as little as 4GB of VRAM, thanks to @ServeurpersoCom's acestep.cpp.
    • Requires 4 files (AceStep LM, diffusion, embedder and VAE), which are found at https://huggingface.co/music/tree/main under koboldcpp. For your convenience, we made templates for 6GB of VRAM (recommended option, 1.7B LM). You can also try alternative templates for 4GB of VRAM here, as well as 4B variants for 6GB (tight fit, not recommended as a first option), 8GB, and 10GB of VRAM.
    • When used, a brand new UI has been added at http://localhost:5001/musicui
    • New CLI args added --musicllm --musicdiffusion --musicembeddings --musicvae and --musiclowvram
    • To keep KoboldCpp lightweight, our implementation re-uses the existing GGML libraries from llama.cpp; we are currently waiting on ace-step.cpp to upstream its GGML improvements.
    • As usual, the ace-step specific backend components are only loaded if you are trying to load a music generation model; if you only use KoboldCpp for text generation, this addition does not impact your performance or memory usage.
  • NEW: Added Qwen3-TTS support with high quality voice cloning - Support for qwen3tts has finally been added from @predict-woo's qwen3-tts.cpp. This allows for high quality voice cloning at the level of XTTS, and much better than OuteTTS.
    • You'll need the TTS model and the qwen3tts tokenizer, remember to also specify a TTS directory if you want to use voice cloning.
    • Specify a directory of short voice audio samples (.mp3 or .wav) with --ttsdir, you'll be able to use TTS narration with those voices.
    • For the fastest generation speed use Vulkan.
  • Fix follow-up tool call check with assistant prefills
  • Fixed image importing in SDUI
  • Config packing improvements as a minor sd.cpp update from @wbruna
  • Fixed a wav header packing issue that could cause a click in output audio
  • Relaxed size restrictions in image gen; also supports high-res reference images.
  • --admindir now also indexes subdirectories up to 1 level deep.
  • Show timestamps when image gen is completed
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes, new model support, and improvements from upstream

Hotfix 1.109.1 - Optimize the batch splitting for RNN models, fixed moecpu interaction with autofit, added music gen documentation, fixed a SSE streaming repeat bug in chat completions

Hotfix 1.109.2 - Added support for importing SillyTavern JSONL exports, fixed Qwen3TTS to allow running without GPU unless TTS GPU selected, added ComfyUI auth token support (from @RubenGarcia)


koboldcpp-1.108.2

18 Feb 09:06

koboldcpp-1.108.2

  • Try to fix broken pipe errors due to timeouts during long tool calls
  • Updated SDUI, added toggle to send img2img as a reference.
  • Added ollama /api/show endpoint emulation
  • Try to fix autofit on rocm going oom
  • Improved MCP behavior with multipart content
  • Prevent swapping config at runtime from changing the download directory
  • Adjust GUI for fractional scaling
  • Fixed incorrect output filename paths in some cases
  • Aligned handling of common think tags with the llama.cpp UI.
  • --autofit mode now hides the GUI layers selector
  • Fixed extra spam from autofit mode
  • Autofit toggle is now in the Quick Launch menu
  • Autofit is now triggered if -1 gpulayers (default) is selected and tensor splits or tensor overrides are not set. Setting your own GPU layers overrides this behavior
  • Now allow Image Gen soft limit to be overridden to 2048x2048 if user chooses. Note that this may crash if you don't know what you're doing.
  • Updated upstream stable-diffusion.cpp by @wbruna
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes, new model support, and improvements from upstream

Hotfix 1.108.1 - Fix DPI handling issues, fixed wrong backend selected in some cases, added support for loading multiple image LoRAs
Hotfix 1.108.2 - Fixed OuteTTS broken audio, fixed cuda graph memory leak


koboldcpp-1.107.3

31 Jan 04:53

koboldcpp-1.107.3

down comes the claw edition

  • Added a new option for Vulkan (Older PC) in the oldpc builds. This provides GPU support via Vulkan without any CPU intrinsics (no AVX2, no AVX). This replaces the removed CLBlast options.
  • Breaking Changes:
    • Pipeline parallel is enabled by default now in CLI. Disable it in the launcher or with --nopipelineparallel
    • Flash attention is enabled by default now in CLI. Disable it in the launcher or with --noflashattention
  • Added a few fixes for GLM 4.7 Flash. Note that this model is extremely sensitive to rep-pen, recommend disabling rep pen when using it. Make sure you use a fixed gguf model as some early quants were broken. It may be helpful to use the GLM4.5 NoThink template, or enable forced thinking if you desire it.
  • Fixes for mcp.json importing and MCP tool listing handshake (thanks @Rose22)
  • Changed MCP user agent string as some sites were blocking it.
  • Added the fractional scaling workaround fix for the GUI launcher for KDE on Wayland.
  • Added support for SDXS, a really fast Stable Diffusion image generation model. This model is so fast that it can generate images on pure CPU in under 10 seconds on a Raspberry Pi; running it on GPU allows generating images in under half a second. An excellent way to get image generation if you do not have a GPU. For convenience, a GGUF quant of SDXS is provided here.
  • Added support for ESRGAN 4x upscaler. Load this as an upscaler model to be able to upscale your generated images.
  • Merged Image Gen improvements and Flux Klein model support from upstream (thanks @wbruna). Get Flux Klein's image model, VAE and text encoder.
  • Added TAE SD support for Flux2, enable with --sdvaeauto.
  • Increase image generation hard total resolution limit from 1 megapixel to 1.6 megapixels.
  • Updated SDUI with some quality of life fixes by @Riztard
  • Updated Kobold Lite, multiple fixes and improvements
    • Added even more themes from @Rose22
    • Added experimental TTS chunked streaming mode (works for all TTS APIs)
    • Added customizable sampler presets from @lubumbax
    • Removed manual admin state caching panel since it's made obsolete by --smartcache. The API still exists but should be unnecessary.
  • Merged fixes, model support, and improvements from upstream, including a Vulkan speedup from occam's coopmat1 optimization. Coopmat1 is used by GPUs with matrix cores, such as the AMD 7000 and 9000 series GPUs.

Important Notice: The CLBlast backend is fully deprecated and has been REMOVED as of this version. If you require CLBlast, you will need to use an earlier version.

Hotfix 1.107.1 - SDUI improvements, Flux2 Image Editing support, MCP cert validation fixes, KDE scaling fix, Z-Image cfg clamp increased, reduce cuda graph spam, updated lite with minor refactors.

Hotfix 1.107.2 - This was grouped into a hotfix as 1.107.1 was unstable. Though this release is larger and out-of-band, you're encouraged to update to it from 1.107/1.107.1 for stability reasons. Barring unforeseen circumstances, the next major release will likely be delayed.

  • Scaling fixes for some linux desktops
  • Updated SDUI and sdcpp
  • Template parser fix from @Reithan
  • Added "error" as a possible stop reason (e.g. backend failed to generate).
  • Fixed SSE parsing in MCP
  • Added GLM4.7-NoThink adapter template
  • NEW: Reworked newbie help menu, added simple configs they can use
  • NEW: Added optional --downloaddir to specify where model downloads are stored for URL references.
  • Fixed GLM4 and GLM4.7 Flash coherency after shifting issues, ref ggml-org#19292

Hotfix 1.107.3 - Made on special request, merging upstream support for Step 3.5 Flash and Kimi Linear


koboldcpp-1.106.2

17 Jan 04:33

koboldcpp-1.106.2

MCP for the masses edition

  • NEW: MCP Server and Client Support Added to KoboldCpp - KoboldCpp now supports running an MCP bridge that serves as a direct drop-in replacement for Claude Desktop.
    • KoboldCpp can connect to any HTTP or STDIO MCP server, using a mcp.json config format compatible with Claude Desktop.
    • Multiple servers are supported; KoboldCpp will automatically combine their tools and dispatch requests appropriately.
    • Recommended guide for MCP newbies: Here is a simple guide on running a Filesystem MCP Server to let your AI browse files locally on your PC and search the web - https://github.com/LostRuins/koboldcpp/wiki#mcp-tool-calling
    • CAUTION: Running ANY MCP SERVER gives it full access to your system. Their 3rd party scripts will be able to modify and make changes to your files. Be sure to only run servers you trust!
    • The example music playing MCP server used in the screenshot above was this audio-player-mcp
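For reference, a minimal mcp.json in the Claude Desktop format described above might look like this; the server name, package, and path are examples only, not files KoboldCpp ships.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/user/docs"]
    }
  }
}
```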
  • Flash Attention is now enabled by default when using the GUI launcher.
  • Improvements to tool parsing (thanks @AdamJ8)
  • API field continue_assistant_turn is now enabled by default in all chat completions (assistant prefill)
  • Interrogate image max length increased
  • Various StableUI fixes by @Riztard
  • Using the environment variable GGML_VK_VISIBLE_DEVICES externally now always overrides whatever Vulkan device settings are set from KoboldCpp.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Full settings UI overhaul from @Rose22, the settings menu is now much cleaner and more organized. Feedback welcome!
    • NEW: Added 4 new OLED themes from @Rose22
    • Improved performance when editing massive texts
    • General cleanup and multiple minor adjustments
    • Browser MCP implementation adapted from @ycros simple-mcp-client
  • Merged fixes, model support, and improvements from upstream

Hotfix 1.106.1 - Allow overriding selected GPU devices directly with --device e.g. --device Vulkan0,Vulkan1, Updated lite
Hotfix 1.106.2 - Increase logprobs from 5 to 10, fixed memory usage with embeddings, allow device override to be set in gui (thanks @pi6am)

Important Notice: The CLBlast backend may be removed soon, as it is very outdated and no longer receives any updates, fixes or improvements. It can be considered superseded by the Vulkan backend. If you have concerns, please join the discussion here.


koboldcpp-1.105.4

02 Jan 05:21

koboldcpp-1.105.4

new year edition

  • NEW: Added --gendefaults, accepts a JSON dictionary where you can specify any API fields to append or overwrite (e.g. step count, temperature, top_k) on incoming payloads. Incoming API payloads will have this modification applied. This can be useful when using frontends that don't behave well, as you will be able to override or correct whatever fields they send to koboldcpp.
    • Note: this marks the horde worker with a debug flag if used on AI Horde.
    • --sdgendefaults has been deprecated and merged into this flag
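A rough sketch of the append-or-overwrite behavior described above; KoboldCpp's actual merge logic may differ, and the helper here simply lets the defaults take precedence.

```python
import json

def apply_gendefaults(payload, gendefaults):
    """Overlay --gendefaults fields onto an incoming API payload.

    Fields from the defaults are added to, or overwrite, whatever the
    frontend sent. Whether the real server ever lets the client win for
    some fields is not specified in the notes, so this sketch simply
    gives the defaults precedence.
    """
    merged = dict(payload)
    merged.update(gendefaults)
    return merged

# e.g. launched with --gendefaults '{"temperature": 0.7, "top_k": 40}'
defaults = json.loads('{"temperature": 0.7, "top_k": 40}')
incoming = {"prompt": "Hello", "temperature": 1.2}
final = apply_gendefaults(incoming, defaults)
```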
  • Added support for a new "Adaptive-P" sampler by @MrJackSpade, a sampler that allows selecting lower probability tokens. Recommended to use together with min-P. Configure with adaptive target and adaptive decay parameters. This sampler may be subject to change in future.
  • StableUI SDUI: Fixed generation queue stacking, allowed requesting AVI formatted videos (enable in settings first), added a dismiss button, various small tweaks
  • Minor fixes to tool calling
  • Added support for Ovis Image and the new Qwen Image Edit. Also added TAEHV support for the WAN VAE (usable with Wan2.2 videos and Qwen Image/Qwen Image Edit; simply enable the "TAE SD" checkbox or --sdvaeauto to greatly save memory). Thanks @wbruna for the sync.
  • Fixed LoRA loading issues with some Qwen Image LoRAs
  • --autofit now allocates some extra space if used with multiple models (image gen, embeddings etc)
  • Improved snapshotting logic with --smartcache for RNN models.
  • Attempted to fix tk scaling on some systems.
  • Renamed KCPP launcher's Tokens tab to Context, moved Flash Attention toggle into hardware tab
  • Updated Kobold Lite, multiple fixes and improvements
    • Added support for using remote HTTP MCP servers for tool calling. KoboldCpp-based MCP may be added at a later date.
  • Merged fixes, model support, and improvements from upstream
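The --gendefaults merge described above can be pictured as a simple dictionary overlay onto each incoming request. The following is a minimal Python sketch of that behavior, not koboldcpp's actual implementation; the function name and merge details are illustrative assumptions.

```python
import json

def apply_gendefaults(payload: dict, gendefaults_json: str) -> dict:
    """Overlay server-side defaults onto an incoming API payload.

    Fields in the defaults dict append to or overwrite whatever the
    frontend sent, matching the --gendefaults behavior described above.
    """
    defaults = json.loads(gendefaults_json)
    merged = dict(payload)   # copy, so the incoming payload is untouched
    merged.update(defaults)  # defaults win over incoming fields
    return merged

# A misbehaving frontend sends temperature 2.0; the default corrects it.
incoming = {"prompt": "Hello", "temperature": 2.0}
fixed = apply_gendefaults(incoming, '{"temperature":0.7,"top_k":40}')
```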

Hotfix 1.105.1 - Allow configuring number of smartcache slots, updated lite + SDUI, handle tool calling images from remote MCP responses.
Hotfix 1.105.2 - Fixed various minor bugs, and allowed transcribe to be used with an LLM that has an audio projector.
Hotfix 1.105.3 - Merged fix for CUDA MoE CPU regression
Hotfix 1.105.4 - Merged vulkan glm4.6 fix


koboldcpp-1.104

20 Dec 09:07


calm before the storm edition

  • NEW: Added --smartcache adapted from @Pento95: This is a 2-in-1 dynamic caching solution that intelligently creates KV state snapshots automatically. Read more here
    • This will greatly speed up performance when different contexts are swapped back to back (e.g. hosting on AI Horde or shared instances).
    • Also allows snapshotting when used with an RNN or Hybrid model (e.g. Qwen3Next, RWKV), which avoids having to reprocess everything.
    • Reuses the KV save/load states from admin mode. Max number of KV states increased to 6.
  • NEW: Added the --autofit flag, which utilizes upstream's automatic GPU fitting (-fit) behavior from ggml-org#16653. Note that this flag overwrites all your manual layer configs and tensor overrides and is not guaranteed to work. However, it can provide a better automatic fit in some cases. It will not be accurate if you load multiple models, e.g. image gen.
  • Pipeline parallelism is no longer the default; it is now a flag you can enable with --pipelineparallel. This only affects multi-GPU setups, offering faster speed at the cost of memory usage.
  • Key Improvement - Vision Bugfix: A bug in mrope position handling has been fixed, which improves vision models like Qwen3-VL. You should now see much better visual accuracy in some multimodal models compared to earlier koboldcpp versions. If you previously had issues with hallucinated text or numbers, it should be much better now.
  • Increased default gen amount from 768 to 896.
  • Deprecated obsolete --forceversion flag.
  • Fixed safetensors loading for Z-Image
  • Fixed image importer in SDUI
  • Capped cfg_scale to max 3.0 for Z-Image to avoid blurry gens. If you want to override this, set remove_limits to 1 in your payload or inside --sdgendefaults.
  • Removed cc7.0 as a CUDA build target, Volta (V100) will fall back to PTX from cc6.1
  • Tweaked branding in llama.cpp UI to make it clear it's not llama.cpp
  • Added indentation to .kcpps configs
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream
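The snapshot-reuse idea behind --smartcache can be sketched as a small pool of saved KV states keyed by context prefix. This is an illustrative Python sketch only; the class name, FIFO eviction policy, and slot handling are assumptions, not koboldcpp's actual code (beyond the 6-slot maximum mentioned above).

```python
# Illustrative sketch of the smartcache idea: a small pool of saved KV
# states keyed by context prefix, so swapping between conversations can
# restore a snapshot instead of reprocessing the whole context.
class SmartCache:
    def __init__(self, max_slots: int = 6):  # 6 = max KV states per the notes
        self.slots: dict[str, object] = {}
        self.max_slots = max_slots

    def save(self, prefix: str, kv_state: object) -> None:
        if prefix not in self.slots and len(self.slots) >= self.max_slots:
            # Evict the oldest snapshot to make room (assumed FIFO policy).
            self.slots.pop(next(iter(self.slots)))
        self.slots[prefix] = kv_state

    def restore(self, prefix: str):
        # Returns None on a miss, meaning a full reprocess is needed.
        return self.slots.get(prefix)
```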


koboldcpp-1.103

04 Dec 12:29


  • NEW: Added support for Flux2 and Z-Image Turbo! Another big thanks to @leejet for the sd.cpp implementation and @wbruna for the assistance with testing and merging.
    • To obtain models for Z-Image (Most recommended, lightweight):
    • To obtain models for Flux2 (not recommended, as this model is huge, so I will link the q2k; remember to enable CPU offload. Running anything larger requires a very powerful GPU):
      • Get the Flux 2 Image model here
      • Get the Flux 2 VAE here
      • Get the Flux 2 text encoder here, load this as Clip 1
  • NEW: Mistral and Ministral 3 model support has been merged from upstream.
  • Improved "Assistant Continue" in llama.cpp UI mode, now can be used to continue partial turns.
    • We have added prefill support to chat completions if you have /lcpp in your URL (/lcpp/v1/chat/completions); the regular chat completions endpoint is meant to mimic OpenAI and does not do this. Point your frontend to the URL that best fits your use case. We'd like feedback on which of these you prefer and whether the /lcpp behavior would break an existing use case.
  • Minor tool calling fix to avoid passing base64 media strings into the tool call.
  • Tweaked resizing behavior of the launcher UI.
  • Added a secondary terminal UI to view the console logging (only for Linux), can be used even when not launched from CLI. Launch this auxiliary terminal from the Extras tab.
  • AutoGuess Template fixes for GPT-OSS and Kimi
  • Fixed a bug with --showgui mode being saved into some configs
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream
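As an illustration of the /lcpp prefill behavior mentioned above, a chat-completions payload whose final message is a partial assistant turn would look like the sketch below. The helper function is hypothetical; the point is only the message shape, where the /lcpp/v1/chat/completions route treats the trailing assistant content as a prefix to continue.

```python
def build_prefill_payload(user_msg: str, prefill: str) -> dict:
    """Build a chat-completions payload ending in a partial assistant turn.

    On the /lcpp route, the trailing assistant content is treated as a
    prefix to be continued rather than a completed turn.
    """
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": prefill},  # to be continued
        ],
        "max_tokens": 128,
    }
```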
