feat(llama-cpp): expose 12 missing common_params via options[] #9814

Open
localai-bot wants to merge 1 commit into master from feat/llama-cpp-options-coverage

Conversation

@localai-bot
Collaborator

Summary

A coverage audit of params_parse in backend/cpp/llama-cpp/grpc-server.cpp against upstream llama.cpp (pin 7f3f843c) flagged 12 user-visible common_params knobs that were neither set via the typed ModelOptions proto fields nor reachable through the existing options: array. This PR plumbs them through, following the file's existing patterns.
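
For orientation, each entry in the options: array is a key:value string that params_parse maps onto a common_params field. A minimal sketch of the format (the model name, file, and option values are placeholders, and the surrounding layout assumes the usual LocalAI model-config YAML shape):

```yaml
# Sketch of a LocalAI model config using the free-form options: array.
# Each entry is a key:value string handled by params_parse in grpc-server.cpp.
name: example-model           # placeholder
backend: llama-cpp
parameters:
  model: example-model.gguf   # placeholder GGUF file
options:
- n_ubatch:1024               # newly exposed keys, detailed below
- direct_io:true
```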

Newly accepted keys for the options: array (each group below is followed by a short usage sketch):

Top-level / batching / IO

  • n_ubatch (alias ubatch) — physical batch size; previously force-aliased to n_batch, blocking embedding/rerank workloads needing independent control
  • threads_batch (alias n_threads_batch) — main-model batch threads, mirrors the existing draft_threads_batch
  • direct_io (alias use_direct_io) — O_DIRECT model loads
  • verbosity — llama.cpp log threshold (was commented out at line 479)
  • override_tensor (alias tensor_buft_overrides) — per-tensor buffer overrides for the main model, mirrors draft_override_tensor
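
A hedged sketch of the batching/IO group in use (the values are illustrative, and the override_tensor value assumes the same pattern=buffer-type syntax as llama.cpp's --override-tensor flag):

```yaml
options:
- n_ubatch:1024              # physical batch size, now decoupled from n_batch
- threads_batch:8            # batch threads for the main model
- direct_io:true             # O_DIRECT model loads
- verbosity:1                # llama.cpp log threshold
- override_tensor:exps=CPU   # assumed pattern=buffer syntax, mirrors draft_override_tensor
```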

Embedding / multimodal

  • pooling_type (alias pooling) — mean/cls/last/rank/none for embeddings (was only auto-flipped to RANK for rerankers)
  • embd_normalize (alias embedding_normalize) — -1/0/1/2/>2; the embedding handler also now reads params_base.embd_normalize instead of a hardcoded 2
  • mmproj_use_gpu (alias mmproj_offload) — keep mmproj on CPU
  • image_min_tokens / image_max_tokens — per-image vision token budget
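
A hedged sketch of the embedding/multimodal group (the token budgets are illustrative numbers, not recommendations):

```yaml
options:
- pooling_type:mean       # mean/cls/last/rank/none
- embd_normalize:0        # -1/0/1/2/>2, per the list above
- mmproj_use_gpu:false    # keep the projector on CPU
- image_min_tokens:64     # illustrative per-image vision token budget
- image_max_tokens:1024
```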

Reasoning surface (audit-confirmed gaps)

LocalAI's existing ReasoningConfig.DisableReasoning only feeds the per-request chat_template_kwargs.enable_thinking body field at request time; it does not touch the load-time params.reasoning_format parser or params.enable_reasoning budget. Those remained unwired until this PR.

  • reasoning_format — none/auto/deepseek/deepseek-legacy parser for <think> blocks (default: deepseek)
  • enable_reasoning (alias reasoning_budget) — -1 unlimited / 0 off / >0 token budget
  • prefill_assistant — toggle the trailing-assistant-message prefill in the chat template
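
A hedged sketch of the reasoning group, matching the test-plan spot-check below:

```yaml
options:
- reasoning_format:deepseek   # parse <think> blocks
- enable_reasoning:-1         # unlimited thinking budget
- prefill_assistant:false     # skip the trailing-assistant-message prefill
```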

Compatibility

All 14 referenced fields/types exist on both the upstream pin (7f3f843c) and on the TheTom/llama-cpp-turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard is needed.

Docs

docs/content/advanced/model-configuration.md gets four new subsections:

  • Reasoning Models (load-time options + a note clarifying the per-request enable_thinking distinction)
  • Multimodal Backend Options
  • Embedding & Reranking Backend Options
  • Other Backend Tuning Options

The Speculative Type Values table is also refreshed to show the new dash-separated canonical names from upstream ggml-org/llama.cpp#22964 alongside the underscore aliases LocalAI still accepts.

Test plan

  • tests-llama-cpp-grpc CI job builds the backend cleanly (compile gate)
  • Spot-check a model config with options: [reasoning_format:deepseek, enable_reasoning:-1] against a DeepSeek-R1 GGUF and confirm reasoning blocks parse correctly (full config sketch after this list)
  • Spot-check a vision model with options: [mmproj_use_gpu:false] and confirm the projector loads on CPU (GPU memory usage drops)
  • Spot-check an embedding model with options: [pooling_type:mean, embd_normalize:0] and confirm the embedding shape/normalization changes
  • Smoke-test options: [n_ubatch:1024] on a rerank workload and confirm we no longer hit the "input is too large to process" error from the 512 n_ubatch cap
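
A fuller sketch of the reasoning spot-check config (the name and model path are placeholders; the backend name and YAML layout are assumed):

```yaml
name: deepseek-r1-spotcheck   # placeholder name
backend: llama-cpp
parameters:
  model: deepseek-r1.gguf     # placeholder GGUF path
options:
- reasoning_format:deepseek
- enable_reasoning:-1
```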

The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
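
On the config side, that bool form means any of these spellings should
be equivalent (usage sketch, not exhaustive):

  options:
  - direct_io:true   # equivalently direct_io:1 / direct_io:yes / direct_io:on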

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- accepted values
    -1/0/1/2/>2; the embedding handler now reads
    params_base.embd_normalize instead of the hardcoded 2 in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the three audit-confirmed gaps; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7