feat(llama-cpp): expose 12 missing common_params via options[] #9814
Open
localai-bot wants to merge 1 commit into
Conversation
The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.
Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
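The wiring pattern described above can be sketched as follows. This is a hedged illustration, not the actual grpc-server.cpp code: `params_sketch`, `parse_flag`, and `apply_option` are hypothetical stand-ins for the real `common_params` struct and the if/else chain in `params_parse`, but they follow the stated conventions (try/catch around numeric parses, the true/1/yes/on bool form, hardware_concurrency() fallback for thread counts).

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <thread>

// Hypothetical stand-in for the relevant llama.cpp common_params fields.
struct params_sketch {
    int  n_ubatch        = 512;
    int  n_threads_batch = -1;
    bool use_direct_io   = false;
};

// The true/1/yes/on bool form used elsewhere in the file.
static bool parse_flag(const std::string &v) {
    return v == "true" || v == "1" || v == "yes" || v == "on";
}

// One pass through the option chain; key and value are already split on ':'.
static void apply_option(params_sketch &params,
                         const std::string &key,
                         const std::string &value) {
    if (key == "n_ubatch" || key == "ubatch") {
        // try/catch around the numeric parse, keeping the default on failure
        try { params.n_ubatch = std::stoi(value); } catch (const std::exception &) {}
    } else if (key == "threads_batch" || key == "n_threads_batch") {
        int n = 0;
        try { n = std::stoi(value); } catch (const std::exception &) {}
        // hardware_concurrency() fallback for non-positive thread counts
        params.n_threads_batch =
            n > 0 ? n : (int) std::thread::hardware_concurrency();
    } else if (key == "direct_io" || key == "use_direct_io") {
        params.use_direct_io = parse_flag(value);
    }
}
```

Each of the twelve new keys would add one more branch of this shape to the existing chain.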
Top-level / batching / IO:
- n_ubatch (alias ubatch) -- physical batch size; was previously
force-aliased to n_batch at line 482, blocking embedding/rerank
workloads that need independent control
- threads_batch (alias n_threads_batch) -- main-model batch threads;
mirrors the existing draft_threads_batch
- direct_io (alias use_direct_io) -- O_DIRECT model loads
- verbosity -- llama.cpp log threshold (line 479 had this commented
out)
- override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
overrides for the main model; mirrors draft_override_tensor
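As an illustration, the batching/IO knobs above could be set from a model YAML like the following. The model name, file name, and specific values are hypothetical; the `options: [key:value]` form mirrors the one used in this PR's test plan:

```yaml
# Hypothetical model config; values are examples only.
name: my-model
parameters:
  model: model.gguf
options: [n_ubatch:1024, threads_batch:8, direct_io:true, verbosity:1]
```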
Embedding / multimodal:
- pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
only auto-flipped to RANK for rerankers
- embd_normalize (alias embedding_normalize) -- embedding normalization
  mode; the Embedding() handler now reads params_base.embd_normalize
  instead of the hardcoded 2 it used before
- mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
- image_min_tokens / image_max_tokens -- per-image vision token budget
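A hypothetical embedding-model config exercising the new knobs might look like this (the `pooling_type:mean` / `embd_normalize:0` combination is the one from the test plan; the model name and file are placeholders):

```yaml
# Hypothetical embedder config.
name: my-embedder
embeddings: true
parameters:
  model: embedder.gguf
options: [pooling_type:mean, embd_normalize:0]
```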
Reasoning surface (the three audit-focus knobs; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
- reasoning_format -- none/auto/deepseek/deepseek-legacy parser
- enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
- prefill_assistant -- trailing-assistant-message prefill toggle
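For example, a reasoning-model config wiring these knobs could be written as follows (hypothetical model name and file; the `options:` values match the combination exercised in the test plan):

```yaml
# Hypothetical DeepSeek-R1-style config.
name: deepseek-r1
parameters:
  model: deepseek-r1.gguf
options: [reasoning_format:deepseek, enable_reasoning:-1, prefill_assistant:true]
```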
All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.
Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7
Summary
A coverage audit of `params_parse` in `backend/cpp/llama-cpp/grpc-server.cpp` against upstream llama.cpp (pin `7f3f843c`) flagged 12 user-visible `common_params` knobs that were neither set via the typed `ModelOptions` proto fields nor reachable through the existing `options:` array. This PR plumbs them through, following the file's existing patterns.

Newly accepted `options:` keys

Top-level / batching / IO

- `n_ubatch` (alias `ubatch`) -- physical batch size; previously force-aliased to `n_batch`, blocking embedding/rerank workloads needing independent control
- `threads_batch` (alias `n_threads_batch`) -- main-model batch threads, mirrors the existing `draft_threads_batch`
- `direct_io` (alias `use_direct_io`) -- `O_DIRECT` model loads
- `verbosity` -- llama.cpp log threshold (was commented out at line 479)
- `override_tensor` (alias `tensor_buft_overrides`) -- per-tensor buffer overrides for the main model, mirrors `draft_override_tensor`

Embedding / multimodal

- `pooling_type` (alias `pooling`) -- `mean`/`cls`/`last`/`rank`/`none` for embeddings (was only auto-flipped to RANK for rerankers)
- `embd_normalize` (alias `embedding_normalize`) -- `-1`/`0`/`1`/`2`/`>2`; the embedding handler also now reads `params_base.embd_normalize` instead of a hardcoded `2`
- `mmproj_use_gpu` (alias `mmproj_offload`) -- keep mmproj on CPU
- `image_min_tokens` / `image_max_tokens` -- per-image vision token budget

Reasoning surface (audit-confirmed gaps)

LocalAI's existing `ReasoningConfig.DisableReasoning` only feeds the per-request `chat_template_kwargs.enable_thinking` body field at request time; it does not touch the load-time `params.reasoning_format` parser or `params.enable_reasoning` budget. Those remained unwired until this PR.

- `reasoning_format` -- `none`/`auto`/`deepseek`/`deepseek-legacy` parser for `<think>` blocks (default: `deepseek`)
- `enable_reasoning` (alias `reasoning_budget`) -- `-1` unlimited / `0` off / `>0` token budget
- `prefill_assistant` -- toggle the trailing-assistant-message prefill in the chat template

Compatibility

All 14 referenced fields/types exist on both the upstream pin (`7f3f843c`) and on the `TheTom/llama-cpp-turboquant` fork's `common.h`, so no `LOCALAI_LEGACY_LLAMA_CPP_SPEC` guard is needed.

Docs

`docs/content/advanced/model-configuration.md` gets four new subsections: "Reasoning Models" (covering the per-request `enable_thinking` distinction), "Multimodal Backend Options", "Embedding & Reranking Backend Options", and "Other Backend Tuning Options". The Speculative Type Values table is also refreshed to show the new dash-separated canonical names from upstream ggml-org/llama.cpp#22964 alongside the underscore aliases LocalAI still accepts.

Test plan

- `tests-llama-cpp-grpc` CI job builds the backend cleanly (compile gate)
- `options: [reasoning_format:deepseek, enable_reasoning:-1]` against a DeepSeek-R1 GGUF and confirm reasoning blocks parse correctly
- `options: [mmproj_use_gpu:false]` and confirm the projector loads on CPU (memory drops on the GPU)
- `options: [pooling_type:mean, embd_normalize:0]` and confirm the embedding shape/normalization changes
- `options: [n_ubatch:1024]` on a rerank workload and confirm we no longer hit the `input is too large to process` 512-cap