feat(llama-cpp): expose 12 missing common_params via options[] #9814

Open
localai-bot wants to merge 1 commit into master from feat/llama-cpp-options-coverage

Conversation

@localai-bot
Collaborator

Summary

A coverage audit of params_parse in backend/cpp/llama-cpp/grpc-server.cpp against upstream llama.cpp (pin 7f3f843c) flagged 12 user-visible common_params knobs that were neither set via the typed ModelOptions proto fields nor reachable through the existing options: array. This PR plumbs them through, following the file's existing patterns.
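
For orientation, each entry in the options: array is a key:value string that params_parse maps onto a common_params field. A minimal sketch of the format (the model name, file, and option values are placeholders, and the surrounding layout assumes the usual LocalAI model-config YAML shape):

```yaml
# Sketch of a LocalAI model config using the free-form options: array.
# Each entry is a key:value string handled by params_parse in grpc-server.cpp.
name: example-model           # placeholder
backend: llama-cpp
parameters:
  model: example-model.gguf   # placeholder GGUF file
options:
- n_ubatch:1024               # newly exposed keys, detailed below
- direct_io:true
```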

Newly accepted keys for the options: array (each group below is followed by a short usage sketch):

Top-level / batching / IO

  • n_ubatch (alias ubatch) — physical batch size; previously force-aliased to n_batch, blocking embedding/rerank workloads needing independent control
  • threads_batch (alias n_threads_batch) — main-model batch threads, mirrors the existing draft_threads_batch
  • direct_io (alias use_direct_io) — O_DIRECT model loads
  • verbosity — llama.cpp log threshold (was commented out at line 479)
  • override_tensor (alias tensor_buft_overrides) — per-tensor buffer overrides for the main model, mirrors draft_override_tensor
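
A hedged sketch of the batching/IO group in use (the values are illustrative, and the override_tensor value assumes the same pattern=buffer-type syntax as llama.cpp's --override-tensor flag):

```yaml
options:
- n_ubatch:1024              # physical batch size, now decoupled from n_batch
- threads_batch:8            # batch threads for the main model
- direct_io:true             # O_DIRECT model loads
- verbosity:1                # llama.cpp log threshold
- override_tensor:exps=CPU   # assumed pattern=buffer syntax, mirrors draft_override_tensor
```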

Embedding / multimodal

  • pooling_type (alias pooling) — mean/cls/last/rank/none for embeddings (was only auto-flipped to RANK for rerankers)
  • embd_normalize (alias embedding_normalize) — -1/0/1/2/>2; the embedding handler also now reads params_base.embd_normalize instead of a hardcoded 2
  • mmproj_use_gpu (alias mmproj_offload) — keep mmproj on CPU
  • image_min_tokens / image_max_tokens — per-image vision token budget
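
A hedged sketch of the embedding/multimodal group (the token budgets are illustrative numbers, not recommendations):

```yaml
options:
- pooling_type:mean       # mean/cls/last/rank/none
- embd_normalize:0        # -1/0/1/2/>2, per the list above
- mmproj_use_gpu:false    # keep the projector on CPU
- image_min_tokens:64     # illustrative per-image vision token budget
- image_max_tokens:1024
```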

Reasoning surface (audit-confirmed gaps)

LocalAI's existing ReasoningConfig.DisableReasoning only feeds the per-request chat_template_kwargs.enable_thinking body field at request time; it does not touch the load-time params.reasoning_format parser or params.enable_reasoning budget. Those remained unwired until this PR.

  • reasoning_format — none/auto/deepseek/deepseek-legacy parser for <think> blocks (default: deepseek)
  • enable_reasoning (alias reasoning_budget) — -1 unlimited / 0 off / >0 token budget
  • prefill_assistant — toggle the trailing-assistant-message prefill in the chat template
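
A hedged sketch of the reasoning group, matching the test-plan spot-check below:

```yaml
options:
- reasoning_format:deepseek   # parse <think> blocks
- enable_reasoning:-1         # unlimited thinking budget
- prefill_assistant:false     # skip the trailing-assistant-message prefill
```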

Compatibility

All 14 referenced fields/types exist on both the upstream pin (7f3f843c) and on the TheTom/llama-cpp-turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard is needed.

Docs

docs/content/advanced/model-configuration.md gets four new subsections:

  • Reasoning Models (load-time options + a note clarifying the per-request enable_thinking distinction)
  • Multimodal Backend Options
  • Embedding & Reranking Backend Options
  • Other Backend Tuning Options

The Speculative Type Values table is also refreshed to show the new dash-separated canonical names from upstream ggml-org/llama.cpp#22964 alongside the underscore aliases LocalAI still accepts.

Test plan

  • tests-llama-cpp-grpc CI job builds the backend cleanly (compile gate)
  • Spot-check a model config with options: [reasoning_format:deepseek, enable_reasoning:-1] against a DeepSeek-R1 GGUF and confirm reasoning blocks parse correctly (full config sketch after this list)
  • Spot-check a vision model with options: [mmproj_use_gpu:false] and confirm the projector loads on CPU (GPU memory usage drops)
  • Spot-check an embedding model with options: [pooling_type:mean, embd_normalize:0] and confirm the embedding shape/normalization changes
  • Smoke-test options: [n_ubatch:1024] on a rerank workload and confirm we no longer hit the "input is too large to process" error from the 512 n_ubatch cap
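
A fuller sketch of the reasoning spot-check config (the name and model path are placeholders; the backend name and YAML layout are assumed):

```yaml
name: deepseek-r1-spotcheck   # placeholder name
backend: llama-cpp
parameters:
  model: deepseek-r1.gguf     # placeholder GGUF path
options:
- reasoning_format:deepseek
- enable_reasoning:-1
```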

The llama.cpp backend already accepts a free-form options: array in the
model config that maps to common_params fields, but a coverage audit
against upstream pin 7f3f843c flagged 12 user-visible knobs that were
neither set via the typed proto fields nor reachable via options:.

Wire them up under the existing if/else chain in params_parse, before
the speculative section. Each new option follows the file's prevailing
patterns (try/catch around numeric parses, the same true/1/yes/on bool
form used elsewhere, hardware_concurrency() fallback for thread counts,
mirror of draft_override_tensor for override_tensor).
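
On the config side, that bool form means any of these spellings should
be equivalent (usage sketch, not exhaustive):

  options:
  - direct_io:true   # equivalently direct_io:1 / direct_io:yes / direct_io:on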

Top-level / batching / IO:
  - n_ubatch (alias ubatch) -- physical batch size; was previously
    force-aliased to n_batch at line 482, blocking embedding/rerank
    workloads that need independent control
  - threads_batch (alias n_threads_batch) -- main-model batch threads;
    mirrors the existing draft_threads_batch
  - direct_io (alias use_direct_io) -- O_DIRECT model loads
  - verbosity -- llama.cpp log threshold (line 479 had this commented
    out)
  - override_tensor (alias tensor_buft_overrides) -- per-tensor buffer
    overrides for the main model; mirrors draft_override_tensor

Embedding / multimodal:
  - pooling_type (alias pooling) -- mean/cls/last/rank/none; previously
    only auto-flipped to RANK for rerankers
  - embd_normalize (alias embedding_normalize) -- accepted values
    -1/0/1/2/>2; the embedding handler now reads
    params_base.embd_normalize instead of the hardcoded 2 in Embedding()
  - mmproj_use_gpu (alias mmproj_offload) -- mmproj on CPU vs GPU
  - image_min_tokens / image_max_tokens -- per-image vision token budget

Reasoning surface (the three audit-confirmed gaps; LocalAI's existing
ReasoningConfig.DisableReasoning only feeds the per-request
chat_template_kwargs.enable_thinking and does not touch any of these):
  - reasoning_format -- none/auto/deepseek/deepseek-legacy parser
  - enable_reasoning (alias reasoning_budget) -- -1/0/>0 thinking budget
  - prefill_assistant -- trailing-assistant-message prefill toggle

All 14 referenced fields exist on both the upstream pin and the
turboquant fork's common.h, so no LOCALAI_LEGACY_LLAMA_CPP_SPEC guard
is needed.

Docs: extend model-configuration.md with new "Reasoning Models",
"Multimodal Backend Options", "Embedding & Reranking Backend Options",
and "Other Backend Tuning Options" subsections; also refresh the
Speculative Type Values table to show the new dash-separated canonical
names alongside the underscore aliases LocalAI still accepts.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: claude-code:claude-opus-4-7