Locara

text-to-text

HF group: NLP / Multimodal · Status: ✅ shipped

HF aliases: text-generation.

What it is

Classic LLM chat. Tokens in, tokens out, streaming. The flagship Locara modality and the foundation everything else falls back on when there’s no specialized model wired.

Open-weight models (≤30 B params, instruct/chat tuned)

ModelParamsReleasedLicenseQualityNotes
Qwen3-1.7B1.7 B2026-04Apache-2.0Solid for sizeDefault Locara reference. ~2 GB Q4.
Qwen3-7B-Instruct7 B2026-04Apache-2.0StrongQuality/speed sweet spot for M-series 16 GB.
Qwen3-30B-A3B (MoE)30 B / 3 B active2026-04Apache-2.0GPT-4-class on many benchesMoE — only 3 B active per token, fits in ~16 GB Q4.
Llama-4-8B-Instruct8 B2026-Q1Llama communityStrongStrict license; not Locara-default.
Mistral Medium 3.524 B2026Mistral ResearchStrongResearch-only license.
Gemma 3-26B-A4B (MoE)26 B / 3.8 B active2026GemmaStrongMoE; ~15 GB Q4.
Phi-4-mini3.8 B2026MITPunch-above-weightMicrosoft. Fast on M2/M3.
DeepSeek-V3-Lite16 B / 2 B active2026DeepSeekStrongPermissive license; MoE.

Infrastructure required

Inference

  • locara-llama wraps llama.cpp with Metal acceleration. Handles GGUF / safetensors, dynamic quantization, KV-cache.
  • ✅ Chat-template aware tokenizer so different models’ prompt formats work without per-model code in the SDK.

Input

  • Plain UTF-8 text strings. No special capture infrastructure.

Output

  • ✅ Streaming token Channel<TokenEvent> with cancellation via cooperative cancel + AbortSignal from the SDK side.

Storage

  • ✅ Weights via locara-models::Cache (content-addressed blobs/<sha> layout, refcount-based GC).
  • ❌ KV-cache warm-keep across sessions (would speed up turn-2 latency; not implemented).
  • Stateless turn-based — no per-session DB rows.

Interaction (IPC + SDK)

  • ✅ IPC: llm.chat, llm.chat_stream (Tauri commands in crates/locara-runtime/src/tauri_plugin.rs).
  • ✅ SDK: llm.chat({ messages, options }), llm.chatStream(...) in packages/sdk/src/llm.ts.

Capabilities (manifest)

  • capabilities.models[] must list a chat-tuned model (e.g. qwen2.5-1.5b-instruct-q4_k_m@sha256:...). Per-call enforcement against Capability::Model(...) in the runtime.

Gaps

  • Grammar / JSON-mode constraint sampling is wired but feature-gated off pending an upstream llama.cpp fix (BACKLOG: “Re-enable grammar in agent loop”).
  • KV-cache warm-keep across sessions for faster turn-2 latency: not implemented.

See also