`text-to-text`

HF group: NLP / Multimodal · Status: ✅ shipped

HF aliases: text-generation.

What it is

Classic LLM chat. Tokens in, tokens out, streaming. The flagship Locara modality and the foundation everything else falls back on when there’s no specialized model wired.

Open-weight models (≤30 B params, instruct/chat tuned)

Model	Params	Released	License	Quality	Notes
Qwen3-1.7B	1.7 B	2026-04	Apache-2.0	Solid for size	Default Locara reference. ~2 GB Q4.
Qwen3-7B-Instruct	7 B	2026-04	Apache-2.0	Strong	Quality/speed sweet spot for M-series 16 GB.
Qwen3-30B-A3B (MoE)	30 B / 3 B active	2026-04	Apache-2.0	GPT-4-class on many benches	MoE — only 3 B active per token, fits in ~16 GB Q4.
Llama-4-8B-Instruct	8 B	2026-Q1	Llama community	Strong	Strict license; not Locara-default.
Mistral Medium 3.5	24 B	2026	Mistral Research	Strong	Research-only license.
Gemma 3-26B-A4B (MoE)	26 B / 3.8 B active	2026	Gemma	Strong	MoE; ~15 GB Q4.
Phi-4-mini	3.8 B	2026	MIT	Punch-above-weight	Microsoft. Fast on M2/M3.
DeepSeek-V3-Lite	16 B / 2 B active	2026	DeepSeek	Strong	Permissive license; MoE.

Infrastructure required

Inference

✅ locara-llama wraps llama.cpp with Metal acceleration. Handles GGUF / safetensors, dynamic quantization, KV-cache.
✅ Chat-template aware tokenizer so different models’ prompt formats work without per-model code in the SDK.

Input

Plain UTF-8 text strings. No special capture infrastructure.

Output

✅ Streaming token Channel<TokenEvent> with cancellation via cooperative cancel + AbortSignal from the SDK side.

Storage

✅ Weights via locara-models::Cache (content-addressed blobs/<sha> layout, refcount-based GC).
❌ KV-cache warm-keep across sessions (would speed up turn-2 latency; not implemented).
Stateless turn-based — no per-session DB rows.

Interaction (IPC + SDK)

✅ IPC: llm.chat, llm.chat_stream (Tauri commands in crates/locara-runtime/src/tauri_plugin.rs).
✅ SDK: llm.chat({ messages, options }), llm.chatStream(...) in packages/sdk/src/llm.ts.

Capabilities (manifest)

✅ capabilities.models[] must list a chat-tuned model (e.g. qwen2.5-1.5b-instruct-q4_k_m@sha256:...). Per-call enforcement against Capability::Model(...) in the runtime.

Gaps

Grammar / JSON-mode constraint sampling is wired but feature-gated off pending an upstream llama.cpp fix (BACKLOG: “Re-enable grammar in agent loop”).
KV-cache warm-keep across sessions for faster turn-2 latency: not implemented.

text-to-text