text-to-text
HF group: NLP / Multimodal · Status: ✅ shipped
HF aliases: text-generation.
What it is
Classic LLM chat. Tokens in, tokens out, streaming. The flagship Locara modality and the foundation everything else falls back on when there’s no specialized model wired.
Open-weight models (≤30 B params, instruct/chat tuned)
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1.7 B | 2026-04 | Apache-2.0 | Solid for size | Default Locara reference. ~2 GB Q4. |
| Qwen3-7B-Instruct | 7 B | 2026-04 | Apache-2.0 | Strong | Quality/speed sweet spot for M-series 16 GB. |
| Qwen3-30B-A3B (MoE) | 30 B / 3 B active | 2026-04 | Apache-2.0 | GPT-4-class on many benches | MoE — only 3 B active per token, fits in ~16 GB Q4. |
| Llama-4-8B-Instruct | 8 B | 2026-Q1 | Llama community | Strong | Strict license; not Locara-default. |
| Mistral Medium 3.5 | 24 B | 2026 | Mistral Research | Strong | Research-only license. |
| Gemma 3-26B-A4B (MoE) | 26 B / 3.8 B active | 2026 | Gemma | Strong | MoE; ~15 GB Q4. |
| Phi-4-mini | 3.8 B | 2026 | MIT | Punch-above-weight | Microsoft. Fast on M2/M3. |
| DeepSeek-V3-Lite | 16 B / 2 B active | 2026 | DeepSeek | Strong | Permissive license; MoE. |
Infrastructure required
Inference
- ✅
locara-llamawraps llama.cpp with Metal acceleration. Handles GGUF / safetensors, dynamic quantization, KV-cache. - ✅ Chat-template aware tokenizer so different models’ prompt formats work without per-model code in the SDK.
Input
- Plain UTF-8 text strings. No special capture infrastructure.
Output
- ✅ Streaming token
Channel<TokenEvent>with cancellation via cooperative cancel +AbortSignalfrom the SDK side.
Storage
- ✅ Weights via
locara-models::Cache(content-addressedblobs/<sha>layout, refcount-based GC). - ❌ KV-cache warm-keep across sessions (would speed up turn-2 latency; not implemented).
- Stateless turn-based — no per-session DB rows.
Interaction (IPC + SDK)
- ✅ IPC:
llm.chat,llm.chat_stream(Tauri commands incrates/locara-runtime/src/tauri_plugin.rs). - ✅ SDK:
llm.chat({ messages, options }),llm.chatStream(...)inpackages/sdk/src/llm.ts.
Capabilities (manifest)
- ✅
capabilities.models[]must list a chat-tuned model (e.g.qwen2.5-1.5b-instruct-q4_k_m@sha256:...). Per-call enforcement againstCapability::Model(...)in the runtime.
Gaps
- Grammar / JSON-mode constraint sampling is wired but feature-gated off pending an upstream llama.cpp fix (BACKLOG: “Re-enable grammar in agent loop”).
- KV-cache warm-keep across sessions for faster turn-2 latency: not implemented.
See also
text-to-text-thinking— same shape but with reasoning-trace splitting.text-to-code— code-specific fine-tunes ride on the same inference path.- Crates:
locara-llama,locara-core::InferenceBackend. - Index:
../modalities-and-models-survey.md