09 — Models
Approach
Locara does not build a competing model registry. Hugging Face is the source of truth for weights. Locara provides a curated layer on top:
- Locara model manifest — a curated subset of HF models we’ve tested and signed off on, with validated chat templates, tokenizer configs, and recommended params.
- Content-addressed cache — local, shared across apps, deduped by SHA.
- Routing layer — picks llama.cpp vs MLX based on hardware.
- Pinning — every app pins exact model hashes; reproducible.
Locara model manifest
A separate manifest from app manifests. Lives in registry/models/<model-id>.json:
{
  "id": "qwen2.5-3b-instruct-q4",
  "version": "1.0.2",
  "displayName": "Qwen 2.5 3B Instruct (Q4_K_M)",
  "description": "Quantized chat model, good for general assistant tasks.",
  "tags": ["chat", "general", "small"],
  "license": "Apache-2.0",
  "license_url": "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct/blob/main/LICENSE",
  "modality": "chat",
  "context_length": 32768,
  "size_bytes": 2000000000,
  "ram_required_gb": 4,
  "artifacts": {
    "llamacpp": {
      "format": "gguf",
      "url": "https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf",
      "sha256": "abc123..."
    },
    "mlx": {
      "format": "mlx-quantized",
      "url": "https://huggingface.co/mlx-community/Qwen2.5-3B-Instruct-4bit/resolve/main/...",
      "sha256": "def456..."
    }
  },
  "config": {
    "chat_template": "...",
    "stop_tokens": ["<|im_end|>"],
    "default_params": {
      "temperature": 0.7,
      "top_p": 0.9
    }
  },
  "validated_at": "2026-04-01",
  "validated_by": "locara-team"
}
Apps reference this manifest by id + sha256 of the artifact:
"models": [
  "qwen2.5-3b-instruct-q4@sha256:abc123..."
]
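A pinned reference of this shape can be split into its id and digest with a small parser. The following is a minimal sketch, not the runtime's actual code: the `ModelRef` type, the `parse_model_ref` name, the allowed id characters, and the requirement of a full 64-character lowercase hex digest are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class ModelRef(NamedTuple):
    """Hypothetical parsed form of an "id@sha256:<digest>" reference."""
    model_id: str
    sha256: str


# Assumption: ids use letters, digits, '.', '_' and '-';
# digests are full 64-character lowercase hex strings.
_REF_RE = re.compile(r"(?P<id>[A-Za-z0-9._-]+)@sha256:(?P<sha>[0-9a-f]{64})")


def parse_model_ref(ref: str) -> ModelRef:
    """Split a pinned model reference into (model_id, sha256)."""
    match = _REF_RE.fullmatch(ref)
    if match is None:
        raise ValueError(f"not a pinned model reference: {ref!r}")
    return ModelRef(match.group("id"), match.group("sha"))
```

Rejecting anything that is not a complete digest keeps the pin exact, which is what makes installs reproducible.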
Content addressing
The shared model cache lives at ~/Library/Caches/Locara/models/. Each model artifact is stored in a directory named after its SHA-256 digest:
models/
├── abc123.../
│ ├── model.gguf # the artifact
│ └── meta.json # locara-side metadata (id, version, last_used)
├── def456.../
│ └── model.mlx
└── ...
App bundles contain hardlinks into this cache, not copies. Multiple apps using qwen2.5-3b-instruct-q4@sha256:abc123... share one copy on disk.
Garbage collection: when an app is uninstalled, its hardlinks are removed with it. When the reference count on a model directory reaches zero, the runtime can garbage-collect that cache directory (configurable; default = “GC if untouched for 30 days”).
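The garbage-collection rule can be sketched as a check on the hardlink count plus an idle timer. This is a minimal sketch under stated assumptions: `is_collectable` is a hypothetical helper, `last_used` in meta.json is assumed to be Unix epoch seconds, and the file's link count is used as a stand-in for the refcount (the text does not say how the runtime tracks references).

```python
import json
from pathlib import Path

GC_IDLE_SECONDS = 30 * 24 * 60 * 60  # default from the text: 30 days


def is_collectable(model_dir: Path, now: float,
                   idle_seconds: float = GC_IDLE_SECONDS) -> bool:
    """True if no app bundle links the artifact and it has been idle long enough."""
    meta = json.loads((model_dir / "meta.json").read_text())
    artifacts = [p for p in model_dir.iterdir()
                 if p.is_file() and p.name != "meta.json"]
    # A link count above 1 means some app bundle still hardlinks the artifact.
    referenced = any(p.stat().st_nlink > 1 for p in artifacts)
    # Assumption: last_used is stored as Unix epoch seconds.
    idle_for = now - float(meta["last_used"])
    return not referenced and idle_for >= idle_seconds
```

Reading `last_used` from meta.json rather than the filesystem access time avoids depending on how the volume is mounted.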
Fetch flow
- App is installed, manifest references models.
- Runtime checks cache for each model’s SHA.
- For missing models: fetch from artifacts.<backend>.url (Hugging Face direct or Locara CDN mirror).
- Verify SHA-256 against manifest after download.
- If mismatch: delete, error, refuse install.
- If match: hardlink into app bundle.
The Locara runtime does the fetch, not the app. Apps never need network access for model loading.
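The verify-then-link steps of this flow could look roughly like the sketch below. The download itself is left out. `sha256_of` and `install_artifact` are hypothetical names, and the sketch assumes the downloaded file, the cache, and the app bundle sit on the same volume, since hardlinks and `os.replace` do not cross filesystems.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte weights never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_artifact(downloaded: Path, expected_sha256: str,
                     cache_root: Path, bundle_path: Path) -> Path:
    """Verify a downloaded artifact, move it into the cache, hardlink it into the bundle."""
    actual = sha256_of(downloaded)
    if actual != expected_sha256:
        downloaded.unlink()  # mismatch: delete, error, refuse install
        raise ValueError(f"SHA-256 mismatch: expected {expected_sha256}, got {actual}")
    cache_dir = cache_root / expected_sha256
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / downloaded.name
    if cached.exists():
        downloaded.unlink()  # already cached under this digest; drop the duplicate
    else:
        os.replace(downloaded, cached)
    bundle_path.parent.mkdir(parents=True, exist_ok=True)
    os.link(cached, bundle_path)  # hardlink, not a copy
    return cached
```

Because the cache directory is named by the digest, a second app that pins the same hash finds the artifact already present and only pays for a new hardlink.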
(open) Should Locara host its own CDN mirror of curated models? Pro: more reliable, faster, dedup across users. Con: bandwidth costs, legal questions on hosting weights. Probably yes for the most popular ~50 models, no for the long tail.
Inference backends
llama.cpp
- Cross-platform (Mac, Linux, Windows).
- Quantizations: Q2, Q3, Q4, Q5, Q6, Q8 (recommend Q4_K_M as default).
- Format: GGUF.
- Rust binding: llama-cpp-2 or our own.
- Apple Silicon: Metal acceleration.
MLX (Apple Silicon only)
- Native to Apple Silicon, ~30–50% faster than llama.cpp on M-series.
- Format: MLX-quantized weights.
- Integration: Swift FFI or mlx-rs (open).
- Smaller model coverage than llama.cpp ecosystem.
Routing rules
For an app declaring a model with both backends:
if user_device == "Apple Silicon" && model.artifacts.mlx exists:
use MLX
else:
use llama.cpp
The app’s models[] declares one logical model; the runtime picks the backend. Apps don’t need to think about it.
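The routing rule above maps directly onto a small function. This is a sketch only: `is_apple_silicon` and `pick_backend` are hypothetical names, and detecting Apple Silicon via `platform.system()` and `platform.machine()` is an assumption (a Python process running under Rosetta reports x86_64, so a real runtime would likely need a more robust check).

```python
import platform


def is_apple_silicon() -> bool:
    """Best-effort check; assumes a native (non-Rosetta) process."""
    return platform.system() == "Darwin" and platform.machine() == "arm64"


def pick_backend(artifacts: dict, apple_silicon: bool) -> str:
    """Choose a backend key from a model manifest's `artifacts` object."""
    if apple_silicon and "mlx" in artifacts:
        return "mlx"
    if "llamacpp" in artifacts:
        return "llamacpp"
    # E.g. an MLX-only model on a device without Apple Silicon.
    raise LookupError("no artifact usable on this device")
```

Passing `apple_silicon` as an argument instead of probing inside `pick_backend` keeps the rule testable on any machine.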
If an app needs MLX-only behavior (e.g., for perf), it can declare the MLX-specific artifact; the app then fails to install on devices without Apple Silicon.
(open: A vs B) v1 default backend on Apple Silicon. Leaning MLX with llama.cpp fallback. Decision criteria:
- If MLX integration is mature enough (mlx-rs is fine) → MLX-default.
- If MLX requires Swift FFI complexity → llama.cpp-default for v1, MLX-default in v2.
Model categories (curated)
For v1, the curated registry includes representative models per category:
| Modality | Recommended | Alternatives |
|---|---|---|
| Chat (small) | Qwen2.5-3B-Instruct-Q4 | Llama-3.2-3B-Instruct-Q4 |
| Chat (medium) | Qwen2.5-7B-Instruct-Q4 | Llama-3.1-8B-Instruct-Q4 |
| Chat (large) | Qwen2.5-14B-Instruct-Q4 | — |
| Embedding | nomic-embed-text-v1.5 | bge-large-en-v1.5 |
| STT | Whisper-large-v3-Q4 | Whisper-base-Q4 (low-tier) |
| OCR | GLM-OCR-1.5 | RapidOCR |
| Vision | Qwen2-VL-2B / 7B | LLaVA-NeXT |
Each entry has been tested and validated for chat template and tokenizer correctness, with its recommended params verified and its license confirmed.
The registry expands over time. Adding a model = a PR with the manifest entry + signed-off validation.
Quantization recommendations
- Q4_K_M is the v1 recommended quantization for chat models. Best size/quality tradeoff.
- Q8 for embedding / OCR / vision models where smaller quants degrade more.
- Q2/Q3 only on resource-constrained devices.
The manifest can declare multiple quants per logical model; the runtime picks based on device profile.
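The text does not specify how the runtime chooses among quants, so the following is one possible policy rather than the actual rule: pick the largest variant whose `ram_required_gb` fits the device. `pick_quant` is a hypothetical helper, and treating a larger RAM requirement as a proxy for higher quality is an assumption.

```python
from typing import Optional


def pick_quant(variants: list[dict], device_ram_gb: float) -> Optional[dict]:
    """Return the most demanding variant that still fits in device RAM, or None."""
    fitting = [v for v in variants if v["ram_required_gb"] <= device_ram_gb]
    # Assumption: a larger RAM requirement implies a higher-quality quant.
    return max(fitting, key=lambda v: v["ram_required_gb"], default=None)


# Illustrative variants; only the Q4 RAM figure comes from the manifest example,
# the other ids and figures are made up for the sketch.
variants = [
    {"id": "example-q2", "ram_required_gb": 2},
    {"id": "qwen2.5-3b-instruct-q4", "ram_required_gb": 4},
    {"id": "example-q8", "ram_required_gb": 6},
]
choice = pick_quant(variants, device_ram_gb=5)
```

A real device profile would presumably also account for memory already in use and for the recommended default (Q4_K_M for chat models), which this sketch ignores.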
App-bundled vs registry-fetched models
v1 default: Registry-fetched. App manifest references model by id+SHA; runtime fetches into shared cache.
v1 also supported: App-bundled. Small models (<200MB) can ship inside the app bundle if the developer prefers. Useful for purely-offline apps that don’t want any network at all.
"models": [
  {
    "id": "small-utility-model",
    "bundled": true,
    "path": "./models/small.gguf",
    "sha256": "abc..."
  }
]
Bundled models still require an SHA, still get verified, still go through the same loading path.
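One way to keep a single loading path is to normalize both entry shapes into the same record before verification. The sketch below is illustrative: `normalize_entry` and the output field layout are hypothetical, not part of the manifest spec.

```python
def normalize_entry(entry) -> dict:
    """Turn a models[] entry (pinned string or bundled object) into one record shape."""
    if isinstance(entry, str):
        # Registry-fetched form: "<id>@sha256:<digest>"
        model_id, sep, digest = entry.partition("@sha256:")
        if not sep or not model_id or not digest:
            raise ValueError(f"not a pinned model reference: {entry!r}")
        return {"id": model_id, "sha256": digest, "bundled": False, "path": None}
    if isinstance(entry, dict) and entry.get("bundled"):
        # App-bundled form: the SHA is still mandatory.
        return {"id": entry["id"], "sha256": entry["sha256"],
                "bundled": True, "path": entry["path"]}
    raise ValueError(f"unsupported models[] entry: {entry!r}")
```

After normalization, the only difference between the two cases is where the bytes come from: a local `path` for bundled models, the shared cache otherwise.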
License compliance
Each Locara-curated model has a verified license. The model manifest includes license (SPDX) and license_url.
Apps using non-permissive models (e.g., Llama’s community license) get a warning at locara verify:
⚠ Model "llama-3.1-8B" uses the Llama Community License.
Apps using this model may have commercial use restrictions.
Continue? [y/N]
The registry surfaces license info on app pages so users see “this app uses models with license X.”
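The warning step could be driven by a simple allowlist of permissive SPDX identifiers. This is a sketch under assumptions: the contents of `PERMISSIVE_SPDX`, the `license_warnings` helper, and the `LicenseRef-Llama-Community` identifier used in the example are all hypothetical; the text does not say how locara verify classifies licenses.

```python
# Assumption: a hand-maintained allowlist; anything outside it triggers a warning.
PERMISSIVE_SPDX = {"Apache-2.0", "MIT", "BSD-2-Clause", "BSD-3-Clause"}


def license_warnings(models: list[dict]) -> list[str]:
    """Return one warning line per model whose license is not on the allowlist."""
    return [
        f'Model "{m["id"]}" uses license {m["license"]}; '
        "apps using this model may have commercial use restrictions."
        for m in models
        if m["license"] not in PERMISSIVE_SPDX
    ]


warnings = license_warnings([
    {"id": "qwen2.5-3b-instruct-q4", "license": "Apache-2.0"},
    # Hypothetical identifier for a custom license.
    {"id": "llama-3.1-8B", "license": "LicenseRef-Llama-Community"},
])
```

An allowlist fails closed: a license the registry has not classified yet produces a warning rather than passing silently.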
Future considerations
- Fine-tuned model registry — apps want to ship custom fine-tunes. Out of v1.
- Streaming-protocol speculation (e.g., fast token-streaming via shared memory) — perf optimization.
- Per-user model fine-tuning — way out of scope.
- Model A/B testing — out of scope.
Cross-references
- Manifest model declarations: 02-manifest.md
- Modality declarations that imply models: 04-modalities.md
- SDK calling models: 05-sdk.md
- Runtime model lifecycle: 07-runtime.md
- HF Hub research: ../notes/huggingface-hub.md
- Ollama Modelfile pattern: ../notes/ollama.md
- LM Studio MLX integration: ../notes/lm-studio.md