21 — Performance Budgets
Concrete numerical targets so “good enough” is measurable. Without budgets, performance regressions are invisible and “feels slow” arguments are unwinnable.
Why budgets
Local AI apps live or die on latency. A 200ms vs 500ms first-token time is the difference between “snappy” and “annoying.” Without explicit budgets, each component team optimizes independently (or doesn’t), and user-perceived performance is the accumulated effect of all of them.
These budgets are:
- Targets: what we aim for in v1.
- Tested: `locara test` includes performance assertions.
- Per-tier: different budgets for low/mid/high device profiles (see 02-manifest.md).
- Enforceable in CI: regressions fail the build.
Reference hardware
All targets are stated for these reference devices (representative of real users in 2026):
| Tier | Reference | RAM | GPU/NPU |
|---|---|---|---|
| low | Mac mini M1 (2020) | 8 GB | Apple Silicon GPU, 8-core |
| mid | MacBook Pro M2 (2022) | 16 GB | Apple Silicon GPU, 10-core |
| high | Mac Studio M2 Max (2023) | 32 GB | Apple Silicon GPU, 30-core |
Newer Apple Silicon (M3, M4) exceeds these by 20–40%; older Intel Macs are out of scope.
Cold-start budgets
Time from user double-clicking the app to the app’s UI being interactive (first interactive content rendered, accepting input):
| Phase | Target (mid) | Target (low) | Target (high) |
|---|---|---|---|
| Process spawn | < 100ms | < 200ms | < 50ms |
| Tauri webview init | < 300ms | < 500ms | < 200ms |
| Frontend bundle load | < 400ms | < 700ms | < 200ms |
| First model preload (background) | < 5s | < 10s | < 3s |
| Total to interactive | < 1.5s | < 2.5s | < 1s |
Models load lazily by default — the app is interactive before models are loaded. Eager-load is opt-in via manifest.
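As an illustration of how the cold-start table could be turned into a performance assertion, here is a minimal sketch. The dictionary layout, the phase keys, and the function name are assumptions made for this example, not a Locara format. The background model preload is left out because it does not block interactivity.

```python
# Cold-start budgets in milliseconds, copied from the table above.
# The layout and key names are illustrative, not a Locara format.
COLD_START_BUDGET_MS = {
    "low":  {"process_spawn": 200, "webview_init": 500,
             "bundle_load": 700, "total_interactive": 2500},
    "mid":  {"process_spawn": 100, "webview_init": 300,
             "bundle_load": 400, "total_interactive": 1500},
    "high": {"process_spawn": 50, "webview_init": 200,
             "bundle_load": 200, "total_interactive": 1000},
}


def check_cold_start(tier, measured_ms):
    """Return (phase, measured, budget) for every measured phase at or over budget."""
    budget = COLD_START_BUDGET_MS[tier]
    return [
        (phase, measured_ms[phase], limit)
        for phase, limit in budget.items()
        # Budgets are strict ("< 300ms"), so hitting the limit counts as a miss.
        if phase in measured_ms and measured_ms[phase] >= limit
    ]
```

A run that measures 350ms for webview init on a mid-tier device would be reported as a violation of that phase even if the total stays under 1.5s, which keeps per-phase regressions visible.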
Inference budgets
For text-to-text via llama.cpp:
| Model size | Target (mid, M2) | Target (low, M1) | Target (high, M2 Max) |
|---|---|---|---|
| 3B Q4 | TTFT < 200ms; sustained > 50 tok/s | TTFT < 400ms; > 25 tok/s | TTFT < 100ms; > 80 tok/s |
| 7B Q4 | TTFT < 350ms; > 30 tok/s | (not recommended on low) | TTFT < 200ms; > 50 tok/s |
| 14B Q4 | (not recommended on mid) | (not on low) | TTFT < 400ms; > 25 tok/s |
TTFT = time to first token. Numbers are for short prompts (<512 tokens); long-context inference is materially slower and has a separate budget.
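The two inference metrics can be derived from per-token timestamps. The sketch below assumes that "sustained" means the decode rate after the first token (so prompt processing is excluded); that definition and the function name are assumptions, since the spec does not pin them down.

```python
def stream_metrics(request_time, token_times):
    """Compute TTFT in ms and the decode rate in tok/s from timestamps in seconds.

    request_time: when the request was submitted.
    token_times:  arrival time of each generated token, in order.
    Returns (ttft_ms, tok_per_s); tok_per_s is None with fewer than two tokens.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft_ms = (token_times[0] - request_time) * 1000.0
    # Assumption: the sustained rate is measured after the first token,
    # because the first token also pays for prompt processing.
    decode_span = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_span <= 0:
        return ttft_ms, None
    return ttft_ms, (len(token_times) - 1) / decode_span
```

Timestamps would typically come from a monotonic clock such as `time.perf_counter()`, recorded as each token arrives from the inference backend.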
For embeddings:
| Model | Target latency (per 1k texts, mid) |
|---|---|
| nomic-embed-text-v1.5 | < 5s |
| bge-large-en-v1.5 | < 10s |
For STT:
| Model | Target latency (per 1 minute audio, mid) |
|---|---|
| Whisper-base-Q4 | < 5s |
| Whisper-large-v3-Q4 | < 30s |
For OCR:
| Model | Target latency (per page, mid) |
|---|---|
| GLM-OCR-1.5 | < 3s |
| RapidOCR | < 1s |
These are minimum acceptable performance levels. If a model on Locara performs worse than these on its target tier, we re-quantize it, choose a different model variant, or de-list it.
Storage budgets
SQLite query latency on a typical app database (~10k–100k rows):
| Operation | Target (mid) |
|---|---|
| `SELECT *` simple | < 1ms |
| Indexed lookup by primary key | < 1ms |
| Join across 2 indexed tables | < 5ms |
| FTS5 keyword search | < 50ms |
| sqlite-vec similarity search (k=5, 10k vectors) | < 20ms |
| sqlite-vec similarity search (k=5, 100k vectors) | < 100ms |
| Hybrid search (FTS5 + vec, rank fusion) | < 150ms |
If an app’s actual query latency is more than 2x these targets, it’s a code smell: usually a missing index or an unbounded scan.
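One way to investigate a slow query is to time it and then ask SQLite for its query plan. The sketch below uses Python's standard `sqlite3` module and `EXPLAIN QUERY PLAN`; the `notes` table, the index name, and both helper functions are hypothetical and exist only for this example.

```python
import sqlite3
import time


def median_query_ms(conn, sql, params=(), repeats=50):
    """Median wall-clock time of a query in milliseconds, fetching all rows."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]


def does_full_scan(conn, sql, params=()):
    """True if any step of SQLite's query plan is a SCAN rather than a SEARCH."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable step description is the fourth column of each row.
    return any(row[3].startswith("SCAN") for row in plan)


# Hypothetical app table with 10k rows, matching the low end of the range above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, tag TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO notes (tag, body) VALUES (?, ?)",
    [(f"tag{i % 100}", f"note {i}") for i in range(10_000)],
)
conn.commit()

by_pk = "SELECT body FROM notes WHERE id = ?"
by_tag = "SELECT body FROM notes WHERE tag = ?"

print("pk lookup ms:", median_query_ms(conn, by_pk, (42,)))
print("tag lookup scans before index:", does_full_scan(conn, by_tag, ("tag7",)))
conn.execute("CREATE INDEX idx_notes_tag ON notes (tag)")
print("tag lookup scans after index:", does_full_scan(conn, by_tag, ("tag7",)))
```

The scan check is deliberately coarse: it also flags scans over a covering index, which can be acceptable, so a hit is a prompt to read the plan rather than a verdict.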
Memory budgets
Per-app memory ceiling, derived from the app’s declared profile:
| Profile | RAM budget (app + loaded models) |
|---|---|
| low (declares min_ram_gb: 8) | ≤ 4 GB |
| mid (declares min_ram_gb: 16) | ≤ 8 GB |
| high (declares min_ram_gb: 32) | ≤ 20 GB |
Each budget leaves headroom for OS overhead (~25% of system RAM kept free) and assumes 1–2 other apps are open.
The runtime tracks each app’s resident set + loaded model size and:
- Warns at 80% of budget.
- Unloads least-recently-used model at 95%.
- Refuses new model load + surfaces error at 100%.
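The three thresholds above could interact roughly as in the following sketch. The class name, the event strings, and the exact ordering (evict before loading, refuse only when the model cannot fit even with every other model unloaded) are assumptions for illustration; the actual runtime behavior is specified in 07-runtime.md.

```python
from collections import OrderedDict

WARN_AT = 0.80   # warn at 80% of budget
EVICT_AT = 0.95  # unload least-recently-used models at 95%


class ModelMemoryTracker:
    """Hypothetical tracker for app resident set plus loaded model sizes."""

    def __init__(self, budget_bytes, app_rss_bytes=0):
        self.budget = budget_bytes
        self.app_rss = app_rss_bytes
        self.models = OrderedDict()  # name -> size, least recently used first
        self.events = []

    def used(self):
        return self.app_rss + sum(self.models.values())

    def touch(self, name):
        """Mark a model as most recently used."""
        self.models.move_to_end(name)

    def load(self, name, size_bytes):
        # Refuse outright if the model exceeds the budget on its own.
        if self.app_rss + size_bytes > self.budget:
            raise MemoryError(f"{name} does not fit in the memory budget")
        # Unload LRU models while the load would reach the 95% mark.
        while self.models and self.used() + size_bytes >= EVICT_AT * self.budget:
            evicted, _ = self.models.popitem(last=False)
            self.events.append(f"unloaded {evicted}")
        self.models[name] = size_bytes
        if self.used() >= WARN_AT * self.budget:
            self.events.append(f"warning: {self.used() / self.budget:.0%} of budget in use")
```

Checking the refusal case first avoids unloading every model only to discover that the new one could never fit.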
Disk budgets
Per-app total disk footprint:
| Component | Target (mid app) |
|---|---|
| App bundle | < 30 MB |
| Locara runtime overhead | shared, ~30 MB |
| Bundled models (if any) | < 200 MB |
| Cached models (shared across apps) | varies; deduped via content addressing |
| Per-app SQLite DB | grows with use; no upper limit |
| Logs | < 10 MB; rotated |
Apps that need multi-GB models keep the binary small by not bundling them. This is the default: models are fetched at install.
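Deduplication via content addressing can be sketched as storing each model file under the hash of its bytes, so two apps that fetch the same model share one copy. The function name, the choice of SHA-256, and the flat cache layout are assumptions for this example, not the documented Locara cache format.

```python
import hashlib
import shutil
from pathlib import Path


def store_model(cache_dir: Path, source: Path) -> Path:
    """Copy a model file into the cache under its SHA-256 digest.

    If a file with the same content is already cached, nothing is copied
    and the existing path is returned.
    """
    digest = hashlib.sha256()
    with source.open("rb") as f:
        # Hash in 1 MiB chunks so multi-GB models are never fully in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    target = cache_dir / digest.hexdigest()
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return target
```

Because the file name is derived from the content, the same digest also serves as an integrity check after download.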
Battery budgets
Apps doing extended inference drain battery. Targets:
| Workload | Target (mid laptop, full charge) |
|---|---|
| Idle Locara Manager menubar utility (optional, phase 3+) | < 0.5% / hour |
| Open Locara app, no inference | < 1% / hour |
| Active LLM chat (user typing, occasional gen) | < 5% / hour |
| Continuous transcription of meeting | < 15% / hour |
| Sustained heavy inference (e.g., batch summarization) | < 25% / hour |
These are soft targets. Apps doing heavy work should warn the user (“this will use battery; consider plugging in”).
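For comparison against the table, a drain rate can be computed from two battery readings taken during a workload. This is a small arithmetic sketch; the function name is hypothetical, and how the readings are obtained is left open.

```python
def drain_pct_per_hour(start_pct, end_pct, elapsed_s):
    """Battery drain in percentage points per hour between two readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (start_pct - end_pct) * 3600.0 / elapsed_s
```

For example, dropping from 100% to 97% over 30 minutes is 6% / hour, which would exceed the active LLM chat target but sit well inside the continuous transcription target.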
Network budgets
Locara apps with net: false have zero network usage. (This is structural, not performance.) For apps that declare net:
| Operation | Target |
|---|---|
| Per-app update check (Tauri updater) | < 10 KB / day |
| Registry catalog fetch (browsing) | < 100 KB / session |
| Model fetch (one-time) | as fast as the user’s connection |
The Locara framework itself never makes network calls beyond:
- Registry update checks (per-app, daily by default; user-configurable per app).
- Model fetch on install (initiated by user installing an app).
- Submission to CI on `locara publish` (developer machine only).
Measuring + enforcement
locara test has a --bench mode that runs a fixed workload and measures against budgets:
```
$ locara test --bench
✓ Cold start: 1.2s (budget 1.5s)
✓ Chat TTFT: 180ms (budget 200ms)
✓ Chat sustained: 53 tok/s (budget 50 tok/s)
✗ Hybrid search: 200ms (budget 150ms) — REGRESSION
```
CI runs locara test --bench on every commit. Regressions fail the build.
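The comparison logic behind such a report might look like the sketch below. It is not the `locara test` implementation; the `Budget` type and `run_checks` function are hypothetical. The point it illustrates is that latency budgets are upper bounds while throughput budgets are lower bounds, so each budget needs a direction.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    limit: float
    unit: str
    higher_is_better: bool = False  # True for throughput such as tok/s

    def passes(self, measured):
        if self.higher_is_better:
            return measured > self.limit
        return measured < self.limit


def run_checks(results):
    """Print one line per (Budget, measured) pair and return the failure count."""
    failures = 0
    for budget, measured in results:
        ok = budget.passes(measured)
        failures += 0 if ok else 1
        status = "ok  " if ok else "FAIL"
        print(f"{status} {budget.name}: {measured:g}{budget.unit} "
              f"(budget {budget.limit:g}{budget.unit})")
    return failures


# Values taken from the sample report above.
failures = run_checks([
    (Budget("Cold start", 1.5, "s"), 1.2),
    (Budget("Chat TTFT", 200, "ms"), 180),
    (Budget("Chat sustained", 50, " tok/s", higher_is_better=True), 53),
    (Budget("Hybrid search", 150, "ms"), 200),
])
```

A CI wrapper would then exit with a non-zero status when `failures` is greater than zero, which is what makes a regression fail the build.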
For published apps, the registry auto-runs benchmarks against the reference hardware tier and publishes results on the app card. Users can see “this app: chat at 35 tok/s on mid hardware” before installing.
Profile picker hygiene
A user with a low-tier device installing an app that targets mid:
```
Transcribe targets 16 GB Macs.
Your Mac has 8 GB. The app may be slow or fail to load some models.
Suggested: low-tier alternative quantizations.
[Install anyway] [Cancel]
```
This is the device-fit advisor (already in 15-distribution.md) — performance budgets quantify when to show it.
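The trigger for the dialog can be sketched as a comparison between the device's RAM and the app's declared `min_ram_gb`. The function names are hypothetical, and a real advisor may weigh more than RAM (see 15-distribution.md).

```python
def needs_fit_warning(device_ram_gb, min_ram_gb):
    """True when the device has less RAM than the app's declared minimum."""
    return device_ram_gb < min_ram_gb


def fit_message(app_name, device_ram_gb, min_ram_gb):
    """Return the warning text for an under-spec device, or None if it fits."""
    if not needs_fit_warning(device_ram_gb, min_ram_gb):
        return None
    return (
        f"{app_name} targets {min_ram_gb} GB Macs.\n"
        f"Your Mac has {device_ram_gb} GB. "
        "The app may be slow or fail to load some models."
    )
```

Calling `fit_message("Transcribe", 8, 16)` produces the first two lines of the dialog, while a 16 GB device returns `None` and installs without a prompt.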
Open questions
- (open) Should we publish public benchmarks comparing Locara apps to cloud equivalents? Pro: marketing. Con: invites cherry-picking criticism. Probably yes, with full methodology disclosure.
- (open) Performance regression policy: how big a regression triggers what? A 10% regression in TTFT might be acceptable; 30% should fail CI.
- (open) Are battery budgets enforceable, or just a guideline? Enforcement requires running tests on real laptops; CI uses GitHub-hosted runners. Probably a guideline plus an occasional manual audit.
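To make the regression-policy question concrete, one possible shape is a tiered classifier over the relative change against a stored baseline. The thresholds reuse the 10% and 30% figures from the question above purely as placeholders; the policy itself is still open, and the function name is hypothetical.

```python
def classify_regression(baseline, current, higher_is_better=False,
                        warn_at=0.10, fail_at=0.30):
    """Classify a metric change as "ok", "warn", or "fail".

    The change is expressed as a fraction of the baseline, positive when
    performance got worse. Thresholds are placeholders, not decided policy.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    if higher_is_better:
        worse_by = (baseline - current) / baseline   # e.g. tok/s went down
    else:
        worse_by = (current - baseline) / baseline   # e.g. TTFT went up
    if worse_by >= fail_at:
        return "fail"
    if worse_by >= warn_at:
        return "warn"
    return "ok"
```

Under these placeholder thresholds, TTFT going from 200ms to 230ms would warn, and going to 270ms would fail the build.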
Cross-references
- Profile + device-fit definitions: 02-manifest.md
- Runtime memory management: 07-runtime.md
- Storage backends: 08-storage.md
- Models + quantization choices: 09-models.md
- ADR 0002 (llama.cpp default): ../docs/adr/0002-llamacpp-v1-mlx-v2.md