21 — Performance Budgets

Concrete numerical targets so “good enough” is measurable. Without budgets, performance regressions are invisible and “feels slow” arguments are unwinnable.

Why budgets

Local AI apps live or die on latency. A 200ms vs 500ms first-token time is the difference between “snappy” and “annoying.” Without explicit budgets, each component team optimizes (or doesn’t) independently, and the latency the user perceives is the accumulated result of all of them.

These budgets are:

Targets — what we aim for in v1.
Tested — locara test includes performance assertions.
Per-tier — different budgets for low/mid/high device profiles (see 02-manifest.md).
Enforceable in CI — regressions fail the build.

Reference hardware

All targets are stated for these reference devices (representative of real users in 2026):

Tier	Reference	RAM	GPU/NPU
low	Mac mini M1 (2020)	8 GB	Apple Silicon GPU, 8-core
mid	MacBook Pro M2 (2022)	16 GB	Apple Silicon GPU, 10-core
high	Mac Studio M2 Max (2023)	32 GB	Apple Silicon GPU, 30-core

Newer Apple Silicon (M3, M4) is expected to beat these targets by roughly 20–40%; older Intel Macs are out of scope.

Cold-start budgets

Time from user double-clicking the app to the app’s UI being interactive (first interactive content rendered, accepting input):

Phase	Target (mid)	Target (low)	Target (high)
Process spawn	< 100ms	< 200ms	< 50ms
Tauri webview init	< 300ms	< 500ms	< 200ms
Frontend bundle load	< 400ms	< 700ms	< 200ms
First model preload (background)	< 5s	< 10s	< 3s
Total to interactive	< 1.5s	< 2.5s	< 1s

Models load lazily by default — the app is interactive before models are loaded. Eager-load is opt-in via manifest.
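A minimal sketch of how the cold-start table could be turned into a check. The dictionary layout, phase keys, and function name are hypothetical and not part of locara test; the numbers are copied from the table above. The background model preload is left out because it does not block interactivity.

```python
# Cold-start budgets in milliseconds, copied from the table above.
# Phase keys are hypothetical names chosen for this sketch.
COLD_START_BUDGET_MS = {
    "low":  {"process_spawn": 200, "webview_init": 500, "bundle_load": 700, "total": 2500},
    "mid":  {"process_spawn": 100, "webview_init": 300, "bundle_load": 400, "total": 1500},
    "high": {"process_spawn": 50,  "webview_init": 200, "bundle_load": 200, "total": 1000},
}


def check_cold_start(tier, measured_ms):
    """Return a list of budget violations for one cold-start measurement.

    `measured_ms` maps phase name -> measured milliseconds. Budgets are
    strict upper bounds ("< 100ms"), so a value equal to the budget fails.
    """
    budget = COLD_START_BUDGET_MS[tier]
    violations = []
    for phase, value in measured_ms.items():
        if value >= budget[phase]:
            violations.append(f"{phase}: {value}ms (budget {budget[phase]}ms)")
    return violations
```

For the mid tier the three foreground phases add up to 800ms against a 1.5s total, so about 700ms is left for work that falls between the named phases.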

Inference budgets

For text-to-text via llama.cpp:

Model size	Target (mid, M2)	Target (low, M1)	Target (high, M2 Max)
3B Q4	TTFT < 200ms; sustained > 50 tok/s	TTFT < 400ms; > 25 tok/s	TTFT < 100ms; > 80 tok/s
7B Q4	TTFT < 350ms; > 30 tok/s	(not recommended on low)	TTFT < 200ms; > 50 tok/s
14B Q4	(not recommended on mid)	(not on low)	TTFT < 400ms; > 25 tok/s

TTFT = time to first token. Numbers are for short prompts (<512 tokens); long-context inference is materially slower and has a separate budget.
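As a sketch of how the two numbers in each cell might be derived from one generation run: the function below takes the time the request was sent and the arrival time of each token. The definition of "sustained" used here (tokens after the first, divided by the time between the first and last token) is an assumption; the document does not pin it down.

```python
def ttft_and_throughput(request_sent_s, token_times_s):
    """Compute (TTFT in ms, sustained tokens/s) from token arrival timestamps.

    TTFT is the gap between sending the request and the first token.
    Sustained throughput counts tokens after the first, over the time
    between the first and last token, so prompt processing is excluded.
    """
    if not token_times_s:
        raise ValueError("no tokens were generated")
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    decode_window_s = token_times_s[-1] - token_times_s[0]
    if len(token_times_s) < 2 or decode_window_s <= 0:
        return ttft_ms, None  # not enough tokens to measure a rate
    tokens_per_s = (len(token_times_s) - 1) / decode_window_s
    return ttft_ms, tokens_per_s
```

Separating the two measurements keeps a slow prompt-processing phase from dragging down the reported decode rate, and vice versa.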

For embeddings:

Model	Target latency (per 1k texts, mid)
nomic-embed-text-v1.5	< 5s
bge-large-en-v1.5	< 10s

For STT:

Model	Target latency (per 1 minute audio, mid)
Whisper-base-Q4	< 5s
Whisper-large-v3-Q4	< 30s

For OCR:

Model	Target latency (per page, mid)
GLM-OCR-1.5	< 3s
RapidOCR	< 1s

These are performance floors: if a model on Locara performs worse than these on its target tier, we re-quantize it, choose a different model variant, or de-list it.

Storage budgets

SQLite query latency on a typical app database (~10k–100k rows):

Operation	Target (mid)
Simple `SELECT *`	< 1ms
Indexed lookup by primary key	< 1ms
Join across 2 indexed tables	< 5ms
FTS5 keyword search	< 50ms
sqlite-vec similarity search (k=5, 10k vectors)	< 20ms
sqlite-vec similarity search (k=5, 100k vectors)	< 100ms
Hybrid search (FTS5 + vec, rank fusion)	< 150ms

If an app’s actual query latency exceeds these by more than 2x, it’s a code smell, usually a missing index or an unbounded scan.
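A rough sketch of measuring one row of the table and diagnosing a slow query, using Python's standard sqlite3 module. The notes table is a hypothetical stand-in for an app database, and an in-memory database will be faster than a disk-backed one, so treat the timing as illustrative only.

```python
import sqlite3
import statistics
import time


def median_query_ms(conn, sql, params=(), runs=200):
    """Run a query repeatedly and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()  # fetchall forces full evaluation
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def query_plan(conn, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


# Hypothetical table standing in for "a typical app database".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (id, body) VALUES (?, ?)",
    ((i, f"note {i}") for i in range(10_000)),
)
conn.commit()

pk_sql = "SELECT body FROM notes WHERE id = ?"
scan_sql = "SELECT id FROM notes WHERE body = ?"  # no index on body

pk_lookup_ms = median_query_ms(conn, pk_sql, (4242,))
print(f"Indexed lookup by primary key: {pk_lookup_ms:.3f}ms (budget 1ms)")
print(query_plan(conn, pk_sql, (4242,)))      # plan mentions SEARCH
print(query_plan(conn, scan_sql, ("note",)))  # plan mentions SCAN
```

A plan line containing SCAN on a large table is the usual sign of the missing index mentioned above; SEARCH indicates an index or primary-key lookup.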

Memory budgets

Per-app memory ceiling, derived from the app’s declared profile:

Profile	RAM budget (app + loaded models)
low (declares min_ram_gb: 8)	≤ 4 GB
mid (declares min_ram_gb: 16)	≤ 8 GB
high (declares min_ram_gb: 32)	≤ 20 GB

Each budget is what remains after reserving OS headroom (~25% of system RAM kept free) and assuming 1–2 other apps are open.

The runtime tracks each app’s resident set + loaded model size and:

Warns at 80% of budget.
Unloads least-recently-used model at 95%.
Refuses new model load + surfaces error at 100%.
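The three thresholds above can be sketched as a small decision function plus an LRU-ordered record of loaded models. The names memory_action and LoadedModels are hypothetical; the actual runtime interface is described in 07-runtime.md and may differ.

```python
from collections import OrderedDict


def memory_action(used_gb, budget_gb):
    """Map current usage (resident set + loaded models) to a runtime action."""
    ratio = used_gb / budget_gb
    if ratio >= 1.0:
        return "refuse_load"  # refuse new model loads and surface an error
    if ratio >= 0.95:
        return "unload_lru"   # evict the least-recently-used model
    if ratio >= 0.80:
        return "warn"
    return "ok"


class LoadedModels:
    """Tracks loaded models in least-recently-used order (oldest first)."""

    def __init__(self):
        self._models = OrderedDict()  # model name -> size in GB

    def touch(self, name, size_gb):
        """Record a load or use; moves the model to the most-recent end."""
        self._models[name] = size_gb
        self._models.move_to_end(name)

    def unload_lru(self):
        """Remove and return (name, size_gb) of the least-recently-used model."""
        return self._models.popitem(last=False)

    def total_gb(self):
        """Combined size of all models still loaded."""
        return sum(self._models.values())
```

Checking the thresholds from highest to lowest means each usage level maps to exactly one action.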

Disk budgets

Per-app total disk footprint:

Component	Target (mid app)
App bundle	< 30 MB
Locara runtime overhead	shared, ~30 MB
Bundled models (if any)	< 200 MB
Cached models (shared across apps)	varies; deduped via content addressing
Per-app SQLite DB	grows with use; no upper limit
Logs	< 10 MB; rotated

Apps that need multi-GB models keep their binary small by not bundling them. This is the default: models are fetched at install.
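To illustrate how content addressing deduplicates the shared model cache, here is a sketch that derives a cache location from a file's bytes rather than its name. SHA-256 and the directory layout are assumptions made for this example; the document does not specify the hash or the on-disk layout.

```python
import hashlib
from pathlib import Path


def content_address(model_path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_root, model_path):
    """Hypothetical cache layout: <root>/sha256/<first two hex chars>/<digest>."""
    digest = content_address(model_path)
    return Path(cache_root) / "sha256" / digest[:2] / digest
```

Two apps that request byte-identical model files resolve to the same cache path, so the model is stored once regardless of what each app calls it.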

Battery budgets

Apps doing extended inference drain battery. Targets:

Workload	Target (mid laptop, full charge)
Idle Locara Manager menubar utility (optional, phase 3+)	< 0.5% / hour
Open Locara app, no inference	< 1% / hour
Active LLM chat (user typing, occasional gen)	< 5% / hour
Continuous transcription of meeting	< 15% / hour
Sustained heavy inference (e.g., batch summarization)	< 25% / hour

These are soft targets. Apps doing heavy work should warn the user (“this will use battery; consider plugging in”).

Network budgets

Locara apps with net: false have zero network usage. (This is structural, not performance.) For apps that declare net:

Operation	Target
Per-app update check (Tauri updater)	< 10 KB / day
Registry catalog fetch (browsing)	< 100 KB / session
Model fetch (one-time)	as fast as the user’s connection

The Locara framework itself never makes network calls beyond:

Registry update checks (per-app, daily by default; user-configurable per app).
Model fetch on install (initiated by user installing an app).
Submission to CI on locara publish (developer machine only).

Measuring + enforcement

locara test has a --bench mode that runs a fixed workload and measures against budgets:

$ locara test --bench
✓ Cold start: 1.2s (budget 1.5s)
✓ Chat TTFT: 180ms (budget 200ms)
✓ Chat sustained: 53 tok/s (budget 50 tok/s)
✗ Hybrid search: 200ms (budget 150ms) — REGRESSION

CI runs locara test --bench on every commit. Regressions fail the build.
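A sketch of the comparison behind each report line, assuming a hypothetical Budget record. It is not the locara test implementation; it only shows that latency budgets are upper bounds while throughput budgets are lower bounds, and that a non-zero exit code is what fails the build.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    name: str
    limit: float
    unit: str
    higher_is_better: bool = False  # True for throughput, False for latency


def check(budget, measured):
    """Return (passed, report line) for one measurement against its budget."""
    if budget.higher_is_better:
        passed = measured > budget.limit
    else:
        passed = measured < budget.limit
    mark = "✓" if passed else "✗"
    line = (
        f"{mark} {budget.name}: {measured:g}{budget.unit} "
        f"(budget {budget.limit:g}{budget.unit})"
    )
    return passed, line


def run_checks(results):
    """Print one line per check; return a process exit code (0 = all passed)."""
    all_passed = True
    for budget, measured in results:
        passed, line = check(budget, measured)
        print(line)
        all_passed = all_passed and passed
    return 0 if all_passed else 1
```

A CI step could pass the return value of run_checks to sys.exit so that any failed budget fails the job.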

For published apps, the registry auto-runs benchmarks against the reference hardware tier and publishes results on the app card. Users can see “this app: chat at 35 tok/s on mid hardware” before installing.

Profile picker hygiene

A user with a low-tier device installing an app that targets mid sees:

Transcribe targets 16 GB Macs.
Your Mac has 8 GB. The app may be slow or fail to load some models.

Suggested: low-tier alternative quantizations.
[Install anyway]  [Cancel]

This is the device-fit advisor (already in 15-distribution.md) — performance budgets quantify when to show it.
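A minimal sketch of the advisor's trigger, assuming it compares the device's RAM against the app's declared min_ram_gb. The function names are hypothetical, and the real advisor in 15-distribution.md may weigh more than RAM.

```python
TIER_ORDER = ["low", "mid", "high"]
TIER_MIN_RAM_GB = {"low": 8, "mid": 16, "high": 32}  # from the profile table


def device_tier(device_ram_gb):
    """Highest tier whose min_ram_gb the device satisfies, or None below low."""
    fitting = [t for t in TIER_ORDER if device_ram_gb >= TIER_MIN_RAM_GB[t]]
    return fitting[-1] if fitting else None


def should_warn(app_min_ram_gb, device_ram_gb):
    """Show the advisor when the device has less RAM than the app declares."""
    return device_ram_gb < app_min_ram_gb


def advisor_message(app_name, app_min_ram_gb, device_ram_gb):
    """Return the warning text, or None when the device meets the app's profile."""
    if not should_warn(app_min_ram_gb, device_ram_gb):
        return None
    return (
        f"{app_name} targets {app_min_ram_gb} GB Macs.\n"
        f"Your Mac has {device_ram_gb} GB. "
        "The app may be slow or fail to load some models."
    )
```

Calling advisor_message("Transcribe", 16, 8) reproduces the first two lines of the dialog shown above.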

Open questions

(open) Should we publish public benchmarks comparing Locara apps to cloud equivalents? Pro: marketing. Con: invites cherry-picking criticism. Probably yes, with full methodology disclosure.
(open) Performance regression policy: how large a regression triggers which response? A 10% increase in TTFT might be acceptable; 30% should fail CI.
(open) Are battery budgets enforceable, or just a guideline? Enforcement requires running tests on real laptops, but CI uses GitHub-hosted runners. Probably a guideline plus occasional manual audits.
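One possible shape for the regression policy, offered only as a sketch since the question is still open. It uses the two figures mentioned above as thresholds; treating the range between them as a warning is an assumption, as are the function names.

```python
def relative_regression(baseline, current, higher_is_better=False):
    """Fractional worsening relative to baseline; negative means improvement."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (current - baseline) / baseline
    return -change if higher_is_better else change


def classify(baseline, current, higher_is_better=False, tolerate=0.10, fail_at=0.30):
    """Return "pass", "warn", or "fail" for a metric compared to its baseline."""
    r = relative_regression(baseline, current, higher_is_better)
    if r >= fail_at:
        return "fail"
    if r > tolerate:
        return "warn"
    return "pass"
```

With a 200ms TTFT baseline, 215ms passes, 240ms warns, and 270ms fails; for throughput, pass higher_is_better=True so that a drop in tok/s counts as the regression.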

Cross-references

Profile + device-fit definitions: 02-manifest.md
Runtime memory management: 07-runtime.md
Storage backends: 08-storage.md
Models + quantization choices: 09-models.md
ADR 0002 (llama.cpp default): ../docs/adr/0002-llamacpp-v1-mlx-v2.md