Locara

21 — Performance Budgets

Concrete numerical targets so “good enough” is measurable. Without budgets, performance regressions are invisible and “feels slow” arguments are unwinnable.

Why budgets

Local AI apps live or die on latency. A 200ms vs 500ms first-token time is the difference between “snappy” and “annoying.” Without explicit budgets, every component team optimizes independently (or doesn’t), and the latency the user perceives is the sum of every component’s contribution.

These budgets are:

  • Targets: what we aim for in v1.
  • Tested: locara test includes performance assertions.
  • Per-tier: different budgets for low/mid/high device profiles (see 02-manifest.md).
  • Enforceable in CI: regressions fail the build.

Reference hardware

All targets are stated for these reference devices (representative of real users in 2026):

Tier | Reference | RAM | GPU/NPU
--- | --- | --- | ---
low | Mac mini M1 (2020) | 8 GB | Apple Silicon GPU, 7-core
mid | MacBook Pro M2 (2023) | 16 GB | Apple Silicon GPU, 10-core
high | Mac Studio M2 Max (2023) | 32 GB | Apple Silicon GPU, 30-core

Newer Apple Silicon (M3, M4) exceeds these by 20–40%; older Intel Macs are out of scope.

Cold-start budgets

Time from user double-clicking the app to the app’s UI being interactive (first interactive content rendered, accepting input):

Phase | Target (mid) | Target (low) | Target (high)
--- | --- | --- | ---
Process spawn | < 100ms | < 200ms | < 50ms
Tauri webview init | < 300ms | < 500ms | < 200ms
Frontend bundle load | < 400ms | < 700ms | < 200ms
First model preload (background) | < 5s | < 10s | < 3s
Total to interactive | < 1.5s | < 2.5s | < 1s

Models load lazily by default — the app is interactive before models are loaded. Eager-load is opt-in via manifest.
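
As a rough sketch of how the cold-start table could be checked, the snippet below encodes the per-phase targets and compares one measured run against them. The names `COLD_START_BUDGET_MS` and `check_cold_start` are illustrative assumptions, not part of Locara. Background model preload is left out because it does not block interactivity.

```python
# Hypothetical encoding of the cold-start table (milliseconds); names are illustrative.
COLD_START_BUDGET_MS = {
    "low":  {"process_spawn": 200, "webview_init": 500, "bundle_load": 700, "total": 2500},
    "mid":  {"process_spawn": 100, "webview_init": 300, "bundle_load": 400, "total": 1500},
    "high": {"process_spawn": 50,  "webview_init": 200, "bundle_load": 200, "total": 1000},
}

def check_cold_start(tier, measured_ms):
    """Return the phases (or 'total') that miss their budget.

    measured_ms maps phase name -> milliseconds, with 'total' measured end to end.
    Budgets are strict upper bounds, so a value equal to the limit counts as a miss.
    """
    budget = COLD_START_BUDGET_MS[tier]
    return [name for name, limit in budget.items() if measured_ms[name] >= limit]

run = {"process_spawn": 80, "webview_init": 310, "bundle_load": 350, "total": 1200}
print(check_cold_start("mid", run))  # ['webview_init']
```

Checking phases individually as well as the total makes it possible to see which component consumed the slack even when the overall number still passes.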

Inference budgets

For text-to-text via llama.cpp:

Model size | Target (mid, M2) | Target (low, M1) | Target (high, M2 Max)
--- | --- | --- | ---
3B Q4 | TTFT < 200ms; sustained > 50 tok/s | TTFT < 400ms; > 25 tok/s | TTFT < 100ms; > 80 tok/s
7B Q4 | TTFT < 350ms; > 30 tok/s | (not recommended on low) | TTFT < 200ms; > 50 tok/s
14B Q4 | (not recommended on mid) | (not on low) | TTFT < 400ms; > 25 tok/s

TTFT = time to first token. Numbers are for short prompts (<512 tokens); long-context is materially slower and a separate budget.
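
To make the two metrics concrete, here is a minimal sketch of measuring TTFT and sustained throughput from any token iterator. It does not use the llama.cpp API; `measure_stream` and `within_budget` are hypothetical helpers, and excluding the first token from the throughput figure is one possible convention rather than a stated Locara rule.

```python
import time

def measure_stream(tokens, clock=time.perf_counter):
    """Consume a token iterator; return (ttft_seconds, sustained_tokens_per_second).

    TTFT runs from the call until the first token arrives. Sustained throughput
    counts the tokens after the first over the time between first and last token.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        last = clock()
        if first is None:
            first = last
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_seconds = last - first
    rate = (count - 1) / decode_seconds if decode_seconds > 0 else 0.0
    return first - start, rate

def within_budget(ttft_s, rate, max_ttft_ms, min_tok_per_s):
    """True when both the TTFT and the throughput targets are met."""
    return ttft_s * 1000.0 < max_ttft_ms and rate > min_tok_per_s
```

The `clock` parameter exists so that a test can inject deterministic timestamps instead of relying on wall-clock time.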

For embeddings:

Model | Target latency (per 1k texts, mid)
--- | ---
nomic-embed-text-v1.5 | < 5s
bge-large-en-v1.5 | < 10s

For STT:

Model | Target latency (per 1 minute audio, mid)
--- | ---
Whisper-base-Q4 | < 5s
Whisper-large-v3-Q4 | < 30s

For OCR:

Model | Target latency (per page, mid)
--- | ---
GLM-OCR-1.5 | < 3s
RapidOCR | < 1s

These are floors — if a model on Locara performs worse than these on its target tier, we either re-quantize, choose a different model variant, or de-list it.

Storage budgets

SQLite query latency on a typical app database (~10k–100k rows):

Operation | Target (mid)
--- | ---
SELECT * simple | < 1ms
Indexed lookup by primary key | < 1ms
Join across 2 indexed tables | < 5ms
FTS5 keyword search | < 50ms
sqlite-vec similarity search (k=5, 10k vectors) | < 20ms
sqlite-vec similarity search (k=5, 100k vectors) | < 100ms
Hybrid search (FTS5 + vec, rank fusion) | < 150ms

If an app’s actual query latency exceeds these by 2x, it’s a code smell, usually a missing index or an unbounded scan.
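
A minimal sketch of how an app developer might check a query against these targets, using Python's standard `sqlite3` module. The table, the helper names and the three-way `classify` labels are assumptions for illustration. `EXPLAIN QUERY PLAN` is used to spot full-table scans, which is the usual cause named above.

```python
import sqlite3
import time

def time_query_ms(conn, sql, params=(), repeats=100):
    """Median wall-clock latency of a query in milliseconds, fetching all rows."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        conn.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

def classify(latency_ms, budget_ms):
    """'ok' within budget, 'over' at or above it, 'smell' above 2x the budget."""
    if latency_ms > 2 * budget_ms:
        return "smell"
    return "over" if latency_ms >= budget_ms else "ok"

def uses_full_scan(conn, sql, params=()):
    """True if the query plan contains a SCAN step (no index used for that table)."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return any(row[3].startswith("SCAN") for row in plan)

# Illustrative database: 10k rows, primary key only, no index on body.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO notes (body) VALUES (?)",
                 ((f"note {i}",) for i in range(10_000)))

latency = time_query_ms(conn, "SELECT body FROM notes WHERE id = ?", (4242,))
print(classify(latency, budget_ms=1.0))
print(uses_full_scan(conn, "SELECT id FROM notes WHERE body = ?", ("note 1",)))
```

Timing results depend on the machine, so only the plan check is deterministic; the latency classification is meaningful on the reference hardware for the tier.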

Memory budgets

Per-app memory ceiling, derived from the app’s declared profile:

Profile | RAM budget (app + loaded models)
--- | ---
low (declares min_ram_gb: 8) | ≤ 4 GB
mid (declares min_ram_gb: 16) | ≤ 8 GB
high (declares min_ram_gb: 32) | ≤ 20 GB

The budget is what remains of system RAM after keeping ~25% free as OS headroom and assuming 1–2 other apps are open.
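
One way to read this derivation is as simple arithmetic: system RAM, minus the 25% headroom, minus an allowance for other open apps. The source does not state the per-tier allowance, so the values below (2 GB, 4 GB, 4 GB) are assumptions chosen because they reproduce the table.

```python
def ram_budget_gb(system_ram_gb, other_apps_gb, headroom_fraction=0.25):
    """System RAM minus OS headroom minus an allowance for other open apps."""
    return system_ram_gb * (1 - headroom_fraction) - other_apps_gb

# (tier, system RAM in GB, assumed allowance for other apps in GB)
for tier, ram, other in [("low", 8, 2), ("mid", 16, 4), ("high", 32, 4)]:
    print(tier, ram_budget_gb(ram, other))
# low 4.0
# mid 8.0
# high 20.0
```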

The runtime tracks each app’s resident set + loaded model size and:

  • Warns at 80% of budget.
  • Unloads least-recently-used model at 95%.
  • Refuses new model load + surfaces error at 100%.
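
The three thresholds can be sketched as a small decision function. This is an illustration under assumptions, not the runtime's code: `memory_action`, its return labels and `pick_lru` are hypothetical names, and usage is taken to be resident set plus loaded model size as described above.

```python
def memory_action(used_bytes, budget_bytes):
    """Map current usage to the action for the 80% / 95% / 100% thresholds."""
    ratio = used_bytes / budget_bytes
    if ratio >= 1.00:
        return "refuse_new_loads"
    if ratio >= 0.95:
        return "unload_lru_model"
    if ratio >= 0.80:
        return "warn"
    return "ok"

def pick_lru(last_used):
    """last_used maps model name -> last-use timestamp; return the stalest name."""
    return min(last_used, key=last_used.get)

GB = 1024 ** 3
print(memory_action(6.5 * GB, 8 * GB))             # warn
print(pick_lru({"llm-7b": 120.0, "whisper": 45.0}))  # whisper
```

Checking the highest threshold first keeps the branches mutually exclusive, so each usage level maps to exactly one action.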

Disk budgets

Per-app total disk footprint:

Component | Target (mid app)
--- | ---
App bundle | < 30 MB
Locara runtime overhead | shared, ~30 MB
Bundled models (if any) | < 200 MB
Cached models (shared across apps) | varies; deduped via content addressing
Per-app SQLite DB | grows with use; no upper limit
Logs | < 10 MB; rotated

Apps that need multi-GB models keep their binary small by not bundling them. This is the default: models are fetched at install.
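
To illustrate what deduplication via content addressing means for the shared model cache, here is a minimal sketch. The hash algorithm (SHA-256), the directory layout and the function name `store_model` are all assumptions; the source only says that cached models are content-addressed.

```python
import hashlib
from pathlib import Path

def store_model(cache_dir, data):
    """Store bytes under their SHA-256 digest; identical content is written once."""
    digest = hashlib.sha256(data).hexdigest()
    # Two-character fan-out directory keeps any single directory small.
    path = Path(cache_dir) / digest[:2] / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path
```

Two apps that request the same model file resolve to the same path, so the bytes are stored once. A real implementation for multi-GB files would hash in streamed chunks rather than holding the whole file in memory.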

Battery budgets

Apps doing extended inference drain battery. Targets:

Workload | Target (mid laptop, full charge)
--- | ---
Idle Locara Manager menubar utility (optional, phase 3+) | < 0.5% / hour
Open Locara app, no inference | < 1% / hour
Active LLM chat (user typing, occasional gen) | < 5% / hour
Continuous transcription of meeting | < 15% / hour
Sustained heavy inference (e.g., batch summarization) | < 25% / hour

These are soft targets. Apps doing heavy work should warn the user (“this will use battery; consider plugging in”).
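
The percent-per-hour figures can be derived from two battery readings taken some time apart. The sketch below shows the arithmetic; the function name and the sample readings are illustrative, and how the readings are obtained is outside its scope.

```python
def drain_pct_per_hour(start_pct, end_pct, elapsed_seconds):
    """Battery drain rate in percentage points per hour."""
    return (start_pct - end_pct) * 3600.0 / elapsed_seconds

# Example: 100% -> 98.5% over a 20-minute chat session.
rate = drain_pct_per_hour(100.0, 98.5, 20 * 60)
print(rate)        # 4.5
print(rate < 5.0)  # True: within the active-chat target
```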

Network budgets

Locara apps with net: false have zero network usage. (This is structural, not performance.) For apps that declare net:

Operation | Target
--- | ---
Per-app update check (Tauri updater) | < 10 KB / day
Registry catalog fetch (browsing) | < 100 KB / session
Model fetch (one-time) | as fast as the user’s connection

The Locara framework itself never makes network calls beyond:

  • Registry update checks (per-app, daily by default; user-configurable per app).
  • Model fetch on install (initiated by user installing an app).
  • Submission to CI on locara publish (developer machine only).

Measuring + enforcement

locara test has a --bench mode that runs a fixed workload and measures against budgets:

$ locara test --bench
 Cold start: 1.2s (budget 1.5s)
 Chat TTFT: 180ms (budget 200ms)
 Chat sustained: 53 tok/s (budget 50 tok/s)
 Hybrid search: 200ms (budget 150ms) — REGRESSION

CI runs locara test --bench on every commit. Regressions fail the build.
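
The pass/fail logic behind such a run can be sketched as follows. This is not the implementation of locara test; the tuple layout and the function names are assumptions. The key detail is that some metrics are better when lower (latency) and others when higher (throughput), so each entry carries its direction. The numbers are the ones from the sample run above.

```python
# (name, measured, budget, unit, higher_is_better)
RESULTS = [
    ("Cold start", 1.2, 1.5, "s", False),
    ("Chat TTFT", 180, 200, "ms", False),
    ("Chat sustained", 53, 50, "tok/s", True),
    ("Hybrid search", 200, 150, "ms", False),
]

def regressions(results):
    """Return the names of metrics that miss their budget."""
    failed = []
    for name, measured, budget, _unit, higher_is_better in results:
        ok = measured > budget if higher_is_better else measured < budget
        if not ok:
            failed.append(name)
    return failed

def exit_code(results):
    """Non-zero when any metric regressed, so a CI step would fail the build."""
    return 1 if regressions(results) else 0

print(regressions(RESULTS))  # ['Hybrid search']
print(exit_code(RESULTS))    # 1
```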

For published apps, the registry auto-runs benchmarks against the reference hardware tier and publishes results on the app card. Users can see “this app: chat at 35 tok/s on mid hardware” before installing.

Profile picker hygiene

A user with a low-tier device installing an app that targets mid:

Transcribe targets 16 GB Macs.
Your Mac has 8 GB. The app may be slow or fail to load some models.

Suggested: low-tier alternative quantizations.
[Install anyway]  [Cancel]

This is the device-fit advisor (already in 15-distribution.md) — performance budgets quantify when to show it.
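
A minimal sketch of the trigger condition, assuming the advisor compares the device's RAM with the app's declared min_ram_gb. The function names are hypothetical, and the real advisor in 15-distribution.md may consider more than RAM.

```python
def needs_fit_warning(device_ram_gb, app_min_ram_gb):
    """True when the device has less RAM than the app's declared minimum."""
    return device_ram_gb < app_min_ram_gb

def fit_message(app_name, device_ram_gb, app_min_ram_gb):
    """Return the warning text, or None when the device meets the app's target."""
    if not needs_fit_warning(device_ram_gb, app_min_ram_gb):
        return None
    return (f"{app_name} targets {app_min_ram_gb} GB Macs.\n"
            f"Your Mac has {device_ram_gb} GB. "
            "The app may be slow or fail to load some models.")

print(fit_message("Transcribe", 8, 16))
```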

Open questions

  • (open) Should we publish public benchmarks comparing Locara apps to cloud equivalents? Pro: marketing. Con: invites cherry-picking criticism. Probably yes, with full methodology disclosure.
  • (open) Performance regression policy: how big a regression triggers what? A 10% increase in TTFT might be acceptable; 30% should fail CI.
  • (open) Are battery budgets enforceable, or just a guideline? Enforcement requires running tests on real laptops, but CI uses GitHub-hosted runners. Probably a guideline plus an occasional manual audit.

Cross-references