32 — Resource Policy (Memory, Models, Processes)

How the Locara runtime keeps multiple apps healthy on a Mac with finite RAM, GPU memory, and disk. This is the engineering layer where “5 apps each loading a 4GB model” stops being theoretical and starts crashing laptops.

The fundamental problem

Each Locara app declares a resource profile (low, mid, or high) that implies a min_ram_gb requirement (see 02-manifest.md). The runtime knows how much memory the user’s Mac has. When several apps are open, they compete for:

Resident RAM for loaded model weights.
Process working memory for inference state, KV caches.
GPU / Metal unified memory for compute.
Disk for shared model cache.
CPU / GPU compute for parallel inference requests.

The user’s Mac has a hard ceiling. The runtime decides who wins, who waits, and who fails gracefully.

v1 vs v2 architecture

v1: each app is its own process; macOS coordinates everything

Each Locara app is its own standalone Mac app — its own process. There is no shared Locara runtime process. Apps are independent and macOS handles process scheduling, memory pressure signals, and CPU/GPU contention.

Two apps using Whisper-large-v3 each map the same file from the shared model cache. Because both mappings are read-only and backed by the same file, macOS can serve them from the same physical pages, so RAM duplication is partial, not full. Each app pays for its own KV cache and inference state, but the model weights themselves are shared at the page-cache level.

Implications:

Simple, no IPC complexity, no daemon to manage.
Crashes are isolated to each app.
Standard Mac multi-app behavior — feels native.
mmap-based dedup gives partial sharing of model RAM for free, without a daemon.
Each app independently monitors macOS memory-pressure signals and responds.

v2 (optional): shared-runtime daemon

If profiling reveals RAM duplication is a real problem in heavy multi-app usage, a future optional locara-daemon could:

Hold a single in-memory copy of frequently-used models.
Route inference requests across apps via local IPC.
Provide global RAM budget arbitration.

This would be transparent to apps — the SDK API doesn’t change; the runtime just routes through the daemon when it’s running. Apps would behave identically without it.

v2 is not committed. v1 ships without a daemon. The architecture leaves room for v2 if profiling justifies the operational complexity.

Memory budgets per app

App declares a profile (low, mid, high). The runtime maps profile → soft RAM budget:

Profile	min_ram_gb (system)	App soft budget	Hard limit
low	8 GB	4 GB	5 GB
mid	16 GB	8 GB	10 GB
high	32 GB	20 GB	24 GB

Soft budget: runtime starts evicting LRU models when approached. User sees a “memory pressure” badge in the dev panel.

Hard limit: runtime refuses new model loads. The SDK call returns ResourceNotAvailableError. App should handle gracefully (smaller model, prompt user to close something, etc.).

The budget includes:

Loaded model weights (largest contributor).
Active inference state (KV cache, attention buffers).
App’s own JS heap + Rust working memory.

It excludes:

App’s SQLite WAL (kept on disk).
Tauri webview memory (small).
OS overhead.
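
The table and the two thresholds can be read as a lookup plus a classification. The sketch below is illustrative TypeScript rather than the runtime's actual code: the names BUDGETS and budgetState are hypothetical, and it assumes a threshold counts as reached once usage is at or above it.

```typescript
// Hypothetical names; the GB values are copied from the profile table above.
type Profile = "low" | "mid" | "high";

interface Budget {
  softGb: number; // eviction of LRU models starts here
  hardGb: number; // new model loads are refused here
}

const BUDGETS: Record<Profile, Budget> = {
  low: { softGb: 4, hardGb: 5 },
  mid: { softGb: 8, hardGb: 10 },
  high: { softGb: 20, hardGb: 24 },
};

type BudgetState = "ok" | "evicting" | "refuse-loads";

// usedGb should count only what the budget includes: model weights,
// inference state, and the app's JS heap plus Rust working memory.
function budgetState(profile: Profile, usedGb: number): BudgetState {
  const { softGb, hardGb } = BUDGETS[profile];
  if (usedGb >= hardGb) return "refuse-loads"; // SDK call would fail with ResourceNotAvailableError
  if (usedGb >= softGb) return "evicting";
  return "ok";
}
```

For example, a mid-profile app at 8.5 GB would be in the evicting state, and at 10 GB it would refuse further loads.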

Multi-app coordination (v1)

Each app runs in its own process and is largely unaware of others. macOS handles most coordination:

macOS-level coordination (free)

Memory pressure signals. macOS notifies all running processes when system memory is constrained. Each Locara app’s runtime independently responds (begin evicting LRU models, refuse new loads, etc.).
Process scheduling. macOS’s scheduler arbitrates CPU/GPU between apps fairly.
mmap page sharing. Multiple apps mapping the same model file share physical pages at the OS level. Free dedup.
Per-app sandbox + resource limits. macOS enforces process-level limits independently of Locara.

This is the same coordination Mac users get when running multiple apps; nothing Locara-specific.

Locara-level cooperative awareness (lightweight)

For better UX, Locara apps voluntarily share state via a small file at ~/Library/Application Support/Locara/runtime-state.json (locked write, free read):

Each running Locara app writes its PID + budget on launch, removes on quit.
A new Locara app launching can read this to estimate total memory usage.
A friendly warning surfaces if cumulative budgets exceed available RAM.

This is advisory, not enforcing. Each app still makes its own decisions; the file is just shared situational awareness.

Pre-launch check (best-effort)

When a Locara app launches, the runtime:

Reads the runtime-state file to see other Locara apps.
Sums their soft budgets.
Adds its own.
Compares to total system RAM (minus 25% OS reserve).
If over budget: warns the user via the app’s own UI. “You have Transcribe (8GB) and DocVault (8GB) open. This app will use up to 8GB more. Your Mac has 16GB total. Continue anyway?”

The warning is informational. The user decides; the runtime doesn’t refuse to launch.
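
A minimal sketch of the arithmetic behind this check, assuming each entry in the runtime-state file carries at least a PID, an app name, and a soft budget. The entry shape and the function name preLaunchCheck are hypothetical; only the 25% OS reserve and the "sum the soft budgets" rule come from the steps above.

```typescript
// Hypothetical shape of one entry in runtime-state.json.
interface RuntimeStateEntry {
  pid: number;
  app: string;
  softBudgetGb: number;
}

interface PreLaunchResult {
  usableGb: number;    // total RAM minus the OS reserve
  requestedGb: number; // other apps' soft budgets plus our own
  overBudget: boolean; // true means "warn the user", never "refuse to launch"
}

const OS_RESERVE_FRACTION = 0.25;

function preLaunchCheck(
  others: RuntimeStateEntry[],
  ownSoftBudgetGb: number,
  totalRamGb: number,
): PreLaunchResult {
  const usableGb = totalRamGb * (1 - OS_RESERVE_FRACTION);
  const requestedGb =
    others.reduce((sum, entry) => sum + entry.softBudgetGb, 0) + ownSoftBudgetGb;
  return { usableGb, requestedGb, overBudget: requestedGb > usableGb };
}
```

With the numbers from the example warning (two 8 GB apps already open, 8 GB more requested, 16 GB Mac), the usable figure is 12 GB against 24 GB requested, so the warning is shown.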

Memory pressure signaling

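macOS provides memory-pressure notifications through a dispatch source of type DISPATCH_SOURCE_TYPE_MEMORYPRESSURE, created with dispatch_source_create and masked with DISPATCH_MEMORYPRESSURE_WARN and DISPATCH_MEMORYPRESSURE_CRITICAL. When the system signals warning or critical: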

Each Locara app independently receives the signal.
Each begins evicting its own LRU models.
Critical level: apps stop loading new models; surface “system under pressure” UI.

No central coordinator needed — every app responds to the same OS signal.
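
The per-level behavior can be summarized as a pure mapping from pressure level to response. This is a sketch under assumptions: how the native notification reaches the app's runtime code is not specified here, and the type and function names are hypothetical.

```typescript
// Hypothetical representation of the level reported by the OS signal.
type PressureLevel = "normal" | "warning" | "critical";

interface PressureResponse {
  evictIdleModels: boolean; // begin unloading LRU models
  acceptNewLoads: boolean;  // whether new model loads are allowed
  showPressureUi: boolean;  // surface the "system under pressure" UI
}

function respondToPressure(level: PressureLevel): PressureResponse {
  switch (level) {
    case "normal":
      return { evictIdleModels: false, acceptNewLoads: true, showPressureUi: false };
    case "warning":
      return { evictIdleModels: true, acceptNewLoads: true, showPressureUi: false };
    case "critical":
      return { evictIdleModels: true, acceptNewLoads: false, showPressureUi: true };
  }
}
```

Because every app evaluates the same mapping against the same OS signal, the apps converge on the same behavior without talking to each other.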

Graceful degradation

When an app can’t load a needed model:

ResourceNotAvailableError {
  resource: "memory",
  required: "4 GB for whisper-large-v3-q4",
  available: "1.2 GB free in app budget",
  suggestion: "Try the smaller model variant or close other Locara apps"
}

App authors are encouraged to handle this with fallbacks (smaller model, queue the work, prompt user). Reference apps will demonstrate.
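
One possible fallback pattern on the app side, sketched with the loader passed in as a parameter because the real SDK call is not shown in this document. The error class here is a stand-in with the four fields listed above; loadWithFallback and the smaller model id in the usage note are hypothetical.

```typescript
interface ResourceDetail {
  resource: string;
  required: string;
  available: string;
  suggestion: string;
}

// Stand-in for the SDK error; only the field names come from this document.
class ResourceNotAvailableError extends Error {
  constructor(readonly detail: ResourceDetail) {
    super(detail.suggestion);
    this.name = "ResourceNotAvailableError";
  }
}

function isResourceNotAvailable(err: unknown): err is ResourceNotAvailableError {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "ResourceNotAvailableError"
  );
}

// Tries each candidate model in order of preference and returns the first
// one that loads. Errors other than resource exhaustion propagate at once.
async function loadWithFallback<M>(
  load: (modelId: string) => Promise<M>,
  candidates: string[],
): Promise<{ modelId: string; model: M }> {
  let lastError: unknown = new Error("no candidate models given");
  for (const modelId of candidates) {
    try {
      return { modelId, model: await load(modelId) };
    } catch (err) {
      if (!isResourceNotAvailable(err)) throw err;
      lastError = err; // remember it, then try the next (smaller) model
    }
  }
  throw lastError;
}
```

An app might call this with a list such as ["whisper-large-v3-q4", "whisper-small-q4"] and, if every candidate fails, show the suggestion text from the last error to the user.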

Model loading lifecycle

Lazy by default

When an app is launched, models declared in capabilities.models[] are NOT auto-loaded. The first call that needs a model triggers load.

This avoids “open 5 apps and lose 20GB of RAM doing nothing.”

Eager opt-in

App can declare eager in the manifest for specific models that should preload:

"models": [
  { "id": "whisper-large-v3-q4@sha256:...", "eager": true },
  "qwen-3b-q4@sha256:..."  // lazy
]

Useful for apps where first-call latency matters (live transcription).
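
Lazy and eager loading differ only in when the first load is triggered. A minimal sketch, assuming the runtime wraps each declared model in something like the hypothetical ModelSlot below; the load function stands in for whatever actually reads the weights.

```typescript
// Hypothetical wrapper around one declared model.
class ModelSlot<M> {
  private pending: Promise<M> | null = null;

  constructor(private readonly load: () => Promise<M>, eager = false) {
    // Eager models start loading at app launch; a failed preload is
    // swallowed here and retried on the first real call.
    if (eager) this.get().catch(() => {});
  }

  // First caller triggers the load; later callers share the same promise.
  get(): Promise<M> {
    if (this.pending === null) {
      this.pending = this.load().catch((err) => {
        this.pending = null; // allow a retry after a failed load
        throw err;
      });
    }
    return this.pending;
  }
}
```

Sharing one promise means two concurrent first calls do not load the same weights twice.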

Eviction policy

When budget is approached:

LRU first. The least-recently-used model is unloaded.
Active inference protected. A model currently mid-inference is not evicted; eviction targets idle models.
Eager models last. Models declared eager: true are evicted last (with a warning).
Fail load if eviction insufficient. If even after eviction the new model can’t fit, return error.

Eviction never unloads a model that is mid-inference. It targets idle models only, which avoids latency spikes mid-generation.
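
The four rules can be sketched as a single planning function that picks victims without unloading anything itself. The LoadedModel shape and the name planEviction are hypothetical, and the sketch assumes sizes are known per model.

```typescript
// Hypothetical bookkeeping record for a model currently in RAM.
interface LoadedModel {
  id: string;
  sizeGb: number;
  lastUsedMs: number; // timestamp of the most recent inference
  inferring: boolean; // true while a request is mid-generation
  eager: boolean;     // declared with "eager": true in the manifest
}

// Returns the ids to unload, in order, or null when evicting every idle
// model would still not free neededGb (the load should then fail).
function planEviction(loaded: LoadedModel[], neededGb: number): string[] | null {
  const candidates = loaded
    .filter((m) => !m.inferring) // active inference is protected
    .sort(
      (a, b) =>
        Number(a.eager) - Number(b.eager) || // non-eager models go first
        a.lastUsedMs - b.lastUsedMs,         // then least recently used
    );
  const victims: string[] = [];
  let freedGb = 0;
  for (const m of candidates) {
    if (freedGb >= neededGb) break;
    victims.push(m.id);
    freedGb += m.sizeGb;
  }
  return freedGb >= neededGb ? victims : null;
}
```

A caller would unload the returned ids, warn if any of them was an eager model, and return the load error when the result is null.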

Cache vs loaded

Distinction:

Disk cache: model file on disk in ~/Library/Caches/Locara/models/. Cheap.
RAM-loaded: model in RAM, ready to infer. Expensive.

Eviction unloads from RAM but keeps the disk cache. Reload from disk is fast (seconds, not minutes) because nothing has to be downloaded again.

Disk budgets

Per-app disk:

Component	Limit
App bundle	< 100 MB (CI rejects bigger without justification)
App data dir	unbounded; user manages
Models (shared cache)	no per-app limit; the global cap applies (see Model cache global policy)
Logs	rotated; max 10 MB per app

The optional Locara Manager menubar utility (phase 3+) can surface disk usage per app + global model cache size. In v1, each app’s settings surface its own usage; users manage the shared cache via standard macOS tools (Storage settings, manual deletion).

GC: cached models with refcount = 0 (no installed app references them) are eligible for cleanup after 30 days idle.

CPU / GPU contention

The Apple Silicon GPU shares unified memory with the CPU and offers limited parallelism across processes. Two apps inferring simultaneously largely serialize at the GPU level.

v1 policy: simple FIFO per-process. The OS scheduler handles fairness. Apps trying to start inference while another is in-progress see modest queueing latency (~50-200ms typically).

Future (v2+): the daemon could implement priority queuing, fair scheduling. Out of scope for v1.

Per-app memory limits

The runtime enforces hard limits via:

macOS memory-pressure observation (no single process gets to claim all of system memory).
A custom resource monitor in locara-runtime that tracks loaded model size.
Wasmtime’s built-in memory caps for tool execution.

If an app process exceeds its hard limit:

The app’s own runtime sends SIGTERM to the app process (v1 has no separate runtime process to do it).
App can save state via Tauri’s exit hook (~5 second window).
User sees a notification: “Transcribe used too much memory and was closed. Files are saved.”
User can reopen.

This is the “can’t escape” backstop. Normally, apps stay well within budget.

Model cache global policy

The shared cache at ~/Library/Caches/Locara/models/ has a configurable maximum size, defaulting to 25% of free disk space.

When approaching the cap:

Refcount-zero models evicted first (LRU).
Refcount-positive models with multiple refs are kept.
New downloads pause until cleanup completes.

User can manually evict via standard macOS tools or the optional Locara Manager utility.
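
A sketch of the cap arithmetic and the cleanup order, under two stated assumptions: every model with a positive refcount is kept (the policy above only spells this out for models with multiple refs), and sizes are tracked in bytes. The CacheEntry shape and both function names are hypothetical.

```typescript
// Hypothetical index record for one file in the shared model cache.
interface CacheEntry {
  id: string;
  sizeBytes: number;
  refcount: number;     // number of installed apps referencing the model
  lastAccessMs: number;
}

const DEFAULT_CAP_FRACTION = 0.25; // default cap: 25% of free disk space

function defaultCapBytes(freeDiskBytes: number): number {
  return Math.floor(freeDiskBytes * DEFAULT_CAP_FRACTION);
}

// Decides which unreferenced models to delete so that an incoming download
// fits under the cap. fits === false means the download stays paused.
function planCacheCleanup(
  entries: CacheEntry[],
  capBytes: number,
  incomingBytes: number,
): { evict: string[]; fits: boolean } {
  let total = entries.reduce((sum, e) => sum + e.sizeBytes, 0) + incomingBytes;
  const unreferenced = entries
    .filter((e) => e.refcount === 0)
    .sort((a, b) => a.lastAccessMs - b.lastAccessMs); // LRU first
  const evict: string[] = [];
  for (const e of unreferenced) {
    if (total <= capBytes) break;
    evict.push(e.id);
    total -= e.sizeBytes;
  }
  return { evict, fits: total <= capBytes };
}
```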

Parallel inference within an app

A single app may run multiple inference requests in parallel (e.g., a chat app where the LLM responds while STT transcribes a voice note).

v1 policy:

Each loaded model handles one request at a time in v1 (a single llama.cpp/MLX generation is sequential token by token, even though each step may use several CPU threads or the GPU).
An app can have N requests in flight against N different models, limited by resources.
Same model, multiple requests: serialized within the model’s inference loop.
Total parallelism capped by: app’s RAM budget + GPU compute.

The SDK provides Promise.all(...)-style concurrency naturally; the runtime handles serialization where needed.
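
One way to get "parallel across models, serialized within a model" is a promise chain per model id. This is a sketch of that technique, not the runtime's confirmed design; ModelQueue and infer are hypothetical names.

```typescript
// Runs tasks one after another; a failed task does not block later ones.
class ModelQueue {
  private tail: Promise<unknown> = Promise.resolve();

  run<T>(task: () => Promise<T>): Promise<T> {
    const result = this.tail.then(() => task());
    this.tail = result.catch(() => undefined); // keep the chain alive on errors
    return result;
  }
}

const queues = new Map<string, ModelQueue>();

// Requests against the same model id share a queue and are serialized;
// requests against different model ids proceed independently.
function infer<T>(modelId: string, task: () => Promise<T>): Promise<T> {
  let queue = queues.get(modelId);
  if (queue === undefined) {
    queue = new ModelQueue();
    queues.set(modelId, queue);
  }
  return queue.run(task);
}
```

From the app's point of view nothing changes: it can still wrap several calls in Promise.all, and two calls that target the same model simply complete one after the other.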

Inference cancellation + resource release

When an SDK call is cancelled (AbortController):

Inference loop checks the cancellation token at every token generation.
Stops generation within ~100ms.
Releases KV cache for that request.
Returns AbortError to the caller.

Critical for streaming UIs where the user navigates away mid-generation.

See 33-streaming.md for the protocol details.
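
The cancellation steps can be sketched as a loop that checks a flag before each token and releases the KV cache in a finally block. Everything named here is hypothetical: CancelSignal is a minimal interface that a standard AbortSignal also satisfies, and nextToken and releaseKvCache stand in for the native inference calls. The loop is synchronous to keep the sketch short; a real loop would be asynchronous and yield between tokens.

```typescript
// Minimal view of a cancellation token; AbortSignal has this property too.
interface CancelSignal {
  readonly aborted: boolean;
}

// Stand-in for the error the SDK returns to the caller on cancellation.
class AbortError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "AbortError";
  }
}

function runGeneration(
  nextToken: () => string | null,   // null means generation finished
  signal: CancelSignal,
  onToken: (token: string) => void, // streams each token to the caller
  releaseKvCache: () => void,       // frees per-request inference state
): void {
  try {
    for (;;) {
      // Checked once per token, so a cancel takes effect within one step.
      if (signal.aborted) throw new AbortError("generation cancelled");
      const token = nextToken();
      if (token === null) return;
      onToken(token);
    }
  } finally {
    releaseKvCache(); // runs on completion, cancellation, and errors alike
  }
}
```

Putting the release in finally means the KV cache is freed exactly once on every exit path, which is the property streaming UIs depend on when the user navigates away.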

Resource monitor surface

The dev panel (during locara dev) and the optional Locara Manager utility (phase 3+) show:

Per-app RAM (loaded models + working memory).
Global model cache size.
Disk free / total.
GPU utilization (when measurable).
Inference queue length per app.

Users see what’s eating resources and can act.

What the user sees on resource pressure

UX matters:

Soft pressure: subtle indicator, no interruption.
Approaching limit: banner: “Transcribe is using a lot of memory. Models may unload.”
Hit limit, fail gracefully: app handles it; user sees app’s own messaging.
Hit hard limit, runtime kills: notification: “Transcribe was closed to keep your Mac responsive.”
System pressure (macOS-level): all Locara apps de-prioritize; non-essential models unload.

Avoid spinner-of-death. Avoid silent failures.

Open questions

(open) Per-model RAM accounting — do we track via mmap pages or RSS? Probably mmap pages are the source of truth; RSS for diagnostics.
(open) GPU memory specifically — Apple Silicon’s unified memory makes this less distinct than discrete GPUs, but Metal still has buffer allocations. Track separately?
(open) Should we allow apps to declare priority? E.g., a foreground app gets resource preference over backgrounded ones. Could be abused; defer to v2.
(open) Battery-aware policy — when on battery, prefer smaller models? User-configurable; not v1.

Cross-references

Profile declarations: 02-manifest.md
Capability model: 03-capabilities.md
Runtime architecture: 07-runtime.md
Models layer: 09-models.md
Performance budgets: 21-performance-budgets.md
Streaming + cancellation: 33-streaming.md
Testing this: 30-testing-strategy.md