Locara

30 — Testing Strategy

How the Locara framework itself is tested. (Testing of apps built on Locara is covered by @locara/test; see 05-sdk.md and 06-cli.md.)

The capability enforcement layer is where bugs become security incidents. The test strategy has to be honest about what can fail and be designed accordingly.

Layered test strategy

┌──────────────────────────────────────────────────────────┐
│  E2E tests (real apps run on real runtime)               │
├──────────────────────────────────────────────────────────┤
│  Integration tests (per-crate boundary tests)            │
├──────────────────────────────────────────────────────────┤
│  Unit tests (per-function, deterministic)                │
├──────────────────────────────────────────────────────────┤
│  Property-based tests (capability invariants, fuzzing)   │
├──────────────────────────────────────────────────────────┤
│  Static analysis (clippy, eslint, custom rules)          │
└──────────────────────────────────────────────────────────┘

Each layer catches a different class of bug. Skipping any one shifts the cost of finding those bugs downstream.

Coverage targets

These are enforced in CI:

Layer                                            Coverage target                         Enforcement
-----------------------------------------------  --------------------------------------  -------------------------
locara-core (capability enforcer, model loader)  90%+ line, 85%+ branch                  CI fails below threshold
locara-storage                                   85%+ line                               CI warns; review required
locara-models (fetch, hash verification)         90%+ line                               CI fails below threshold
locara-tools (Wasmtime sandbox)                  90%+ line                               CI fails below threshold
locara-runtime (IPC, lifecycle)                  85%+                                    CI warns
locara-cli                                       70%+                                    CI warns
@locara/sdk                                      85%+                                    CI warns
@locara/components                               70%+ (UI components are partly visual)  CI warns
Reference apps                                   best-effort                             Not enforced

Independent of the per-crate thresholds above, capability-enforcing code paths must be 100% covered. No exceptions. An untested capability check is a security hole.
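The fail/warn split in the table above can be sketched as a small gate function. This is only an illustration: the type names, the `TARGETS` constant, and `build_fails` are hypothetical, only line coverage is modelled (the branch target for locara-core is left out), and the real check would read numbers from whichever coverage tool is eventually chosen.

```rust
/// How CI reacts when a layer falls below its target (per the table above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Enforcement {
    Fail,
    Warn,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GateResult {
    Pass,
    Warning,
    Failure,
}

/// One row of the coverage table. Field names are illustrative.
pub struct Target {
    pub layer: &'static str,
    pub min_line_percent: f64,
    pub enforcement: Enforcement,
}

/// Three rows from the table; the remaining rows follow the same shape.
pub const TARGETS: [Target; 3] = [
    Target { layer: "locara-core", min_line_percent: 90.0, enforcement: Enforcement::Fail },
    Target { layer: "locara-storage", min_line_percent: 85.0, enforcement: Enforcement::Warn },
    Target { layer: "locara-cli", min_line_percent: 70.0, enforcement: Enforcement::Warn },
];

/// Compares measured line coverage against the target for one layer.
/// A NaN measurement compares as "below target", so the gate fails closed.
pub fn evaluate(target: &Target, measured_line_percent: f64) -> GateResult {
    if measured_line_percent >= target.min_line_percent {
        GateResult::Pass
    } else {
        match target.enforcement {
            Enforcement::Fail => GateResult::Failure,
            Enforcement::Warn => GateResult::Warning,
        }
    }
}

/// The build fails if any known layer produces a `Failure`.
/// Layers without a target (for example reference apps) are ignored here.
pub fn build_fails(measured: &[(&str, f64)]) -> bool {
    measured.iter().any(|(layer, percent)| {
        TARGETS
            .iter()
            .find(|t| t.layer == *layer)
            .map(|t| evaluate(t, *percent) == GateResult::Failure)
            .unwrap_or(false)
    })
}
```

With these assumptions, 88% on locara-core fails the build, while 10% on locara-storage only produces a warning.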

Unit tests

Standard per-function tests. Fast, deterministic, no I/O.

Conventions:

  • Rust: #[test] functions in a #[cfg(test)] mod tests at the bottom of the file. cargo test runs them.
  • TypeScript: vitest per package. <file>.test.ts colocated.

Naming: test_<thing_being_tested>_<scenario>_<expected>. Example: test_capability_check_undeclared_net_returns_denied.
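A minimal sketch of the Rust convention and naming scheme. `Manifest`, `Decision`, and `check_net` are hypothetical stand-ins for illustration; the real locara-core types may look quite different.

```rust
/// Outcome of a capability check (illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allowed,
    Denied,
}

/// Capabilities an app declares in its manifest (illustrative subset).
#[derive(Debug, Default)]
pub struct Manifest {
    pub net: bool,
}

/// Returns `Allowed` only when the manifest declares network access.
pub fn check_net(manifest: &Manifest) -> Decision {
    if manifest.net {
        Decision::Allowed
    } else {
        Decision::Denied
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // test_<thing_being_tested>_<scenario>_<expected>
    #[test]
    fn test_capability_check_undeclared_net_returns_denied() {
        let manifest = Manifest::default();
        assert_eq!(check_net(&manifest), Decision::Denied);
    }

    #[test]
    fn test_capability_check_declared_net_returns_allowed() {
        let manifest = Manifest { net: true };
        assert_eq!(check_net(&manifest), Decision::Allowed);
    }
}
```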

Integration tests

Cross-crate / cross-package tests at the public API boundary.

Examples:

  • locara-core + locara-storage: load a model, run inference, write result to sqlite.
  • locara-runtime + locara-tools: app invokes a wasm tool with a scoped capability set.
  • @locara/sdk + locara-runtime: call llm.chat, verify it routes through Tauri IPC and respects manifest declarations.

These tests live in crates/<name>/tests/ (Rust) or packages/<name>/tests/integration/ (TS).

Property-based + fuzzing tests

Critical for the capability enforcer. Use proptest (Rust) and fast-check (TypeScript) to generate inputs.

Invariants we test as properties:

  1. No undeclared capability succeeds. For any randomly-generated manifest + any randomly-generated SDK call, if the manifest doesn’t declare the relevant capability, the call fails.
  2. Scope checks are honored. For any fs.read: ["~/scoped/**"] declaration, no path outside that glob can be read, regardless of how the path is constructed (relative, symlinks, .., URL-encoding tricks).
  3. Net allowlist is honored. For any net: { allowed_hosts: [...] } declaration, no outbound call to a host outside the list succeeds.
  4. Tool capability composition. For any tool requiring capability X, if the hosting app doesn’t declare X, the tool refuses to load.
  5. Capability cool-down monotonicity. A capability declared at version N+1 that’s broader than at version N triggers cool-down; narrowing never does.
  6. Manifest schema invariants. A valid manifest must validate; invalid manifests must fail validation (no false positives or negatives).
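As a rough illustration of invariant 2, the sketch below checks a lexical path-scope resolver against randomly assembled paths. Everything here is an assumption made for the example: `resolve_in_scope` is a hypothetical checker, the hand-rolled xorshift generator only stands in for proptest strategies so the block needs nothing beyond std, and symlinks and URL-encoding are not modelled (those need filesystem-level and decoding-level tests).

```rust
use std::path::{Component, Path, PathBuf};

/// Lexically resolves `requested` against `root`.
/// Returns None if the path is absolute or climbs above `root`.
pub fn resolve_in_scope(root: &Path, requested: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    let mut depth: usize = 0; // number of components currently below root
    for component in Path::new(requested).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                if depth == 0 {
                    return None; // would step above the scoped root
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and platform prefixes are never in scope.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}

/// Tiny deterministic xorshift generator (stand-in for proptest).
pub struct XorShift(u64);

impl XorShift {
    pub fn new(seed: u64) -> Self {
        XorShift(seed.max(1)) // xorshift state must be non-zero
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Builds a path from 1 to 6 segments, including "..", "." and empty ones.
pub fn random_path(rng: &mut XorShift) -> String {
    const SEGMENTS: [&str; 5] = ["..", ".", "notes", "secret", ""];
    let len = (rng.next_u64() % 6) as usize + 1;
    let mut parts = Vec::with_capacity(len);
    for _ in 0..len {
        parts.push(SEGMENTS[(rng.next_u64() % SEGMENTS.len() as u64) as usize]);
    }
    parts.join("/")
}

/// Property: whenever a path is accepted, the result stays under `root`
/// and contains no leftover ".." component.
pub fn check_scope_property(cases: u32) -> Result<(), String> {
    let root = Path::new("/home/user/scoped");
    let mut rng = XorShift::new(0x5EED);
    for _ in 0..cases {
        let requested = random_path(&mut rng);
        if let Some(resolved) = resolve_in_scope(root, &requested) {
            let escapes = !resolved.starts_with(root)
                || resolved.components().any(|c| matches!(c, Component::ParentDir));
            if escapes {
                return Err(format!("escape via {:?} -> {:?}", requested, resolved));
            }
        }
    }
    Ok(())
}
```

A real proptest version would also shrink failing inputs to a minimal counterexample, which this hand-rolled loop does not attempt.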

Fuzz targets (run nightly in CI):

  • Manifest parser (give it garbage, must not crash).
  • Path-scope checker (give it adversarial paths, must not allow escape).
  • Wasm tool invocation (give it malicious wasm, must not let it escape).
  • Static analyzer (give it adversarial source code, must not miss capability uses).

Use cargo-fuzz for Rust (its fuzz targets are built on libfuzzer-sys). Maintain a corpus that grows over time.
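A sketch of what a "must not crash" target for the manifest parser could look like. The `key=value` manifest format, `parse_manifest`, and `fuzz_one` are invented for this example and are not the real Locara manifest format or parser; the point is only that every malformed input ends in an `Err`, never a panic.

```rust
use std::str;

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    NotUtf8,
    MissingEquals { line: usize },
    UnknownKey { line: usize },
    BadValue { line: usize },
}

/// Illustrative manifest with two capabilities.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Manifest {
    pub net: bool,
    pub fs_read: Vec<String>,
}

/// Parses a hypothetical line-based `key=value` manifest.
/// Every failure path returns an error instead of panicking.
pub fn parse_manifest(data: &[u8]) -> Result<Manifest, ParseError> {
    let text = str::from_utf8(data).map_err(|_| ParseError::NotUtf8)?;
    let mut manifest = Manifest::default();
    for (index, raw) in text.lines().enumerate() {
        let line_no = index + 1;
        let line = raw.trim();
        if line.is_empty() {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or(ParseError::MissingEquals { line: line_no })?;
        match key.trim() {
            "net" => {
                manifest.net = value
                    .trim()
                    .parse::<bool>()
                    .map_err(|_| ParseError::BadValue { line: line_no })?;
            }
            "fs.read" => manifest.fs_read.push(value.trim().to_string()),
            // Unknown keys are rejected rather than silently ignored.
            _ => return Err(ParseError::UnknownKey { line: line_no }),
        }
    }
    Ok(manifest)
}

/// Fuzz entry point: any result is fine as long as the call returns.
pub fn fuzz_one(data: &[u8]) {
    let _ = parse_manifest(data);
}
```

Under cargo-fuzz, the body of `fuzz_one` would sit inside `libfuzzer_sys::fuzz_target!(|data: &[u8]| { ... })` in a file under `fuzz/fuzz_targets/`, with the corpus directory seeded from the manifest fixtures.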

End-to-end tests

A small set of high-fidelity tests that:

  1. Build a reference app (Transcribe / DocVault) end-to-end.
  2. Launch it as a real Tauri app in headless mode.
  3. Run user-style interactions via the webview.
  4. Assert capability-correct behavior.

These are slow (~30s per test). They live in apps/transcribe/tests/integration.test.ts and similar.

CI runs E2E only on pre-release branches and nightly, not on every PR.

Capability scenario tests (special category)

For each capability in the spec, we maintain a paired test that:

  1. Declares the capability in a manifest.
  2. Exercises it through the SDK.
  3. Verifies the runtime allows it.
  4. Removes the declaration.
  5. Verifies the runtime denies the same call.

Example: tests/capabilities/net_scope.rs covers net: false, net: true, net: { allowed_hosts: ['api.example.com'] }, etc., across many scenarios.
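The paired allow/deny shape might look roughly like this for the net capability. `NetPolicy`, `check_outbound`, and `paired_scenario_holds` are hypothetical names, and exact case-insensitive host matching is an assumption of the sketch, not a statement about how the runtime matches hosts.

```rust
/// The three net declarations from the example above (illustrative model).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NetPolicy {
    /// net: false, or no declaration at all
    Denied,
    /// net: true
    Any,
    /// net: { allowed_hosts: [...] }
    AllowedHosts(Vec<String>),
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allowed,
    Denied,
}

/// Decides whether an outbound call to `host` is permitted.
/// Hosts must match an allowlist entry exactly (ignoring ASCII case).
pub fn check_outbound(policy: &NetPolicy, host: &str) -> Decision {
    let allowed = match policy {
        NetPolicy::Denied => false,
        NetPolicy::Any => true,
        NetPolicy::AllowedHosts(hosts) => {
            hosts.iter().any(|h| h.eq_ignore_ascii_case(host))
        }
    };
    if allowed {
        Decision::Allowed
    } else {
        Decision::Denied
    }
}

/// Steps 1 to 5: the call is allowed with the declaration present,
/// and the same call is denied once the declaration is removed.
pub fn paired_scenario_holds(declared: &NetPolicy, host: &str) -> bool {
    let with_declaration = check_outbound(declared, host);
    let without_declaration = check_outbound(&NetPolicy::Denied, host);
    with_declaration == Decision::Allowed && without_declaration == Decision::Denied
}
```

Keeping both halves in one helper makes it hard to add the "allowed" test and forget the "denied" one.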

This catalog grows with every new capability. Adding a capability means adding the tests. Non-negotiable.

Adversarial / red-team tests

A subset of tests written from an attacker’s perspective. Examples:

  • Try to escape the macOS sandbox via Tauri IPC quirks.
  • Try to inject undeclared capabilities via crafted manifest fields.
  • Try to read parent-directory files via path traversal.
  • Try to evade static analysis via dead code, eval-like patterns, dynamic imports.
  • Try to exhaust runtime resources (fork bomb in wasm tool, huge model loading).
  • Try to spoof signed artifacts.

Each finding becomes a permanent regression test.

Performance regression tests

locara test --bench runs reference workloads against 21-performance-budgets.md targets:

  • Cold-start time.
  • LLM TTFT (time to first token) for canonical models.
  • SQLite query latency on canonical schemas.
  • Vector search latency at canonical scales.
  • Memory usage during a known workload.

CI compares against the previous release; regressions > 10% fail the build (subject to manual review for false positives).
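The 10% rule can be expressed as a small comparison. This sketch assumes lower-is-better metrics (latency, memory) measured in any consistent unit; `Verdict` and `compare` are hypothetical names, and how the baseline from the previous release is stored is not covered.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    /// The metric got worse by `percent` relative to the baseline.
    Regression { percent: f64 },
}

/// Compares a current measurement against the previous release.
/// Returns None when the inputs cannot be compared meaningfully
/// (non-finite values or a non-positive baseline).
pub fn compare(baseline: f64, current: f64, threshold_percent: f64) -> Option<Verdict> {
    if !(baseline.is_finite() && current.is_finite()) || baseline <= 0.0 {
        return None;
    }
    // Positive means slower or larger than the baseline.
    let percent = (current - baseline) * 100.0 / baseline;
    if percent > threshold_percent {
        Some(Verdict::Regression { percent })
    } else {
        Some(Verdict::Pass)
    }
}
```

A `None` result is best treated as a failed benchmark run rather than a pass, so a broken measurement cannot hide a regression.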

Reference hardware: GitHub-hosted Apple Silicon runners (M-series, when available); otherwise fall back to locally run benchmarks whose results are committed to the repo for trend analysis.

CI matrix

GitHub Actions runs:

Trigger              What runs
-------------------  ----------------------------------------------------------
Every PR             Unit + integration + lint + typecheck on macOS arm64
PR with [bench] tag  Performance regression tests
PR to main           Full unit + integration + property-based + clippy + audit
Nightly              E2E + fuzz + memory-leak detection (valgrind/sanitizers) + security audit (cargo-audit + npm-audit)
Pre-release branch   Everything above + manual smoke tests

Cache: aggressive cargo + pnpm caches; expect <2 min for typical PR CI runs.

Test data + fixtures

  • tests/fixtures/manifests/ — canonical manifests covering each capability combination.
  • tests/fixtures/apps/ — minimal Locara apps used in integration + E2E tests.
  • tests/fixtures/models/ — small dummy models that exercise the loader without requiring real GB-scale weights.
  • tests/fixtures/audio/ — short audio clips for STT testing.
  • tests/fixtures/documents/ — sample PDFs / images for OCR testing.

All fixtures are version-controlled and small (<100MB total).

What we don’t test (explicitly)

  • Apple’s frameworks. We assume macOS App Sandbox works as Apple documents. Bug reports there go to Apple.
  • llama.cpp / MLX. We assume the inference engine produces correct tokens. Bugs there go upstream.
  • Hardware quirks. We don’t test on every M-series chip; we use the GitHub runner pool + dogfooding.
  • Real-world model behavior. We don’t test that “Qwen produces good answers.” We test that we invoke it correctly.

Test-driven security review

When a new feature is added, the developer answers (in PR template):

  1. Does this introduce a new capability or extend an existing one?
  2. If yes, what tests cover the negative case (undeclared = not allowed)?
  3. What property tests cover invariants?
  4. Has the static analyzer been updated?
  5. Has fuzz coverage been considered?

PRs without these answers don’t merge. Pure rote, but it works.

Crash / panic policy

The Locara runtime should not panic on any user input or any reasonable internal state. Panics in core/runtime/storage/models/tools are treated as security-grade bugs because they can be triggered by malicious apps to take down the host.

Use Result types extensively; reserve panic! for genuinely impossible cases (and even then, add a test that exercises the surrounding code to show the case cannot be reached).
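For example, input that arrives from an app should be decoded with checked accessors so that truncated or oversized data becomes an `Err`. The length-prefixed frame below is a made-up format for illustration (the actual IPC goes through Tauri), and `FrameError`, `MAX_PAYLOAD`, and `read_frame` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FrameError {
    TruncatedHeader,
    TruncatedPayload { expected: usize, actual: usize },
    TooLarge { len: usize },
}

/// Upper bound on a single payload (assumed value for the sketch).
pub const MAX_PAYLOAD: usize = 1 << 20;

/// Reads one frame: a 4-byte big-endian length followed by the payload.
/// Uses `get` instead of direct slicing so short input cannot panic.
pub fn read_frame(buf: &[u8]) -> Result<&[u8], FrameError> {
    let header = buf.get(..4).ok_or(FrameError::TruncatedHeader)?;
    // `header` is exactly 4 bytes long here, so these indexes are in bounds.
    let len = u32::from_be_bytes([header[0], header[1], header[2], header[3]]) as usize;
    if len > MAX_PAYLOAD {
        return Err(FrameError::TooLarge { len });
    }
    // In bounds because the header check above succeeded.
    let rest = &buf[4..];
    rest.get(..len).ok_or(FrameError::TruncatedPayload {
        expected: len,
        actual: rest.len(),
    })
}
```

The size cap is checked before any allocation or slicing, so a malicious length field cannot be used to exhaust memory.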

When a panic does occur in the wild:

  • Logged locally.
  • User can opt to share logs via locara doctor --export-logs (no auto-upload).
  • Treated as a critical bug; hotfix released.

Open questions

  • (open) Code coverage tooling for Rust (tarpaulin vs grcov vs llvm-cov). Probably llvm-cov once it stabilizes; otherwise tarpaulin.
  • (open) Snapshot testing for component output? Probably yes for @locara/components; not for capability-enforcement layers.
  • (open) Mutation testing? Useful for the capability enforcer but adds significant CI time. Phase 4+ if at all.
  • (open) Canary releases of the framework — releasing to a small percentage of the userbase first. Probably no until phase 4; framework doesn’t auto-update silently.

Cross-references