30 — Testing Strategy
How the Locara framework itself is tested. (Testing of apps built on Locara is covered by @locara/test; see 05-sdk.md and 06-cli.md.)
The capability enforcement layer is where bugs become security incidents. Testing has to be honest about what can fail and design accordingly.
Layered test strategy
┌──────────────────────────────────────────────────────────┐
│ E2E tests (real apps run on real runtime)                │
├──────────────────────────────────────────────────────────┤
│ Integration tests (per-crate boundary tests)             │
├──────────────────────────────────────────────────────────┤
│ Unit tests (per-function, deterministic)                 │
├──────────────────────────────────────────────────────────┤
│ Property-based tests (capability invariants, fuzzing)    │
├──────────────────────────────────────────────────────────┤
│ Static analysis (clippy, eslint, custom rules)           │
└──────────────────────────────────────────────────────────┘
Every level catches different bugs. Skipping any one shifts cost downstream.
Coverage targets
These are enforced in CI:
| Layer | Coverage target | Enforcement |
|---|---|---|
| locara-core (capability enforcer, model loader) | 90%+ line, 85%+ branch | CI fails below threshold |
| locara-storage | 85%+ line | CI warns; review required |
| locara-models (fetch, hash verification) | 90%+ line | CI fails below threshold |
| locara-tools (Wasmtime sandbox) | 90%+ line | CI fails below threshold |
| locara-runtime (IPC, lifecycle) | 85%+ | CI warns |
| locara-cli | 70%+ | CI warns |
| @locara/sdk | 85%+ | CI warns |
| @locara/components | 70%+ (UI components are partly visual) | CI warns |
| Reference apps | best-effort | Not enforced |
Capability-enforcing code paths are 100% covered. No exceptions. Untested capability check = security hole.
Unit tests
Standard per-function tests. Fast, deterministic, no I/O.
Conventions:
- Rust: #[test] in mod tests at the bottom of the file. cargo test runs them.
- TypeScript: vitest per package. <file>.test.ts colocated.
Naming: test_<thing_being_tested>_<scenario>_<expected>. Example: test_capability_check_undeclared_net_returns_denied.
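A minimal sketch of these conventions in Rust. The `Capability`, `Decision`, and `check_capability` names are hypothetical stand-ins, not the real locara-core API; only the test layout and the naming pattern are taken from the text above.

```rust
/// Capabilities an app can declare (hypothetical subset for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Net,
    FsRead,
}

/// Outcome of a capability check.
#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allowed,
    Denied,
}

/// Allows a call only if its capability appears in the declared list.
pub fn check_capability(declared: &[Capability], requested: Capability) -> Decision {
    if declared.contains(&requested) {
        Decision::Allowed
    } else {
        Decision::Denied
    }
}

// Tests sit in `mod tests` at the bottom of the file and follow
// test_<thing_being_tested>_<scenario>_<expected>.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_capability_check_undeclared_net_returns_denied() {
        let declared = [Capability::FsRead];
        assert_eq!(check_capability(&declared, Capability::Net), Decision::Denied);
    }

    #[test]
    fn test_capability_check_declared_net_returns_allowed() {
        let declared = [Capability::Net];
        assert_eq!(check_capability(&declared, Capability::Net), Decision::Allowed);
    }
}
```

Each test touches no I/O and no shared state, so the suite stays fast and deterministic.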
Integration tests
Cross-crate / cross-package tests at the public API boundary.
Examples:
- locara-core + locara-storage: load a model, run inference, write result to sqlite.
- locara-runtime + locara-tools: app invokes a wasm tool with a scoped capability set.
- @locara/sdk + locara-runtime: call llm.chat, verify it routes through Tauri IPC and respects manifest declarations.
Live in crates/<name>/tests/ (Rust) or packages/<name>/tests/integration/ (TS).
Property-based + fuzzing tests
Critical for the capability enforcer. Use proptest (Rust) and fast-check (TypeScript) to generate inputs.
Invariants we test as properties:
- No undeclared capability succeeds. For any randomly-generated manifest + any randomly-generated SDK call, if the manifest doesn’t declare the relevant capability, the call fails.
- Scope checks are honored. For any fs.read: ["~/scoped/**"] declaration, no path outside that glob can be read, regardless of how the path is constructed (relative, symlinks, .., URL-encoding tricks).
- Net allowlist is honored. For any net: { allowed_hosts: [...] } declaration, no outbound call to a host outside the list succeeds.
- Tool capability composition. For any tool requiring capability X, if the hosting app doesn’t declare X, the tool refuses to load.
- Capability cool-down monotonicity. A capability declared at version N+1 that’s broader than at version N triggers cool-down; narrowing never does.
- Manifest schema invariants. A valid manifest must validate; invalid manifests must fail validation (no false positives or negatives).
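As an illustration of the shape such a property takes, here is a dependency-free sketch of the cool-down monotonicity invariant. It assumes capabilities can be modelled as a flat bitmask (`CapabilitySet`), which is a simplification since real declarations carry scopes, and it uses a small hand-rolled xorshift generator where the real suite would use proptest strategies. All names are hypothetical.

```rust
/// A manifest's declared capabilities as a bitmask, one bit per capability
/// (hypothetical encoding; real scopes are richer than a flat set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CapabilitySet(pub u16);

impl CapabilitySet {
    /// True if every capability in `self` is also in `other`.
    pub fn is_subset_of(self, other: CapabilitySet) -> bool {
        (self.0 & !other.0) == 0
    }
}

/// Cool-down triggers when the new version declares something the old one did not.
pub fn triggers_cool_down(old: CapabilitySet, new: CapabilitySet) -> bool {
    !new.is_subset_of(old)
}

/// Tiny xorshift generator so the sketch needs no external crates.
/// The seed should be non-zero, otherwise every output is zero.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u16(&mut self) -> u16 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 16) as u16
    }
}

/// Checks the property over `cases` random pairs and returns the first
/// counterexample as `(old, new)` if one is found.
pub fn check_cool_down_monotonicity(
    seed: u64,
    cases: usize,
) -> Result<(), (CapabilitySet, CapabilitySet)> {
    let mut rng = XorShift(seed);
    for _ in 0..cases {
        let old = CapabilitySet(rng.next_u16());
        // Masking can only clear bits, so `narrowed` is always a subset of `old`.
        let narrowed = CapabilitySet(old.0 & rng.next_u16());
        if triggers_cool_down(old, narrowed) {
            return Err((old, narrowed));
        }
        // `extra` holds only bits that `old` lacks; any such bit is a broadening.
        let extra = rng.next_u16() & !old.0;
        let broadened = CapabilitySet(old.0 | extra);
        if extra != 0 && !triggers_cool_down(old, broadened) {
            return Err((old, broadened));
        }
    }
    Ok(())
}
```

Returning the counterexample rather than panicking keeps failures reportable; a framework such as proptest would additionally shrink it to a minimal case.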
Fuzz targets (run nightly in CI):
- Manifest parser (give it garbage, must not crash).
- Path-scope checker (give it adversarial paths, must not allow escape).
- Wasm tool invocation (give it malicious wasm, must not let it escape).
- Static analyzer (give it adversarial source code, must not miss capability uses).
Use cargo-fuzz or libfuzzer-sys for Rust. Maintain a corpus that grows over time.
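A rough sketch of what a "must not crash" target looks like for the manifest parser. The `key = value` format and the `parse_manifest` function are stand-ins for illustration, not the real manifest schema. Under cargo-fuzz the body of `fuzz_manifest_parser` would be wrapped in the `fuzz_target!` macro from libfuzzer-sys; it is written as a plain function here so the example needs only the standard library.

```rust
/// Errors the toy parser reports instead of panicking.
#[derive(Debug, PartialEq, Eq)]
pub enum ManifestError {
    NotUtf8,
    MissingSeparator { line: usize },
    EmptyKey { line: usize },
}

/// Parses `key = value` lines (a stand-in format, not the real manifest schema).
/// Blank lines and lines starting with `#` are skipped.
pub fn parse_manifest(data: &[u8]) -> Result<Vec<(String, String)>, ManifestError> {
    let text = std::str::from_utf8(data).map_err(|_| ManifestError::NotUtf8)?;
    let mut entries = Vec::new();
    for (idx, raw) in text.lines().enumerate() {
        let line = raw.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let (key, value) = line
            .split_once('=')
            .ok_or(ManifestError::MissingSeparator { line: idx + 1 })?;
        if key.trim().is_empty() {
            return Err(ManifestError::EmptyKey { line: idx + 1 });
        }
        entries.push((key.trim().to_string(), value.trim().to_string()));
    }
    Ok(entries)
}

/// Body of a fuzz target: any outcome is acceptable as long as parsing returns.
/// With cargo-fuzz this would live inside `fuzz_target!(|data: &[u8]| { ... })`.
pub fn fuzz_manifest_parser(data: &[u8]) {
    let _ = parse_manifest(data);
}
```

The target asserts nothing about the result; the fuzzer reports a finding only when the call panics, aborts, or trips a sanitizer.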
End-to-end tests
A small set of high-fidelity tests that:
- Build a reference app (Transcribe / DocVault) end-to-end.
- Launch it as a real Tauri app in headless mode.
- Run user-style interactions via the webview.
- Assert capability-correct behavior.
These are slow (~30s per test). Live in apps/transcribe/tests/integration.test.ts and similar.
CI runs E2E only on pre-release branches and nightly, not on every PR.
Capability scenario tests (special category)
For each capability in the spec, we maintain a paired test that:
- Declares the capability in a manifest.
- Exercises it through the SDK.
- Verifies the runtime allows it.
- Removes the declaration.
- Verifies the runtime denies the same call.
Example: tests/capabilities/net_scope.rs covers net: false, net: true, net: { allowed_hosts: ['api.example.com'] }, etc., across many scenarios.
This catalog grows with every new capability. Adding a capability means adding the tests. Non-negotiable.
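The paired allow/deny pattern might look roughly like this for the net capability. The `NetPolicy` and `Manifest` types and the `net_allowed` function are hypothetical, modelled on the three declaration forms mentioned above; the real runtime check is assumed to be more involved.

```rust
/// Net declaration forms from a manifest (modelled on `net: false`, `net: true`,
/// and `net: { allowed_hosts: [...] }`; the Rust types are hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NetPolicy {
    Disabled,
    Unrestricted,
    AllowedHosts(Vec<String>),
}

/// A manifest with an optional net declaration; `None` means undeclared.
#[derive(Debug, Clone, Default)]
pub struct Manifest {
    pub net: Option<NetPolicy>,
}

/// Decides whether an outbound request to `host` may proceed.
pub fn net_allowed(manifest: &Manifest, host: &str) -> bool {
    match &manifest.net {
        None | Some(NetPolicy::Disabled) => false,
        Some(NetPolicy::Unrestricted) => true,
        Some(NetPolicy::AllowedHosts(hosts)) => {
            hosts.iter().any(|h| h.eq_ignore_ascii_case(host))
        }
    }
}

/// Paired scenario: the same call must pass with the declaration
/// and fail once the declaration is removed.
pub fn paired_scenario_holds(declared: NetPolicy, host: &str) -> bool {
    let with_decl = Manifest { net: Some(declared) };
    let without_decl = Manifest::default();
    net_allowed(&with_decl, host) && !net_allowed(&without_decl, host)
}
```

Keeping both halves in one helper makes it hard to add the positive test for a new capability and forget the negative one.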
Adversarial / red-team tests
A subset of tests written from an attacker’s perspective. Examples:
- Try to escape the macOS sandbox via Tauri IPC quirks.
- Try to inject undeclared capabilities via crafted manifest fields.
- Try to read parent-directory files via path traversal.
- Try to evade static analysis via dead code, eval-like patterns, dynamic imports.
- Try to exhaust runtime resources (fork bomb in wasm tool, huge model loading).
- Try to spoof signed artifacts.
Each finding becomes a permanent regression test.
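For instance, a path-traversal finding could be pinned down with a checker like the sketch below. `resolve_in_scope` is a hypothetical helper that works purely lexically: it does not follow symlinks or decode URL-encoded input, both of which a real scope checker would also need to handle.

```rust
use std::path::{Component, Path, PathBuf};

/// Resolves `requested` against `root` lexically and returns the result only
/// if it stays inside `root`. Symlinks are NOT resolved here; a real checker
/// would also need to canonicalize on disk.
pub fn resolve_in_scope(root: &Path, requested: &str) -> Option<PathBuf> {
    let mut resolved = root.to_path_buf();
    // Number of components pushed below `root` so far.
    let mut depth = 0usize;
    for component in Path::new(requested).components() {
        match component {
            Component::Normal(part) => {
                resolved.push(part);
                depth += 1;
            }
            Component::CurDir => {}
            Component::ParentDir => {
                // Climbing above the scope root is an escape attempt.
                if depth == 0 {
                    return None;
                }
                resolved.pop();
                depth -= 1;
            }
            // Absolute paths and Windows prefixes would replace the root entirely.
            Component::RootDir | Component::Prefix(_) => return None,
        }
    }
    Some(resolved)
}
```

Each adversarial input that once slipped through (for example `notes/../../secret`) then becomes one more assertion against this function.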
Performance regression tests
locara test --bench runs reference workloads against 21-performance-budgets.md targets:
- Cold-start time.
- LLM TTFT (time to first token) for canonical models.
- SQLite query latency on canonical schemas.
- Vector search latency at canonical scales.
- Memory usage during a known workload.
CI compares against the previous release; regressions > 10% fail the build (subject to manual review for false positives).
Reference hardware: GitHub-hosted Apple Silicon runners (M-series, when available); fall back to local benchmarks committed to repo for trend analysis.
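The 10% comparison against the previous release can be sketched as follows. The `Measurement` type and the metric names are assumptions for illustration; the sketch also assumes every metric is "lower is better", which holds for the latency and memory figures listed above.

```rust
/// One benchmark measurement; lower is better (e.g. milliseconds or megabytes).
#[derive(Debug, Clone, PartialEq)]
pub struct Measurement {
    pub name: String,
    pub value: f64,
}

/// Relative change from `baseline` to `current`; 0.10 means 10% slower or larger.
pub fn relative_change(baseline: f64, current: f64) -> f64 {
    (current - baseline) / baseline
}

/// Names of metrics whose regression exceeds `threshold`.
/// Metrics are matched by name; ones missing from the baseline are skipped,
/// as are non-positive baselines (the ratio would be meaningless).
pub fn regressions(
    baseline: &[Measurement],
    current: &[Measurement],
    threshold: f64,
) -> Vec<String> {
    current
        .iter()
        .filter_map(|cur| {
            let base = baseline.iter().find(|b| b.name == cur.name)?;
            if base.value > 0.0 && relative_change(base.value, cur.value) > threshold {
                Some(cur.name.clone())
            } else {
                None
            }
        })
        .collect()
}
```

A CI step would fail the build when `regressions(&previous, &current, 0.10)` is non-empty, and the returned names give the reviewer a starting point for the false-positive check.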
CI matrix
GitHub Actions runs:
| Trigger | What runs |
|---|---|
| Every PR | Unit + integration + lint + typecheck on macOS arm64 |
| PR with [bench] tag | Performance regression tests |
| PR to main | Full unit + integration + property-based + clippy + audit |
| Nightly | E2E + fuzz + memory-leak detection (valgrind/sanitizers) + security audit (cargo-audit + npm-audit) |
| Pre-release branch | Everything above + manual smoke tests |
Cache: aggressive cargo + pnpm caches; expect <2 min for typical PR CI runs.
Test data + fixtures
- tests/fixtures/manifests/: canonical manifests covering each capability combination.
- tests/fixtures/apps/: minimal Locara apps used in integration + E2E tests.
- tests/fixtures/models/: small dummy models that exercise the loader without requiring real GB-scale weights.
- tests/fixtures/audio/: short audio clips for STT testing.
- tests/fixtures/documents/: sample PDFs / images for OCR testing.
All fixtures are version-controlled and small (<100MB total).
What we don’t test (explicitly)
- Apple’s frameworks. We assume macOS App Sandbox works as Apple documents. Bug reports there go to Apple.
- llama.cpp / MLX. We assume the inference engine produces correct tokens. Bugs there go upstream.
- Hardware quirks. We don’t test on every M-series chip; we use the GitHub runner pool + dogfooding.
- Real-world model behavior. We don’t test that “Qwen produces good answers.” We test that we invoke it correctly.
Test-driven security review
When a new feature is added, the developer answers (in PR template):
- Does this introduce a new capability or extend an existing one?
- If yes, what tests cover the negative case (undeclared = not allowed)?
- What property tests cover invariants?
- Has the static analyzer been updated?
- Has fuzz coverage been considered?
PRs without these answers don’t merge. Pure rote, but it works.
Crash / panic policy
The Locara runtime should not panic on any user input or any reasonable internal state. Panics in core/runtime/storage/models/tools are treated as security-grade bugs because they can be triggered by malicious apps to take down the host.
Use Result types extensively; reserve panic! for genuinely impossible cases (and even then, add a test that proves they’re impossible).
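A small sketch of what this means in practice when reading untrusted bytes. The length-prefixed layout and the `read_name` function are hypothetical; the point is that `get` and checked arithmetic turn every malformed input into an `Err` rather than an out-of-bounds panic.

```rust
use std::convert::{TryFrom, TryInto};

/// Why a header could not be read.
#[derive(Debug, PartialEq, Eq)]
pub enum HeaderError {
    Truncated,
    LengthOverflow,
}

/// Reads a length-prefixed name: a 4-byte little-endian length, then the payload.
/// (Hypothetical layout.) No input can cause an out-of-bounds index here.
pub fn read_name(bytes: &[u8]) -> Result<&[u8], HeaderError> {
    // `get(..4)` returns None on short input instead of panicking like `bytes[..4]`.
    let len_bytes: [u8; 4] = bytes
        .get(..4)
        .and_then(|s| s.try_into().ok())
        .ok_or(HeaderError::Truncated)?;
    let len = usize::try_from(u32::from_le_bytes(len_bytes))
        .map_err(|_| HeaderError::LengthOverflow)?;
    // An attacker-chosen length must not be able to wrap the end offset.
    let end = 4usize.checked_add(len).ok_or(HeaderError::LengthOverflow)?;
    bytes.get(4..end).ok_or(HeaderError::Truncated)
}
```

Functions written this way are also straightforward fuzz targets, since a panic-free contract is exactly what the fuzzer checks.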
When a panic does occur in the wild:
- Logged locally.
- User can opt to share logs via locara doctor --export-logs (no auto-upload).
- Treated as a critical bug; hotfix released.
Open questions
- (open) Code coverage tooling for Rust (tarpaulin vs grcov vs llvm-cov). Probably llvm-cov once it stabilizes; otherwise tarpaulin.
- (open) Snapshot testing for component output? Probably yes for @locara/components; not for capability-enforcement layers.
- (open) Mutation testing? Useful for the capability enforcer but adds significant CI time. Phase 4+ if at all.
- (open) Canary releases of the framework — releasing to a small percentage of the userbase first. Probably no until phase 4; framework doesn’t auto-update silently.
Cross-references
- Performance budgets being tested: 21-performance-budgets.md
- Capability model being tested: 03-capabilities.md
- Static analyzer details: 31-capability-analyzer.md
- Runtime: 07-runtime.md
- Repo layout: 17-repo-layout.md
- Build pipeline: 16-build.md