27 — Crisis Runbook
What to do when things go wrong. Pre-written playbooks for the high-impact scenarios so a 3am incident becomes a procedure, not a panic.
This is a living document. Update after every real incident. The first useful version is one written before any incident; the best version emerges from experience.
Severity levels
| Severity | Definition | Response time |
|---|---|---|
| SEV-1 | Active harm to users or registry integrity. Examples: malicious app slipped into registry; signing key compromised; CSAM detected. | Immediate (< 1 hour). |
| SEV-2 | Service degraded for many users. Examples: registry down; CDN unavailable; CI broken; Apple cert revoked. | < 4 hours. |
| SEV-3 | Service degraded for some users or non-critical issue. Examples: search broken on website; particular model fetch failing; notarization queue backed up. | < 24 hours. |
| SEV-4 | Cosmetic / non-blocking. Examples: typo in docs; broken external link; minor UI glitch. | Best-effort. |
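The response windows in the table can be encoded so tooling computes an acknowledgement deadline rather than relying on memory at 3am. This is a minimal sketch; the names `RESPONSE_WINDOWS` and `acknowledge_by` are illustrative assumptions, not part of any existing Locara tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Response windows copied from the severity table.
# SEV-4 is best-effort, so it carries no deadline.
RESPONSE_WINDOWS = {
    "SEV-1": timedelta(hours=1),
    "SEV-2": timedelta(hours=4),
    "SEV-3": timedelta(hours=24),
    "SEV-4": None,
}


def acknowledge_by(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable acknowledgement time, or None for best-effort."""
    window = RESPONSE_WINDOWS[severity]  # an unknown severity raises KeyError on purpose
    return None if window is None else detected_at + window


detected = datetime(2025, 1, 1, 3, 0, tzinfo=timezone.utc)
print(acknowledge_by("SEV-1", detected))
```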
General response framework
Every incident follows the same arc:
- Detect. Either reported (user, security researcher, automated alert) or noticed.
- Acknowledge. Within the response window, confirm receipt.
- Assess. Severity, scope, blast radius.
- Contain. Stop bleeding. May involve revocation, takedown, public notice.
- Investigate. Root cause, what failed, who was affected.
- Remediate. Fix the underlying issue.
- Communicate. Public postmortem within 30 days for any SEV-1 or SEV-2.
- Improve. Add safeguards. Update this runbook.
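One way to keep an incident from skipping a step is to track the arc as an ordered sequence. The sketch below is an assumption about how an incident record might look; `Phase` and `Incident` are hypothetical names.

```python
from enum import IntEnum


class Phase(IntEnum):
    DETECT = 1
    ACKNOWLEDGE = 2
    ASSESS = 3
    CONTAIN = 4
    INVESTIGATE = 5
    REMEDIATE = 6
    COMMUNICATE = 7
    IMPROVE = 8


class Incident:
    def __init__(self, title: str, severity: str):
        self.title = title
        self.severity = severity
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self) -> Phase:
        """Move to the next phase in order; phases cannot be skipped."""
        if self.phase is Phase.IMPROVE:
            raise ValueError("incident has already completed the final phase")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase

    def needs_public_postmortem(self) -> bool:
        # Per the framework: public postmortem for any SEV-1 or SEV-2.
        return self.severity in ("SEV-1", "SEV-2")
```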
Specific scenarios
Scenario A — Malicious app in the registry
Detection: User report, automated anomaly detection (v2+), or proactive review.
Initial response (within 1 hour):
- Confirm: is the app actually malicious? What does it do? What scope of harm?
- Activate kill-switch for the affected app version(s). Installed copies of those versions will refuse to launch on next start (the kill-switch is checked at runtime).
- Disable publisher account for the affected publisher pending investigation.
- Take down affected app pages on the registry frontend.
- Public notice on the security advisory page.
Investigation (within 24 hours):
- How did it pass review? What automated checks missed it?
- Is the publisher compromised, or is this a malicious actor?
- What other apps from this publisher exist? Are they suspect?
- What did affected users experience? Any data exposure?
Remediation:
- If account compromise: help legitimate publisher recover; root-cause the auth weakness.
- If malicious actor: permanent ban; legal notice if applicable; report to law enforcement for serious cases (CSAM, malware).
- Improve automated checks to catch similar patterns.
Communication:
- Public postmortem within 30 days.
- Affected users notified via in-app banner the next time they launch the affected app.
- Detailed timeline, what failed, what changed.
Reference: kill-switch mechanics in 14-trust-safety.md.
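As a rough illustration of the runtime check, the sketch below refuses to launch an app version that appears in a revocation list. The actual mechanics are specified in 14-trust-safety.md; the `Revocation` shape and the `"*"` wildcard for "every version" are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Revocation:
    app_id: str
    version: str  # exact version, or "*" to revoke all versions (assumed convention)


def may_launch(app_id: str, version: str, revocations: Iterable[Revocation]) -> bool:
    """Return False if this app version has been kill-switched."""
    for r in revocations:
        if r.app_id == app_id and r.version in (version, "*"):
            return False
    return True


revoked = [Revocation("com.example.notes", "1.4.2")]
print(may_launch("com.example.notes", "1.4.2", revoked))
print(may_launch("com.example.notes", "1.4.3", revoked))
```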
Scenario B — Locara CI signing key compromised
Detection: Anomaly in attestation log, security researcher report, or internal discovery.
Initial response (within 1 hour):
- Stop CI. No new builds until rotation complete.
- Revoke the compromised key — publish revocation in Sigstore log.
- Identify the time window the key was potentially compromised.
- Quarantine all artifacts signed during that window. Each Locara app refuses to install or launch them pending re-verification.
- Public notice on security advisory page within 4 hours.
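Identifying what to quarantine amounts to filtering artifacts by signing key and timestamp. The sketch below assumes each artifact record carries a key identifier and a signing time; the `Artifact` fields are hypothetical and not taken from the attestation log format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class Artifact:
    name: str
    key_id: str
    signed_at: datetime


def to_quarantine(
    artifacts: Iterable[Artifact],
    compromised_key_id: str,
    window_start: datetime,
    window_end: datetime,
) -> List[Artifact]:
    """Select artifacts signed with the compromised key inside the suspect window."""
    return [
        a
        for a in artifacts
        if a.key_id == compromised_key_id and window_start <= a.signed_at <= window_end
    ]


utc = timezone.utc
artifacts = [
    Artifact("app-a-1.0", "ci-key-1", datetime(2025, 3, 1, tzinfo=utc)),
    Artifact("app-b-2.1", "ci-key-1", datetime(2025, 3, 10, tzinfo=utc)),
    Artifact("app-c-0.9", "ci-key-2", datetime(2025, 3, 10, tzinfo=utc)),
]
suspect = to_quarantine(
    artifacts, "ci-key-1", datetime(2025, 3, 5, tzinfo=utc), datetime(2025, 3, 15, tzinfo=utc)
)
print([a.name for a in suspect])
```

When the start of the window is uncertain, erring toward an earlier `window_start` quarantines more artifacts but avoids missing a malicious one.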
Investigation:
- How was the key obtained? (Phishing? Repo leak? Malicious commit? Insider?)
- Were any malicious artifacts signed with the key?
- What’s the blast radius?
Remediation:
- Generate new CI signing key.
- Re-sign legitimate artifacts in the affected window with the new key.
- Update each Locara app’s bundled trust list (via app updates) to include the new key and exclude the old.
- If malicious artifacts were signed: kill-switch them (Scenario A protocol).
- Audit + harden key management.
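The trust-list update can be pictured as swapping one key identifier for another in an immutable set. This is only a sketch: the real trust list format lives in 16-build.md, and `rotate` and `signature_accepted` are hypothetical helpers.

```python
def rotate(trust_list: frozenset, old_key: str, new_key: str) -> frozenset:
    """Return a new trust list that excludes old_key and includes new_key."""
    if old_key not in trust_list:
        raise ValueError(f"{old_key!r} is not in the current trust list")
    return (trust_list - {old_key}) | {new_key}


def signature_accepted(signing_key: str, trust_list: frozenset) -> bool:
    """An artifact is accepted only if its signing key is currently trusted."""
    return signing_key in trust_list


current = frozenset({"ci-key-1"})
updated = rotate(current, "ci-key-1", "ci-key-2")
print(sorted(updated))
```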
Communication:
- Public postmortem within 14 days (faster than 30 because of high stakes).
- Detailed remediation steps published.
Reference: signing details in 16-build.md.
Scenario C — Apple Developer cert revoked
Detection: Notification from Apple, or notarization API returns errors.
Initial response (within 4 hours):
- Identify why. Apple’s reasons: technical (cert expired), policy violation, fraud detection.
- Public notice that Locara CI is paused; existing apps still work; new publishes blocked temporarily.
- Contact Apple to resolve.
Investigation:
- Was a published app flagged for malware? (Then Scenario A applies.)
- Is the cert expired? (Renew.)
- Is there a policy disagreement? (Negotiate.)
Remediation:
- Get new cert / restore existing.
- Re-sign + re-notarize affected artifacts.
- Resume CI.
Communication:
- Status updates every 4 hours during outage.
- Postmortem if outage > 24 hours.
Considerations:
- Apple is a single point of failure for macOS distribution. If we’re permanently locked out, the project pivots: keep shipping on Mac via direct distribution (suboptimal, since users see Gatekeeper warnings) and accelerate Linux/Windows support. The worst case is discussed with maintainers in advance.
Scenario D — Registry frontend / API outage
Detection: Cloudflare alert, user reports, or our own monitoring.
Initial response (within 4 hours):
- Identify cause. Cloudflare? Our DNS? Our backend?
- Status page update — clear, honest message about what’s down.
- Mitigate if possible — Cloudflare’s status pages, fallback, etc.
Investigation + Remediation:
- Standard incident response.
- If backend bug: fix forward.
- If infrastructure: usually wait + monitor.
Communication:
- Status page updates every hour.
- Postmortem if outage > 4 hours.
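The cadence rules here (hourly updates, postmortem past 4 hours) are simple enough to compute. The sketch below takes the interval and threshold as parameters so the same helpers work for other scenarios' values; both function names are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def updates_due(started_at: datetime, now: datetime, interval: timedelta) -> List[datetime]:
    """List every time a status update was due between the outage start and now."""
    due = []
    t = started_at + interval
    while t <= now:
        due.append(t)
        t += interval
    return due


def postmortem_required(started_at: datetime, resolved_at: datetime, threshold: timedelta) -> bool:
    """A postmortem is required when the outage lasted strictly longer than the threshold."""
    return resolved_at - started_at > threshold


start = datetime(2025, 1, 1, 0, 0, tzinfo=timezone.utc)
now = start + timedelta(hours=3, minutes=30)
print(len(updates_due(start, now, timedelta(hours=1))))
```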
Considerations:
- Installed Locara apps don’t depend on the registry to run — only to check for updates. During a registry outage, users keep using their installed apps; only browsing locara.app or fetching updates fails.
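The property described above depends on the update check failing gracefully. A minimal sketch, assuming the registry lookup is passed in as a callable (`fetch_latest` is a hypothetical stand-in, not a real SDK function):

```python
from typing import Callable, Optional


def check_for_update(installed_version: str, fetch_latest: Callable[[], str]) -> Optional[str]:
    """Return a different version if the registry reports one, else None.

    Any network failure is treated as "no update known" so the app still launches.
    """
    try:
        latest = fetch_latest()
    except OSError:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        return None
    return latest if latest != installed_version else None


def registry_down() -> str:
    raise ConnectionError("registry unreachable")


print(check_for_update("1.0.0", registry_down))
```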
Scenario E — Cloudflare R2 / CDN outage
Detection: Apps fail to install (model fetches time out).
Initial response (within 4 hours):
- Status page update.
- Identify cause via Cloudflare’s status.
- If our config issue: fix.
- If Cloudflare issue: wait + monitor.
Mitigation:
- v1: nothing besides waiting.
- Future: backup CDN failover (cost vs benefit decision; probably not until phase 4).
Scenario F — Sigstore / transparency log unavailable
Detection: Provenance verification fails on Locara apps’ first launch.
Initial response (within 4 hours):
- Identify cause. The Sigstore public-good instance is operated by the OpenSSF, a Linux Foundation project.
- Decide: can we keep installs flowing without provenance verification? Probably no, since it’s part of the trust model.
- Status page update.
Mitigation:
- Cached provenance for previously-installed apps (re-installs work).
- New installs may be blocked during outage.
- If the outage is long: consider self-hosting Sigstore (Rekor) as a backup, weighing the operational cost.
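The mitigation above implies a fail-closed decision with a cache exception. The sketch below assumes a verifier callable that raises `OSError` when the log is unreachable; `verify_online` and the digest-keyed cache are illustrative assumptions, not the real verification API.

```python
from typing import Callable, Set


def install_allowed(digest: str, verify_online: Callable[[str], bool], cache: Set[str]) -> bool:
    """Decide whether an artifact may be installed.

    Online verification wins when available. During an outage, only artifacts
    that were verified previously (present in the cache) are allowed.
    """
    try:
        verified = verify_online(digest)
    except OSError:
        return digest in cache  # fail closed for anything not seen before
    if verified:
        cache.add(digest)
    return verified


def log_unavailable(_digest: str) -> bool:
    raise TimeoutError("transparency log unreachable")


seen = {"sha256:aaa"}
print(install_allowed("sha256:aaa", log_unavailable, seen))
print(install_allowed("sha256:bbb", log_unavailable, seen))
```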
Scenario G — Domain hijacked
Detection: Someone registers a similar domain pretending to be Locara, or our registrar account is compromised.
Initial response (within 1 hour):
- Confirm scope. Just a similar domain (typosquat)? Or our actual domain?
- If our domain: contact registrar; freeze the account; recover via TOTP / recovery process.
- If typosquat: report to registrar; legal notice if needed.
Containment:
- Each Locara app pins certificates (cert pinning) for the registry’s manifest API to defend against MITM in this scenario.
- Active typosquats may host malicious downloads.
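Certificate pinning reduces to comparing a fingerprint of the presented certificate against a bundled allow-list. The sketch below hashes the whole DER-encoded certificate with SHA-256; that choice is an assumption, since many deployments pin the public key (SPKI) instead so that pins survive certificate renewal.

```python
import hashlib
from typing import Set


def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(cert_der).hexdigest()


def matches_pin(cert_der: bytes, pinned: Set[str]) -> bool:
    """Accept the connection only if the presented certificate is pinned."""
    return fingerprint(cert_der) in pinned


# Placeholder bytes stand in for real certificates in this example.
legit_cert = b"placeholder-der-bytes-for-registry-cert"
pins = {fingerprint(legit_cert)}
print(matches_pin(legit_cert, pins))
print(matches_pin(b"attacker-cert", pins))
```

Shipping at least one backup pin avoids locking out installed apps when the primary certificate is rotated.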
Public notice:
- “Don’t click links to <typosquat domain>; the real Locara is <domain>.”
Scenario H — GitHub outage (long)
GitHub is a single point of failure for: source repos, CI runners (currently), issue tracking, OAuth login.
Detection: GitHub status page; CI fails.
Initial response:
- Status page update.
- For short outages: wait.
- For long outages: defer publishes; users still install + use existing apps.
Mitigation (long-term):
- Mirror critical repos to GitLab / Codeberg.
- Self-hosted CI runners as backup option (Phase 4+).
- Keep the publisher submission flow functional without GitHub OAuth (anonymous publisher accounts as fallback option).
Scenario I — Project lead incapacitated
The bus-factor scenario.
Initial response (within 7 days):
- Maintainers convene. Confirm incapacitation.
- Activate succession protocol per 23-governance.md.
- Communicate to users — “the project continues; here’s the new lead.”
Containment:
- Project lead’s signing keys, registrar accounts, and Apple Developer Program access are stored in escrow with a documented recovery process.
- No private secrets that only one person knows.
Continuity:
- New lead takes over.
- Project continues with existing maintainers.
This is documented in 23-governance.md but worth restating: we don’t depend on one person being available.
Scenario J — Major framework vulnerability
A bug in @locara/sdk or locara-runtime lets apps escape capability constraints.
Initial response (within 4 hours of report):
- Validate the vulnerability privately.
- Severity assessment. Critical / High / Medium / Low.
- For Critical:
  - Hotfix prepared in private branch.
  - Coordinated disclosure to affected publishers (give them time to update).
  - Public advisory + patched release within 24–72 hours.
- For non-Critical:
  - Standard patch in next release cycle.
  - Public advisory after patch ships.
Communication:
- CVE issued where applicable.
- Affected version range published.
- Mitigation steps for users on old versions.
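Publishing an affected version range is most useful when clients can test against it mechanically. A minimal sketch, assuming plain dotted numeric versions with no pre-release tags and a half-open range from the first affected release up to the first fixed one:

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn "1.10.2" into (1, 10, 2) so comparison is numeric, not lexical."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str, first_affected: str, first_fixed: str) -> bool:
    """True if first_affected <= version < first_fixed."""
    return parse(first_affected) <= parse(version) < parse(first_fixed)


print(is_affected("1.10.0", "1.9.0", "1.10.2"))
```

Comparing tuples rather than strings matters here: as strings, "1.10.0" sorts before "1.9.0".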
Reference: disclosure process in 13-security-privacy.md.
Communication templates
SEV-1 initial advisory
[SEV-1 ADVISORY] <Issue title> — <date>
We're investigating <brief description>.
Affected: <user count or scope>
Mitigation: <what users should do>
Status page: <link>
We will post updates every <frequency> until resolved.
SEV-2 initial advisory
[SEV-2] <Issue title> — <date>
<Brief description>.
Status: investigating | mitigated | resolved
Next update: <time>
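To avoid publishing an advisory with a blank still in it, the angle-bracket placeholders in these templates can be filled programmatically. This is a sketch under the assumption that every `<...>` span in a template is a placeholder; `fill` is a hypothetical helper.

```python
import re
from typing import Dict


def fill(template: str, values: Dict[str, str]) -> str:
    """Replace each <placeholder> with its value; fail loudly if one is missing."""

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value supplied for placeholder <{key}>")
        return values[key]

    return re.sub(r"<([^<>]+)>", substitute, template)


template = "Affected: <user count or scope>\nStatus page: <link>"
print(fill(template, {"user count or scope": "all macOS users", "link": "status.example.org"}))
```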
Postmortem template
See docs/postmortems/template.md (TBD; one per real incident).
Required sections:
- Summary
- Timeline (UTC)
- Root cause
- Impact
- What went well
- What went poorly
- Action items (with owners)
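A small check can confirm a draft postmortem contains every required section before it is published. The sketch assumes postmortems are markdown files whose sections are `#`-style headings, which this runbook does not specify.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Summary",
    "Timeline (UTC)",
    "Root cause",
    "Impact",
    "What went well",
    "What went poorly",
    "Action items",
]


def missing_sections(text: str) -> List[str]:
    """Return required sections that have no matching heading in the draft."""
    headings = {
        line.lstrip("#").strip().lower()
        for line in text.splitlines()
        if line.startswith("#")
    }
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section.lower()) for h in headings)
    ]


draft = "# Summary\n## Timeline (UTC)\n## Root cause\n## Action items (with owners)\n"
print(missing_sections(draft))
```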
Pre-flight checklist (before going public)
Before phase 0 closes, ensure:
- Incident response email (`security@<domain>`) is monitored.
- Status page is set up (`status.<domain>` or similar).
- Project lead has 2FA on GitHub, registrar, Apple Developer, Cloudflare.
- Signing keys are escrow’d with documented recovery.
- Maintainers (when they exist) have on-call rotation defined.
- Public security advisory page exists (`<domain>/security`).
- PGP key for security disclosure published.
What’s not in this runbook (yet)
- Specific monitoring + alerting infrastructure. Depends on tools we adopt (probably Cloudflare’s built-in + GitHub’s email alerts in v1).
- Specific on-call rotation. Until phase 4, the project lead is on-call by default.
- Specific external contacts (Cloudflare support, Apple support, GitHub support). Will document as relationships are built.
Drills
Until incidents happen, the runbook is theoretical. We should run drills quarterly:
- Drill 1: Kill-switch test. Manually mark a test app as “revoked” in the registry; verify installed Locara apps refuse to launch on next start. Reverse the revocation; verify recovery.
- Drill 2: Signing key rotation. Walk through the rotation procedure end-to-end on a test key.
- Drill 3: Status page + comms. Pretend SEV-2 incident; practice the comms cadence.
- Drill 4: Succession. Simulate the project lead being unreachable for a week; can maintainers continue operations?
Drills are documented; gaps surfaced go into action items.
Open questions
- (open) What’s the right severity threshold for paging the project lead at 3am? Initial leaning: SEV-1 only. SEV-2 can wait until morning.
- (open) Bug bounty integration with the runbook? Probably yes once active.
- (open) External SOC / monitoring service for the registry — phase 4+?
Cross-references
- Trust + safety: 14-trust-safety.md (kill-switch, takedowns)
- Security philosophy: 13-security-privacy.md (disclosure process)
- Build pipeline: 16-build.md (CI + signing)
- Governance + succession: 23-governance.md