27 — Crisis Runbook

What to do when things go wrong. Pre-written playbooks for the high-impact scenarios so a 3am incident becomes a procedure, not a panic.

This is a living document. Update after every real incident. The first useful version is one written before any incident; the best version emerges from experience.

Severity levels

Severity	Definition	Response time
SEV-1	Active harm to users or registry integrity. Examples: malicious app slipped into registry; signing key compromised; CSAM detected.	Immediate (< 1 hour).
SEV-2	Service degraded for many users. Examples: registry down; CDN unavailable; CI broken; Apple cert revoked.	< 4 hours.
SEV-3	Service degraded for some users or non-critical issue. Examples: search broken on website; particular model fetch failing; notarization queue backed up.	< 24 hours.
SEV-4	Cosmetic / non-blocking. Examples: typo in docs; broken external link; minor UI glitch.	Best-effort.
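
The response windows above can be kept as data so that tooling computes the acknowledgment deadline instead of a person doing arithmetic at 3am. This is a minimal sketch: `RESPONSE_WINDOWS` and `ack_deadline` are hypothetical names, not existing Locara tooling.

```python
from datetime import datetime, timedelta
from typing import Optional

# Response windows copied from the severity table.
# SEV-4 is best-effort, so it has no deadline.
RESPONSE_WINDOWS = {
    "SEV-1": timedelta(hours=1),
    "SEV-2": timedelta(hours=4),
    "SEV-3": timedelta(hours=24),
    "SEV-4": None,
}


def ack_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest time by which the incident must be acknowledged."""
    window = RESPONSE_WINDOWS[severity]
    return None if window is None else detected_at + window
```

An unknown severity string raises `KeyError`, which is deliberate here: a mistyped severity should fail loudly rather than silently get no deadline.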

General response framework

Every incident follows the same arc:

Detect. Either reported (user, security researcher, automated alert) or noticed.
Acknowledge. Within the response window, confirm receipt.
Assess. Severity, scope, blast radius.
Contain. Stop bleeding. May involve revocation, takedown, public notice.
Investigate. Root cause, what failed, who was affected.
Remediate. Fix the underlying issue.
Communicate. Public postmortem within 30 days for any SEV-1 or SEV-2.
Improve. Add safeguards. Update this runbook.
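
The arc above is strictly ordered, which makes it easy to track in an incident log. The sketch below is one possible encoding. `Phase`, `next_phase`, and `postmortem_required` are illustrative names, not part of any existing tool.

```python
from enum import IntEnum
from typing import Optional, Set


class Phase(IntEnum):
    """The eight steps of the response arc, in order."""
    DETECT = 1
    ACKNOWLEDGE = 2
    ASSESS = 3
    CONTAIN = 4
    INVESTIGATE = 5
    REMEDIATE = 6
    COMMUNICATE = 7
    IMPROVE = 8


def next_phase(completed: Set[Phase]) -> Optional[Phase]:
    """Return the earliest phase not yet completed, or None when the arc is done."""
    for phase in Phase:  # Enum iteration follows definition order
        if phase not in completed:
            return phase
    return None


def postmortem_required(severity: str) -> bool:
    """A public postmortem is required for any SEV-1 or SEV-2."""
    return severity in {"SEV-1", "SEV-2"}
```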

Specific scenarios

Scenario A — Malicious app in the registry

Detection: User report, automated anomaly detection (v2+), or proactive review.

Initial response (within 1 hour):

Confirm: is the app actually malicious? What does it do? What scope of harm?
Activate kill-switch for the affected app version(s). Installed copies of the affected version(s) will refuse to launch on next start (the kill-switch is checked at runtime).
Disable publisher account for the affected publisher pending investigation.
Take down affected app pages on the registry frontend.
Public notice on the security advisory page.

Investigation (within 24 hours):

How did it pass review? What automated checks missed it?
Is the publisher compromised, or is this a malicious actor?
What other apps from this publisher exist? Are they suspect?
What did affected users experience? Any data exposure?

Remediation:

If account compromise: help legitimate publisher recover; root-cause the auth weakness.
If malicious actor: permanent ban; legal notice if applicable; report to law enforcement for serious cases (CSAM, malware).
Improve automated checks to catch similar patterns.

Communication:

Public postmortem within 30 days.
Affected users notified via in-app banner the next time they launch the affected app.
Detailed timeline, what failed, what changed.

Reference: kill-switch mechanics in 14-trust-safety.md.
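
For orientation, the launch-time kill-switch check could look roughly like the sketch below. The authoritative mechanics are in 14-trust-safety.md. Everything here is an assumption made for illustration: the `(app_id, version)` revocation format, the `fetch_revocations` callable, and in particular the choice to fall back to the last cached list when the registry is unreachable.

```python
from typing import Callable, Set, Tuple

# Hypothetical format: a set of (app_id, version) pairs marked revoked.
Revocations = Set[Tuple[str, str]]


def may_launch(
    app_id: str,
    version: str,
    fetch_revocations: Callable[[], Revocations],
    cached: Revocations,
) -> bool:
    """Return False if this app version is on the revocation list."""
    try:
        revoked = fetch_revocations()
    except OSError:
        # Registry unreachable: use the last list we saw (assumed policy).
        revoked = cached
    return (app_id, version) not in revoked
```

Falling back to the cache keeps installed apps usable during a registry outage, at the cost of a revocation not reaching devices that are offline.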

Scenario B — Locara CI signing key compromised

Detection: Anomaly in attestation log, security researcher report, or internal discovery.

Initial response (within 1 hour):

Stop CI. No new builds until rotation complete.
Revoke the compromised key — publish revocation in Sigstore log.
Identify the time window the key was potentially compromised.
Quarantine all artifacts signed during that window. Each Locara app refuses to install or launch them pending re-verification.
Public notice on security advisory page within 4 hours.
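
Selecting the artifacts to quarantine is a filter on signing key and signing time. This sketch assumes each artifact record carries a key identifier and a signing timestamp. The `Artifact` fields and the `to_quarantine` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Artifact:
    name: str
    key_id: str          # identifier of the key that signed the artifact
    signed_at: datetime  # signing time as recorded in the attestation log


def to_quarantine(
    artifacts: Iterable[Artifact],
    compromised_key: str,
    window_start: datetime,
    window_end: datetime,
) -> List[Artifact]:
    """Return artifacts signed by the compromised key inside the window (inclusive)."""
    return [
        a for a in artifacts
        if a.key_id == compromised_key and window_start <= a.signed_at <= window_end
    ]
```

When the start of the window is uncertain, choosing an earlier `window_start` is the safer option: it quarantines more artifacts but misses none.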

Investigation:

How was the key obtained? (Phishing? Repo leak? Malicious commit? Insider?)
Were any malicious artifacts signed with the key?
What’s the blast radius?

Remediation:

Generate new CI signing key.
Re-sign legitimate artifacts in the affected window with the new key.
Update each Locara app’s bundled trust list (via app updates) to include the new key and exclude the old.
If malicious artifacts were signed: kill-switch them (Scenario A protocol).
Audit + harden key management.
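
The trust-list update amounts to removing the old key and adding the new one in a single step. A minimal sketch, treating the bundled trust list as a set of key identifiers (an assumption; the real format is defined in 16-build.md):

```python
from typing import FrozenSet


def rotate_trust_list(
    trusted: FrozenSet[str], old_key: str, new_key: str
) -> FrozenSet[str]:
    """Return a new trust list that excludes old_key and includes new_key."""
    return (trusted - {old_key}) | {new_key}


def is_trusted(key_id: str, trusted: FrozenSet[str]) -> bool:
    """Signature checks accept only keys present in the current trust list."""
    return key_id in trusted
```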

Communication:

Public postmortem within 14 days (faster than the usual 30 because of the high stakes).
Detailed remediation steps published.

Reference: signing details in 16-build.md.

Scenario C — Apple Developer cert revoked

Detection: Notification from Apple, or notarization API returns errors.

Initial response (within 4 hours):

Identify why. Apple’s reasons: technical (cert expired), policy violation, fraud detection.
Public notice that Locara CI is paused; existing apps still work; new publishes blocked temporarily.
Contact Apple to resolve.

Investigation:

Was a published app flagged for malware? (Then Scenario A applies.)
Is the cert expired? (Renew.)
Is there a policy disagreement? (Negotiate.)

Remediation:

Get new cert / restore existing.
Re-sign + re-notarize affected artifacts.
Resume CI.

Communication:

Status updates every 4 hours during outage.
Postmortem if outage > 24 hours.
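
The cadence and the postmortem threshold above are simple time arithmetic. The sketch below uses hypothetical helper names, and the same shape would fit Scenario D with different constants.

```python
from datetime import datetime, timedelta

UPDATE_INTERVAL = timedelta(hours=4)        # status update cadence during the outage
POSTMORTEM_THRESHOLD = timedelta(hours=24)  # postmortem needed beyond this duration


def next_update_due(last_update: datetime) -> datetime:
    """Return when the next status update is due."""
    return last_update + UPDATE_INTERVAL


def needs_postmortem(outage_start: datetime, outage_end: datetime) -> bool:
    """True when the outage lasted strictly longer than the threshold."""
    return outage_end - outage_start > POSTMORTEM_THRESHOLD
```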

Considerations:

Apple is a single point of failure for macOS distribution. If we’re permanently locked out, the project pivots on two tracks: continue on macOS via direct distribution (users see Gatekeeper warnings, which is suboptimal) and accelerate Linux/Windows support. Discuss the worst case with maintainers in advance.

Scenario D — Registry frontend / API outage

Detection: Cloudflare alert, user reports, or our own monitoring.

Initial response (within 4 hours):

Identify cause. Cloudflare? Our DNS? Our backend?
Status page update — clear, honest message about what’s down.
Mitigate if possible — Cloudflare’s status pages, fallback, etc.

Investigation + Remediation:

Standard incident response.
If backend bug: fix forward.
If infrastructure: usually wait + monitor.

Communication:

Status page updates every hour.
Postmortem if outage > 4 hours.

Considerations:

Installed Locara apps don’t depend on the registry to run — only to check for updates. During a registry outage, users keep using their installed apps; only browsing locara.app or fetching updates fails.
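
In code, this property means the update check must never block or fail a launch. A minimal sketch, assuming a hypothetical `fetch_latest` callable that asks the registry for the newest published version:

```python
from typing import Callable, Optional


def check_for_update(
    current_version: str, fetch_latest: Callable[[], str]
) -> Optional[str]:
    """Return the registry's latest version if it differs from the installed one.

    Returns None when the app is up to date or the registry is unreachable,
    so the caller can always proceed to launch.
    """
    try:
        latest = fetch_latest()
    except OSError:
        return None  # registry outage: skip the update check, keep running
    return latest if latest != current_version else None
```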

Scenario E — Cloudflare R2 / CDN outage

Detection: Apps fail to install (model fetches time out).

Initial response (within 4 hours):

Status page update.
Identify cause via Cloudflare’s status.
If our config issue: fix.
If Cloudflare issue: wait + monitor.

Mitigation:

v1: nothing besides waiting.
Future: backup CDN failover (cost vs benefit decision; probably not until phase 4).
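
If the backup CDN is ever adopted, client-side failover could be as simple as trying each base URL in order. The sketch below is not a v1 design. The `fetch` callable and the URLs in the usage example are placeholders.

```python
from typing import Callable, Optional, Sequence


def fetch_with_failover(
    path: str,
    base_urls: Sequence[str],
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each CDN base URL in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for base in base_urls:
        url = f"{base.rstrip('/')}/{path.lstrip('/')}"
        try:
            return fetch(url)
        except OSError as exc:  # includes timeouts and connection errors
            last_error = exc
    raise RuntimeError(f"all CDN endpoints failed for {path}") from last_error
```

A call might look like `fetch_with_failover("models/m.bin", ["https://primary.example", "https://backup.example"], fetch)`, where `fetch` wraps whatever HTTP client the app already uses.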

Scenario F — Sigstore / transparency log unavailable

Detection: Provenance verification fails on Locara apps’ first launch.

Initial response (within 4 hours):

Identify cause. The Sigstore public-good instance is operated by the OpenSSF, a Linux Foundation project.
Decide: can we keep installs flowing without provenance verification? Probably not, since it’s part of the trust model.
Status page update.

Mitigation:

Cached provenance for previously-installed apps (re-installs work).
New installs may be blocked during outage.
If the outage is long: consider self-hosting a Rekor instance (Sigstore’s transparency log) as a backup, weighing the cost.
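
The cached-provenance behaviour can be sketched as below: a successful online check is remembered, and during an outage only previously verified artifacts pass. `verify_online` stands in for the real Sigstore verification call. Both it and the digest-keyed cache are assumptions made for illustration.

```python
from typing import Callable, Dict


def provenance_ok(
    artifact_digest: str,
    verify_online: Callable[[str], bool],
    cache: Dict[str, bool],
) -> bool:
    """Verify provenance online, falling back to the cache if the log is unreachable."""
    try:
        result = verify_online(artifact_digest)
    except OSError:
        # Transparency log unavailable: re-installs of verified artifacts
        # pass, artifacts never seen before are blocked.
        return cache.get(artifact_digest, False)
    cache[artifact_digest] = result
    return result
```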

Scenario G — Domain hijacked

Detection: Someone registers a similar domain pretending to be Locara, or our registrar account is compromised.

Initial response (within 1 hour):

Confirm scope. Just a similar domain (typosquat)? Or our actual domain?
If our domain: contact registrar; freeze the account; recover via TOTP / recovery process.
If typosquat: report to registrar; legal notice if needed.

Containment:

Each Locara app pins certificates for the registry’s manifest API to defend against MITM in this scenario.
Active typosquats may host malicious downloads.
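
As an illustration of the pinning check, the sketch below compares the SHA-256 fingerprint of the server certificate (DER bytes) against a bundled pin set. It assumes whole-certificate pinning. Pinning the public key instead is a common alternative, and the actual choice is not specified here. In Python the DER bytes can be obtained from `ssl.SSLSocket.getpeercert(binary_form=True)`.

```python
import hashlib
import hmac
from typing import Iterable


def fingerprint(cert_der: bytes) -> str:
    """Return the lowercase hex SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def is_pinned(cert_der: bytes, pinned: Iterable[str]) -> bool:
    """True if the certificate's fingerprint matches any pinned fingerprint."""
    actual = fingerprint(cert_der)
    # compare_digest avoids leaking match position through timing
    return any(hmac.compare_digest(actual, pin.lower()) for pin in pinned)
```

A connection whose certificate is not pinned should be rejected even if the system trust store accepts it, which is the protection that matters when DNS for the real domain is under an attacker's control.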

Public notice:

“Don’t click links to <typosquat domain>; the real Locara is <domain>.”

Scenario H — GitHub outage (long)

GitHub is a single point of failure for: source repos, CI runners (currently), issue tracking, OAuth login.

Detection: GitHub status page; CI fails.

Initial response:

Status page update.
For short outages: wait.
For long outages: defer publishes; users still install + use existing apps.

Mitigation (long-term):

Mirror critical repos to GitLab / Codeberg.
Self-hosted CI runners as backup option (Phase 4+).
Keep the publisher submission flow functional without GitHub OAuth (anonymous publisher accounts as fallback option).

Scenario I — Project lead incapacitated

The bus-factor scenario.

Initial response (within 7 days):

Maintainers convene. Confirm incapacitation.
Activate succession protocol per 23-governance.md.
Communicate to users — “the project continues; here’s the new lead.”

Containment:

Project lead’s signing keys, registrar accounts, Apple Developer Program access stored in escrow with documented recovery process.
No private secrets that only one person knows.

Continuity:

New lead takes over.
Project continues with existing maintainers.

This is documented in 23-governance.md but worth restating: we don’t depend on one person being available.

Scenario J — Major framework vulnerability

A bug in @locara/sdk or locara-runtime lets apps escape capability constraints.

Initial response (within 4 hours of report):

Validate the vulnerability privately.
Severity assessment. Critical / High / Medium / Low.
For Critical:
- Hotfix prepared in private branch.
- Coordinated disclosure to affected publishers (give them time to update).
- Public advisory + patched release within 24–72 hours.
For non-Critical:
- Standard patch in next release cycle.
- Public advisory after patch ships.

Communication:

CVE issued where applicable.
Affected version range published.
Mitigation steps for users on old versions.
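
Once the affected range is published, tooling can tell a user whether their installed SDK or runtime version falls inside it. The sketch below assumes plain dotted numeric versions (no pre-release tags) and a half-open range from the first affected version up to, but excluding, the first fixed version. Both are assumptions.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse '1.4.2' into (1, 4, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str, first_affected: str, first_fixed: str) -> bool:
    """True when first_affected <= version < first_fixed."""
    return parse(first_affected) <= parse(version) < parse(first_fixed)
```

Comparing parsed tuples rather than raw strings matters: as strings, "1.10.0" sorts before "1.9.0".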

Reference: disclosure process in 13-security-privacy.md.

Communication templates

SEV-1 initial advisory

[SEV-1 ADVISORY] <Issue title> — <date>

We're investigating <brief description>.

Affected: <user count or scope>
Mitigation: <what users should do>
Status page: <link>

We will post updates every <frequency> until resolved.

SEV-2 initial advisory

[SEV-2] <Issue title> — <date>

<Brief description>.

Status: investigating | mitigated | resolved
Next update: <time>
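
To avoid publishing an advisory with a placeholder still in it, the templates can be filled by a small helper that refuses to render when a field is missing. A minimal sketch; `fill_template` is a hypothetical helper that treats any `<...>` span as a placeholder.

```python
import re
from typing import Dict

# Matches placeholders such as <link> or <user count or scope>.
PLACEHOLDER = re.compile(r"<([^<>]+)>")


def fill_template(template: str, values: Dict[str, str]) -> str:
    """Replace every <placeholder> with its value; raise if any is unfilled."""
    missing = [name for name in PLACEHOLDER.findall(template) if name not in values]
    if missing:
        raise KeyError(f"unfilled placeholders: {missing}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], template)
```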

Postmortem template

See docs/postmortems/template.md (TBD; one per real incident).

Required sections:

Summary
Timeline (UTC)
Root cause
Impact
What went well
What went poorly
Action items (with owners)
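
A draft postmortem can be checked against this list before publication. The sketch below assumes each section title appears on its own line, optionally prefixed with markdown `#` characters. `missing_sections` is an illustrative helper.

```python
from typing import List

REQUIRED_SECTIONS = [
    "Summary",
    "Timeline (UTC)",
    "Root cause",
    "Impact",
    "What went well",
    "What went poorly",
    "Action items (with owners)",
]


def missing_sections(text: str) -> List[str]:
    """Return the required section titles that do not appear as a heading line."""
    headings = {line.strip().lstrip("#").strip().lower() for line in text.splitlines()}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]
```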

Pre-flight checklist (before going public)

Before phase 0 closes, ensure:

Incident response email (security@<domain>) is monitored.
Status page is set up (status.<domain> or similar).
Project lead has 2FA on GitHub, registrar, Apple Developer, Cloudflare.
Signing keys are held in escrow with a documented recovery process.
Maintainers (when they exist) have on-call rotation defined.
Public security advisory page exists (<domain>/security).
PGP key for security disclosure published.

What’s not in this runbook (yet)

Specific monitoring + alerting infrastructure. Depends on tools we adopt (probably Cloudflare’s built-in + GitHub’s email alerts in v1).
Specific on-call rotation. Until phase 4, the project lead is on-call by default.
Specific external contacts (Cloudflare support, Apple support, GitHub support). Will document as relationships are built.

Drills

Until incidents happen, the runbook is theoretical. We should run drills quarterly:

Drill 1: Kill-switch test. Manually mark a test app as “revoked” in the registry; verify installed Locara apps refuse to launch on next start. Reverse the revocation; verify recovery.
Drill 2: Signing key rotation. Walk through the rotation procedure end-to-end on a test key.
Drill 3: Status page + comms. Pretend SEV-2 incident; practice the comms cadence.
Drill 4: Succession. Simulate the project lead being unreachable for a week; can maintainers continue operations?

Drills are documented; gaps surfaced go into action items.

Open questions

(open) What’s the right severity threshold for paging the project lead at 3am? Initial leaning: SEV-1 only. SEV-2 can wait until morning.
(open) Bug bounty integration with the runbook? Probably yes once active.
(open) External SOC / monitoring service for the registry — phase 4+?

Cross-references

Trust + safety: 14-trust-safety.md (kill-switch, takedowns)
Security philosophy: 13-security-privacy.md (disclosure process)
Build pipeline: 16-build.md (CI + signing)
Governance + succession: 23-governance.md