Locara

04 — Modalities & Tooling

The two first-class declarations in a Locara app: which input/output transformations it performs (modalities), and which tools the LLM or app code can call (tooling).

These are the two lenses developers think in when designing an app. Underneath, both expand into the lower-level capabilities, models, and SDK calls. But the developer-facing surface stays at this higher level.

Why two declarations, not one big capability bag

A flat capability list is fine for the runtime to enforce, but it’s terrible for the developer to author or for an AI agent to fill in. “I want to build a voice assistant” should not require knowing:

  • device.microphone: true
  • A whisper-class model in models[]
  • An LLM model in models[]
  • (Maybe) a TTS model
  • Audio recording capability
  • A way to play audio output

Instead, declare:

"modalities": ["voice-to-voice"]

…and the framework expands all of it. The developer can override specifics, but the default expansion produces a working app.

Same for tooling. “I want my chat app to do OCR on uploaded PDFs” should be:

"tooling": ["ocr"]

…not a hand-wired list of OCR model + filesystem capabilities + SDK access.

Modalities

A modality is an input-shape → output-shape transformation. Locara v1 defines a closed taxonomy:

| Modality | Inputs | Outputs | Implies |
| --- | --- | --- | --- |
| text-to-text | text | text (streaming) | LLM model + llm.chat SDK |
| text-to-text-thinking | text | text + reasoning trace | LLM model with reasoning support + llm.chat({ thinking: true }) |
| speech-to-text | audio file or live mic | text + segments | STT model + transcribe.* SDK + (live) device.microphone |
| text-to-speech | text | audio | TTS model + tts.* SDK + audio.play |
| voice-to-voice | live mic | audio | STT + LLM + TTS chain + device.microphone + audio.play |
| image-to-text | image | text description / OCR / structured | VLM or OCR model + vlm.* or ocr.* SDK + fs.user-selected |
| text-to-image | text | image | Image-gen model + image.generate SDK |
| text-to-embedding | text | float vector | Embedding model + embed.* SDK |
| audio-to-embedding | audio | float vector | Audio embedding model + embed.audio SDK |

Each modality is a named bundle with a stable expansion. Adding a new modality to the framework = a one-time decision; apps that adopt it inherit the expansion.

Declaring a modality (simple form)

"modalities": ["speech-to-text", "text-to-text"]

Picks Locara-default models and capabilities. Works for ~80% of apps.

Declaring a modality (overridden form)

"modalities": [
  { 
    "type": "speech-to-text",
    "model": "whisper-large-v3-q4@sha256:abc...",
    "live": true
  },
  {
    "type": "text-to-text",
    "model": "qwen2.5-3b-instruct-q4@sha256:def..."
  }
]

Override picks specific models, enables/disables live mode, etc.
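
Since an entry can be either a bare string or an object, tooling that reads the manifest will likely want a single shape. A minimal sketch of that normalization, assuming a string entry is equivalent to an object with only "type" set. The type and function names (ModalityEntry, normalizeModality) are hypothetical and not part of the Locara SDK:

```typescript
// Hypothetical types for illustration; not the actual Locara manifest schema.
type ModalityOverride = { type: string; model?: string; live?: boolean };
type ModalityEntry = string | ModalityOverride;

// Turn the simple string form into the object form so later stages
// only ever deal with one shape.
function normalizeModality(entry: ModalityEntry): ModalityOverride {
  return typeof entry === "string" ? { type: entry } : { ...entry };
}

// Mixed simple and overridden entries, as a manifest might declare them.
const declared: ModalityEntry[] = [
  "text-to-text",
  { type: "speech-to-text", live: true },
];
const normalized = declared.map(normalizeModality);
console.log(normalized.map((m) => m.type).join(", "));
```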

Modality expansion

When locara dev or locara build reads the manifest, modalities expand into capabilities/models/SDK:

"modalities": ["speech-to-text"]

       ↓ (expansion)

"capabilities": {
  "device.microphone": true,
  "fs.user-selected": "read",   // for file-based transcription
  "models": ["whisper-large-v3-q4@sha256:..."]
},
"sdk_modules": ["transcribe", "audio"]

The expansion is deterministic and inspectable: locara verify --explain shows the full expansion, so developers can see exactly what was added.
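
To make "deterministic" concrete, here is a rough sketch of expansion as a table lookup followed by an order-independent merge. Everything in it is an assumption for illustration: the Expansion shape, the expandModalities name, the placeholder model names, and the "llm" module name are not taken from the framework.

```typescript
// Hypothetical shapes; the real expansion table lives in the framework.
type Expansion = {
  capabilities: Record<string, string | boolean>;
  models: string[];
  sdkModules: string[];
};

// Illustrative entries only, loosely based on the modality table above.
const EXPANSIONS: Record<string, Expansion> = {
  "speech-to-text": {
    capabilities: { "device.microphone": true, "fs.user-selected": "read" },
    models: ["<stt-model>"],
    sdkModules: ["transcribe", "audio"],
  },
  "text-to-text": {
    capabilities: {},
    models: ["<llm-model>"],
    sdkModules: ["llm"],
  },
};

function expandModalities(modalities: string[]): Expansion {
  const out: Expansion = { capabilities: {}, models: [], sdkModules: [] };
  // Sort a copy first so the result does not depend on declaration order.
  for (const name of [...modalities].sort()) {
    const exp = EXPANSIONS[name];
    if (!exp) throw new Error(`Unknown modality: ${name}`);
    // Later entries overwrite earlier ones on a key clash (a simplification).
    Object.assign(out.capabilities, exp.capabilities);
    out.models.push(...exp.models.filter((m) => !out.models.includes(m)));
    out.sdkModules.push(...exp.sdkModules.filter((m) => !out.sdkModules.includes(m)));
  }
  return out;
}

console.log(JSON.stringify(expandModalities(["text-to-text", "speech-to-text"]), null, 2));
```

Sorting before merging is one simple way to get the same output for the same set of declarations; the real framework may achieve determinism differently.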

If a developer wants more granularity than a modality offers (e.g., they want STT but specifically NOT live mode), they use the overridden form or fall back to declaring capabilities directly.

Per-profile modality overrides

Modalities can vary by device profile (see 02-manifest.md profiles):

"modalities": [
  { "type": "text-to-text", "model": "qwen2.5-3b-q4" }
],
"profiles": {
  "low":  { "modalities_override": [{ "type": "text-to-text", "model": "qwen-1.5b-q4" }] },
  "high": { "modalities_override": [{ "type": "text-to-text", "model": "qwen-7b-q4" }] }
}

The runtime picks the right model size for the user’s hardware automatically. The developer thinks in modalities; users get the right model.
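
One plausible reading of the override rule, sketched below: a profile's modalities_override entries replace fields of the base entry with the same "type". The merge-by-type behaviour, the fallback to the base list when a profile has no overrides, and the resolveModalities name are all assumptions, not documented runtime behaviour.

```typescript
// Hypothetical manifest subset, mirroring the snippet above.
type ModalityOverride = { type: string; model?: string };
type Manifest = {
  modalities: ModalityOverride[];
  profiles?: Record<string, { modalities_override?: ModalityOverride[] }>;
};

// Apply a profile's overrides on top of the base list, matching by type.
function resolveModalities(manifest: Manifest, profile: string): ModalityOverride[] {
  const overrides = manifest.profiles?.[profile]?.modalities_override ?? [];
  return manifest.modalities.map((base) => {
    const match = overrides.find((o) => o.type === base.type);
    return match ? { ...base, ...match } : base;
  });
}

const manifest: Manifest = {
  modalities: [{ type: "text-to-text", model: "qwen2.5-3b-q4" }],
  profiles: {
    low: { modalities_override: [{ type: "text-to-text", model: "qwen-1.5b-q4" }] },
    high: { modalities_override: [{ type: "text-to-text", model: "qwen-7b-q4" }] },
  },
};

console.log(resolveModalities(manifest, "low"));
console.log(resolveModalities(manifest, "mid")); // no override declared: base entry is used
```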

Tooling

A tool is something an LLM or the app’s code can call to act on the world. Locara treats tooling as first-class — there’s a curated registry of built-in tools, plus a path for app-specific custom tools.

Built-in tools (v1 catalog)

These are signed, audited, sandboxed wasm modules shipped as part of the Locara registry. Apps opt in by declaration:

| Tool | Does | Capabilities required |
| --- | --- | --- |
| ocr | Extract text + structure from image/PDF | model: GLM-OCR or RapidOCR |
| filesystem.search | Search user-selected directory by name/content | fs.user-selected |
| filesystem.read | Read user-selected file as text/bytes | fs.user-selected |
| filesystem.write | Write to user-selected location | fs.user-selected: "read-write" |
| code-exec.python | Execute Python in a wasm sandbox (Pyodide-style) | none (sandbox-isolated) |
| code-exec.js | Execute JS in a wasm sandbox | none |
| bash.read-only | Run safe read-only shell commands (ls, grep, find) | scoped fs |
| image.resize | Resize/crop images | none |
| image.format-convert | Convert image formats | none |
| pdf.split | Split PDFs by page | none |
| pdf.extract-text | Extract text from PDF | none |
| audio.transcribe | Transcribe audio (uses STT modality if declared) | inherits |
| web.fetch | Fetch a URL | net: { allowed_hosts: [...] } only |
| text.summarize | Summarize via LLM | inherits LLM modality |
| text.translate | Translate via LLM | inherits LLM modality |

This catalog grows over time. Adding a new tool = a one-time PR with the wasm artifact, signed by Locara’s key, with documented capability requirements.

Declaring tooling

"tooling": ["ocr", "filesystem.search", "code-exec.python"]

Each entry must be in the curated registry. Apps opt into specific tools.

The runtime exposes them via the SDK:

import { tools, llm } from '@locara/sdk'  // assumes llm is exported alongside tools

// Direct invocation
const result = await tools.ocr({ source: pdfBlob })

// Or pass to an LLM as callable tools
const response = await llm.chat({
  model: 'qwen2.5-3b',
  messages: [...],
  tools: ['ocr', 'filesystem.search']  // LLM can choose to call
})

Tooling expansion

Just like modalities, tools expand into underlying capabilities + SDK:

"tooling": ["filesystem.search", "ocr"]

       ↓ (expansion)

"capabilities": {
  "fs.user-selected": "read",
  "tools": [
    "wasm.locara.filesystem-search@1.0",
    "wasm.locara.ocr@1.2"
  ],
  "models": ["glm-ocr-1.5-q8@sha256:..."]   // OCR's model dependency added automatically
},
"sdk_modules": ["tools", "fs"]

Capability inheritance: tools cannot exceed app

A tool can never grant capabilities the app didn’t declare. If web.fetch requires net, but the app declares net: false, then web.fetch is unavailable in that app. The framework refuses at locara verify:

✗ Tool "web.fetch" requires capability "net" but the app declares net: false.
   Either: declare "net: { allowed_hosts: [...] }" in capabilities, or remove web.fetch.

This composition rule is the backbone of the trust model — adding a tool can never sneakily expand the app’s reach.
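
The rule amounts to a subset check between what each tool requires and what the app declares. A minimal sketch, assuming a hypothetical TOOL_REQUIRES table and checkTools function (neither is part of the Locara CLI or SDK) and treating a capability as granted when it is declared with any value other than false:

```typescript
type Capabilities = Record<string, unknown>;

// Hypothetical per-tool requirements, loosely based on the catalog above.
const TOOL_REQUIRES: Record<string, string[]> = {
  "web.fetch": ["net"],
  "filesystem.search": ["fs.user-selected"],
  "image.resize": [],
};

// Return one message per problem; an empty array means every tool fits
// inside the capabilities the app declared.
function checkTools(tooling: string[], caps: Capabilities): string[] {
  const errors: string[] = [];
  for (const tool of tooling) {
    const required = TOOL_REQUIRES[tool];
    if (required === undefined) {
      errors.push(`Tool "${tool}" is not in the registry.`);
      continue;
    }
    for (const cap of required) {
      const declared = caps[cap];
      if (declared === undefined || declared === false) {
        errors.push(`Tool "${tool}" requires capability "${cap}" but the app does not grant it.`);
      }
    }
  }
  return errors;
}

console.log(checkTools(["web.fetch", "image.resize"], { net: false }));
```

Because the check only ever compares against the app's own declaration, adding a tool can fail verification but cannot widen what the app is allowed to do.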

Custom tools (app-specific)

For tools not in the curated registry, apps can ship their own wasm tools:

my-app/
├── locara.json
└── tools/
    └── my-custom-tool.wasm

"tooling": [
  "ocr",
  { "name": "my-custom-tool", "path": "./tools/my-custom-tool.wasm", "signature": "...", "capabilities": [...] }
]

Custom tools:

  • Must be signed by the publisher.
  • Declare their own capability requirements.
  • Run in the same Wasmtime sandbox as registry tools.
  • Get reviewed alongside the app at submission time (capability requirements must be a subset of the app’s).

See 10-tools.md for tool runtime details.

The “vibe” of declarations

The three layers, top to bottom:

Modalities + Tooling          ← what developers write
       ↓ (expansion)
Capabilities                  ← what gets enforced
       ↓ (mapping)
macOS entitlements + Tauri permissions + Locara runtime checks

Developers write in the top layer 90% of the time. The middle layer is for power users / unusual cases. The bottom layer is for the runtime.

This means a typical locara.json for a voice-assistant app looks like:

{
  "schema": "locara/v1",
  "name": "voice-assistant",
  "publisher": "kingtongchoo",
  "version": "0.1.0",

  "displayName": "Voice Assistant",
  "description": "Fully-local voice assistant.",
  "license": "Apache-2.0",
  "icon": "./public/icon.png",
  "screenshots": ["./screenshots/main.png"],
  "category": "productivity",

  "modalities": [
    "voice-to-voice"
  ],
  "tooling": [
    "filesystem.search",
    "code-exec.python"
  ],

  "profiles": {
    "mid":  { "min_ram_gb": 16 },
    "high": { "min_ram_gb": 32 }
  },

  "storage": {
    "schema": "./db/schema.sql"
  }
}

That’s the entire manifest. The framework expands it into ~30 lines of capabilities + model declarations under the hood.

AI authoring with modalities + tooling

For agent-friendly authoring (Cursor, Claude, etc.), modalities + tooling are the right surface to expose. An LLM scaffolding a Locara app can produce a small, valid manifest from a natural-language prompt:

“Build me a meeting recorder that transcribes and lets me search past meetings.”

→ modalities: ["speech-to-text"], tooling: ["filesystem.search"], schema with a meetings table.

The LLM doesn’t have to know the entire capability schema; it just picks from the named modality/tool catalog. Because the catalog is closed, a hallucinated “fake capability” cannot slip into a valid manifest: any name outside the catalog is rejected.
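
A closed catalog makes checking agent-written manifests a simple membership test. The sketch below uses the modality names from the v1 table; the unknownNames helper is hypothetical and only illustrates the idea, not how locara verify is implemented.

```typescript
// Modality names from the v1 table above.
const MODALITIES = new Set<string>([
  "text-to-text",
  "text-to-text-thinking",
  "speech-to-text",
  "text-to-speech",
  "voice-to-voice",
  "image-to-text",
  "text-to-image",
  "text-to-embedding",
  "audio-to-embedding",
]);

// Return every declared name that is not in the closed catalog.
function unknownNames(declared: string[], catalog: Set<string>): string[] {
  return declared.filter((name) => !catalog.has(name));
}

// "speech-to-sql" is a made-up name an agent might hallucinate.
const rejected = unknownNames(["speech-to-text", "speech-to-sql"], MODALITIES);
console.log(rejected); // [ 'speech-to-sql' ]
```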

This is why modalities + tooling exist as a layer above capabilities. Capabilities are precise and machine-enforced; modalities + tooling are simple and human/AI-authored.

Versioning

Modalities and tools are versioned independently of the app:

  • text-to-text@1 is the v1 expansion. If we ever change what text-to-text includes, it becomes text-to-text@2.
  • Apps pin which version they expect: "text-to-text@1" (default = latest at publish time, frozen in lockfile).
  • Adding a new modality is a SemVer minor bump of the spec. Removing or changing one is major (rare, requires migration).

Same for tools — ocr@1.2 is a specific tool version with documented behavior and capabilities.
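
The pinning rule can be sketched as a small resolver: a reference with "@" keeps its version, and a bare name is resolved to the latest known version and frozen. The parsePin and lock helpers and the shape of the "latest" map are assumptions for illustration; the actual lockfile format is not described here.

```typescript
type Pin = { name: string; version: string | null };

// Split "text-to-text@1" into name and version; a bare name has no pin.
function parsePin(ref: string): Pin {
  const at = ref.lastIndexOf("@");
  if (at <= 0) return { name: ref, version: null };
  return { name: ref.slice(0, at), version: ref.slice(at + 1) };
}

// Resolve an unpinned name to the latest known version, as a lockfile step might.
function lock(ref: string, latest: Record<string, string>): string {
  const { name, version } = parsePin(ref);
  const resolved = version ?? latest[name];
  if (resolved === undefined) throw new Error(`No known version for ${name}`);
  return `${name}@${resolved}`;
}

// Hypothetical "latest at publish time" table.
const latestAtPublish: Record<string, string> = { "text-to-text": "1", ocr: "1.2" };

console.log(lock("text-to-text", latestAtPublish)); // text-to-text@1
console.log(lock("ocr@1.2", latestAtPublish));      // ocr@1.2
```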

Open questions

  • (open) Should text-to-image be in v1, or punt to v2? Diffusion models are heavy and the legal landscape (training data, NSFW) is messy. Leaning punt.
  • (open) Naming for text-to-text-thinking — alternatives: reasoning, chain-of-thought. Bikeshed.
  • (open) Should “voice-to-voice” auto-include interruption / barge-in handling, or leave to apps?
  • (open) Custom modalities: can apps declare their own modality types, or only use the curated set? Leaning curated-only for v1 (fewer surprises in review).

Cross-references