04 — Modalities & Tooling
The two first-class declarations in a Locara app: which input/output transformations the app performs (modalities), and which tools the LLM or app code can call (tooling).
These are the two lenses developers think in when designing an app. Underneath, both expand into the lower-level capabilities, models, and SDK calls. But the developer-facing surface stays at this higher level.
Why two declarations, not one big capability bag
A flat capability list is fine for the runtime to enforce, but it’s terrible for the developer to author or for an AI agent to fill in. “I want to build a voice assistant” should not require knowing:
- device.microphone: true
- A whisper-class model in models[]
- An LLM model in models[]
- (Maybe) a TTS model
- Audio recording capability
- A way to play audio output
Instead, declare:
"modalities": ["voice-to-voice"]
…and the framework expands all of it. The developer can override specifics, but the default expansion produces a working app.
Same for tooling. “I want my chat app to do OCR on uploaded PDFs” should be:
"tooling": ["ocr"]
…not a hand-wired list of OCR model + filesystem capabilities + SDK access.
Modalities
A modality is an input-shape → output-shape transformation. Locara v1 defines a closed taxonomy:
| Modality | Inputs | Outputs | Implies |
|---|---|---|---|
| text-to-text | text | text (streaming) | LLM model + llm.chat SDK |
| text-to-text-thinking | text | text + reasoning trace | LLM model with reasoning support + llm.chat({ thinking: true }) |
| speech-to-text | audio file or live mic | text + segments | STT model + transcribe.* SDK + (live) device.microphone |
| text-to-speech | text | audio | TTS model + tts.* SDK + audio.play |
| voice-to-voice | live mic | audio | STT + LLM + TTS chain + device.microphone + audio.play |
| image-to-text | image | text description / OCR / structured | VLM or OCR model + vlm.* or ocr.* SDK + fs.user-selected |
| text-to-image | text | image | Image-gen model + image.generate SDK |
| text-to-embedding | text | float vector | Embedding model + embed.* SDK |
| audio-to-embedding | audio | float vector | Audio embedding model + embed.audio SDK |
Each modality is a named bundle with a stable expansion. Adding a new modality to the framework = a one-time decision; apps that adopt it inherit the expansion.
Declaring a modality (simple form)
"modalities": ["speech-to-text", "text-to-text"]
Picks Locara-default models and capabilities. Works for ~80% of apps.
Declaring a modality (overridden form)
"modalities": [
{
"type": "speech-to-text",
"model": "whisper-large-v3-q4@sha256:abc...",
"live": true
},
{
"type": "text-to-text",
"model": "qwen2.5-3b-instruct-q4@sha256:def..."
}
]
The overridden form pins specific models and toggles per-modality options such as live mode.
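Because a manifest may mix both forms, tooling that reads it will likely want a single shape to work with. The sketch below normalizes either form to the object form. The field names (type, model, live) come from the example above; the `ModalityDecl` and `ModalityObject` types and the `normalizeModality` helper are hypothetical names for illustration, not part of the Locara SDK.

```typescript
// Hypothetical types mirroring the two manifest forms shown above.
interface ModalityObject {
  type: string;
  model?: string; // a pinned model id, when the developer overrides the default
  live?: boolean; // live-mode toggle, when the modality supports it
}
type ModalityDecl = string | ModalityObject;

// Normalize either form to the object form so later stages handle one shape.
// A bare string carries no overrides, so defaults would be filled in downstream.
function normalizeModality(decl: ModalityDecl): ModalityObject {
  return typeof decl === "string" ? { type: decl } : { ...decl };
}

// A manifest may mix the simple and the overridden form in one array.
const declared: ModalityDecl[] = [
  "text-to-text",
  { type: "speech-to-text", live: false },
];
const normalized = declared.map(normalizeModality);
```

After normalization, an absent `model` or `live` field means "use the default", which keeps the override logic separate from the parsing step.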
Modality expansion
When locara dev or locara build reads the manifest, modalities expand into capabilities/models/SDK:
"modalities": ["speech-to-text"]
↓
"capabilities": {
"device.microphone": true,
"fs.user-selected": "read", // for file-based transcription
"models": ["whisper-large-v3-q4@sha256:..."]
},
"sdk_modules": ["transcribe", "audio"]
The expansion is deterministic and inspectable — locara verify --explain shows the full expansion, so developers can see exactly what was added.
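One way to picture a deterministic expansion is a fixed lookup table plus an order-preserving merge. The sketch below is an illustration under assumptions, not the framework's implementation: the speech-to-text row copies the example above, while the text-to-text row, its placeholder model id, and the names `EXPANSIONS`, `Expansion`, and `expandModalities` are hypothetical.

```typescript
// Hypothetical shape of an expanded manifest fragment.
interface Expansion {
  capabilities: Record<string, boolean | string>;
  models: string[];
  sdkModules: string[];
}

// Illustrative expansion table. "speech-to-text" copies the example above;
// the "text-to-text" row and its placeholder model id are assumptions.
const EXPANSIONS: Record<string, Expansion> = {
  "speech-to-text": {
    capabilities: { "device.microphone": true, "fs.user-selected": "read" },
    models: ["whisper-large-v3-q4@sha256:..."],
    sdkModules: ["transcribe", "audio"],
  },
  "text-to-text": {
    capabilities: {},
    models: ["<default-llm>"],
    sdkModules: ["llm"],
  },
};

// Append only items not already present, preserving first-seen order.
function addUnique(target: string[], items: string[]): void {
  for (const item of items) {
    if (target.indexOf(item) === -1) target.push(item);
  }
}

// Expand modalities in declaration order. No randomness or external state is
// involved, so the same input always yields the same output.
function expandModalities(modalities: string[]): Expansion {
  const result: Expansion = { capabilities: {}, models: [], sdkModules: [] };
  for (const modality of modalities) {
    const entry = EXPANSIONS[modality];
    if (entry === undefined) throw new Error(`Unknown modality: ${modality}`);
    result.capabilities = { ...result.capabilities, ...entry.capabilities };
    addUnique(result.models, entry.models);
    addUnique(result.sdkModules, entry.sdkModules);
  }
  return result;
}

const expanded = expandModalities(["speech-to-text", "text-to-text"]);
```

Keeping the table as plain data is what makes an "explain" view cheap: the tool only has to print which row contributed which entry.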
If a developer wants more granularity than a modality offers (e.g., they want STT but specifically NOT live mode), they use the overridden form or fall back to declaring capabilities directly.
Per-profile modality overrides
Modalities can vary by device profile (see 02-manifest.md profiles):
"modalities": [
{ "type": "text-to-text", "model": "qwen2.5-3b-q4" }
],
"profiles": {
"low": { "modalities_override": [{ "type": "text-to-text", "model": "qwen-1.5b-q4" }] },
"high": { "modalities_override": [{ "type": "text-to-text", "model": "qwen-7b-q4" }] }
}
The runtime picks the right model size for the user’s hardware automatically. Developer thinks in modalities; users get the right model.
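How the runtime chooses a profile is not specified here. A minimal sketch, assuming selection is driven only by min_ram_gb (the field used in the sample manifest later in this document) and that an override replaces the base entry with the same type: `pickProfile`, `resolveModalities`, and `modelsFor` are hypothetical helpers, and the min_ram_gb values of 0 and 32 are illustrative.

```typescript
interface ModalitySpec {
  type: string;
  model: string;
}

interface Profile {
  min_ram_gb: number;
  modalities_override?: ModalitySpec[];
}

// Pick the profile with the largest min_ram_gb that the device still satisfies.
// Returns null when no profile fits, in which case the base modalities apply.
function pickProfile(profiles: Record<string, Profile>, deviceRamGb: number): string | null {
  let best: string | null = null;
  let bestMin = -1;
  for (const id of Object.keys(profiles)) {
    const min = profiles[id].min_ram_gb;
    if (min <= deviceRamGb && min > bestMin) {
      best = id;
      bestMin = min;
    }
  }
  return best;
}

// Replace each base modality with the profile's override of the same type, if any.
function resolveModalities(base: ModalitySpec[], profile: Profile | null): ModalitySpec[] {
  const overrides =
    profile !== null && profile.modalities_override !== undefined
      ? profile.modalities_override
      : [];
  return base.map((spec) => {
    const matches = overrides.filter((o) => o.type === spec.type);
    return matches.length > 0 ? matches[0] : spec;
  });
}

// Model names are taken from the example above; the RAM thresholds are assumed.
const baseModalities: ModalitySpec[] = [{ type: "text-to-text", model: "qwen2.5-3b-q4" }];
const profiles: Record<string, Profile> = {
  low: { min_ram_gb: 0, modalities_override: [{ type: "text-to-text", model: "qwen-1.5b-q4" }] },
  high: { min_ram_gb: 32, modalities_override: [{ type: "text-to-text", model: "qwen-7b-q4" }] },
};

function modelsFor(deviceRamGb: number): string[] {
  const id = pickProfile(profiles, deviceRamGb);
  const profile = id !== null ? profiles[id] : null;
  return resolveModalities(baseModalities, profile).map((m) => m.model);
}
```

Under these assumptions, a 64 GB machine resolves to the "high" override and an 8 GB machine to the "low" one; a real runtime would probably weigh more than RAM.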
Tooling
A tool is something an LLM or the app’s code can call to act on the world. Locara treats tooling as first-class — there’s a curated registry of built-in tools, plus a path for app-specific custom tools.
Built-in tools (v1 catalog)
These are signed, audited, sandboxed wasm modules shipped as part of the Locara registry. Apps opt in by declaration:
| Tool | Does | Capabilities required |
|---|---|---|
| ocr | Extract text + structure from image/PDF | model: GLM-OCR or RapidOCR |
| filesystem.search | Search user-selected directory by name/content | fs.user-selected |
| filesystem.read | Read user-selected file as text/bytes | fs.user-selected |
| filesystem.write | Write to user-selected location | fs.user-selected: "read-write" |
| code-exec.python | Execute Python in a wasm sandbox (Pyodide-style) | none (sandbox-isolated) |
| code-exec.js | Execute JS in a wasm sandbox | none |
| bash.read-only | Run safe read-only shell commands (ls, grep, find) | scoped fs |
| image.resize | Resize/crop images | none |
| image.format-convert | Convert image formats | none |
| pdf.split | Split PDFs by page | none |
| pdf.extract-text | Extract text from PDF | none |
| audio.transcribe | Transcribe audio (uses STT modality if declared) | inherits |
| web.fetch | Fetch a URL | net: { allowed_hosts: [...] } only |
| text.summarize | Summarize via LLM | inherits LLM modality |
| text.translate | Translate via LLM | inherits LLM modality |
This catalog grows over time. Adding a new tool = a one-time PR with the wasm artifact, signed by Locara’s key, with documented capability requirements.
Declaring tooling
"tooling": ["ocr", "filesystem.search", "code-exec.python"]
Each entry must be in the curated registry. Apps opt into specific tools.
The runtime exposes them via the SDK:
import { tools, llm } from '@locara/sdk'
// Direct invocation
const result = await tools.ocr({ source: pdfBlob })
// Or pass to an LLM as callable tools
const response = await llm.chat({
model: 'qwen2.5-3b',
messages: [...],
tools: ['ocr', 'filesystem.search'] // LLM can choose to call
})
Tooling expansion
Just like modalities, tools expand into underlying capabilities + SDK:
"tooling": ["filesystem.search", "ocr"]
↓
"capabilities": {
"fs.user-selected": "read",
"tools": [
"wasm.locara.filesystem-search@1.0",
"wasm.locara.ocr@1.2"
],
"models": ["glm-ocr-1.5-q8@sha256:..."] // OCR's model dependency added automatically
},
"sdk_modules": ["tools", "fs"]
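When several tools touch the same capability at different levels, the expansion has to settle on one value. A minimal sketch, assuming the merged filesystem access is simply the stronger of the requested levels: the `TOOL_CATALOG` table, the `expandTooling` helper, and the filesystem.write artifact version are hypothetical, while the other two artifact ids come from the example above.

```typescript
type FsAccess = "read" | "read-write";

interface ToolEntry {
  artifact: string;
  fsAccess?: FsAccess;
  models: string[];
}

// Illustrative catalog rows. The filesystem.write artifact id is an assumption.
const TOOL_CATALOG: Record<string, ToolEntry> = {
  "filesystem.search": { artifact: "wasm.locara.filesystem-search@1.0", fsAccess: "read", models: [] },
  "filesystem.write": { artifact: "wasm.locara.filesystem-write@1.0", fsAccess: "read-write", models: [] },
  "ocr": { artifact: "wasm.locara.ocr@1.2", models: ["glm-ocr-1.5-q8@sha256:..."] },
};

interface ToolExpansion {
  fsAccess?: FsAccess;
  tools: string[];
  models: string[];
}

// "read-write" subsumes "read", so the merged level is the stronger of the two.
function strongerAccess(a: FsAccess | undefined, b: FsAccess | undefined): FsAccess | undefined {
  if (a === "read-write" || b === "read-write") return "read-write";
  return a !== undefined ? a : b;
}

// Expand tools in declaration order, merging access levels and de-duplicating models.
function expandTooling(tooling: string[]): ToolExpansion {
  const out: ToolExpansion = { tools: [], models: [] };
  for (const id of tooling) {
    const entry = TOOL_CATALOG[id];
    if (entry === undefined) throw new Error(`Tool not in catalog: ${id}`);
    out.fsAccess = strongerAccess(out.fsAccess, entry.fsAccess);
    out.tools.push(entry.artifact);
    for (const model of entry.models) {
      if (out.models.indexOf(model) === -1) out.models.push(model);
    }
  }
  return out;
}

const searchAndOcr = expandTooling(["filesystem.search", "ocr"]);
const withWrite = expandTooling(["filesystem.search", "filesystem.write"]);
```

Taking the stronger level keeps the result independent of declaration order, which matters if the expansion is meant to be reproducible.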
Capability inheritance: tools cannot exceed app
A tool can never grant capabilities the app didn’t declare. If web.fetch requires net, but the app declares net: false, then web.fetch is unavailable in that app. The framework refuses at locara verify:
✗ Tool "web.fetch" requires capability "net" but the app declares net: false.
Either: declare "net: { allowed_hosts: [...] }" in capabilities, or remove web.fetch.
This composition rule is the backbone of the trust model — adding a tool can never sneakily expand the app’s reach.
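The rule amounts to a subset check: every capability a tool requires must already be granted by the app. A minimal sketch of that check follows. The `AppCapabilities` shape, the `verifyTools` helper, the error wording, and the example.com host are assumptions for illustration; the real check in locara verify may differ.

```typescript
// A capability is granted unless it is absent or explicitly false.
type CapabilityValue = boolean | string | { allowed_hosts: string[] };
type AppCapabilities = Record<string, CapabilityValue>;

interface ToolRequirements {
  name: string;
  requires: string[]; // capability keys the tool needs, e.g. "net"
}

function isGranted(caps: AppCapabilities, capability: string): boolean {
  const value = caps[capability];
  return value !== undefined && value !== false;
}

// Collect one error per (tool, missing capability) pair. An empty result means
// no tool reaches beyond what the app itself declared.
function verifyTools(caps: AppCapabilities, tools: ToolRequirements[]): string[] {
  const errors: string[] = [];
  for (const tool of tools) {
    for (const capability of tool.requires) {
      if (!isGranted(caps, capability)) {
        errors.push(
          `Tool "${tool.name}" requires capability "${capability}", which the app does not grant.`
        );
      }
    }
  }
  return errors;
}

const webFetch: ToolRequirements = { name: "web.fetch", requires: ["net"] };
const offlineErrors = verifyTools({ net: false }, [webFetch]);
const scopedErrors = verifyTools({ net: { allowed_hosts: ["example.com"] } }, [webFetch]);
```

With net: false the check reports one error for web.fetch; once the app declares an allowed_hosts list, the same tool passes.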
Custom tools (app-specific)
For tools not in the curated registry, apps can ship their own wasm tools:
my-app/
├── locara.json
└── tools/
└── my-custom-tool.wasm
"tooling": [
"ocr",
{ "name": "my-custom-tool", "path": "./tools/my-custom-tool.wasm", "signature": "...", "capabilities": [...] }
]
Custom tools:
- Must be signed by the publisher.
- Declare their own capability requirements.
- Run in the same Wasmtime sandbox as registry tools.
- Get reviewed alongside the app at submission time (capability requirements must be a subset of app’s).
See 10-tools.md for tool runtime details.
The “vibe” of declarations
The three layers, top to bottom:
Modalities + Tooling ← what developers write
↓ (expansion)
Capabilities ← what gets enforced
↓ (mapping)
macOS entitlements + Tauri permissions + Locara runtime checks
Developers write in the top layer 90% of the time. The middle layer is for power users / unusual cases. The bottom layer is for the runtime.
This means a typical locara.json for a voice-assistant app looks like:
{
"schema": "locara/v1",
"name": "voice-assistant",
"publisher": "kingtongchoo",
"version": "0.1.0",
"displayName": "Voice Assistant",
"description": "Fully-local voice assistant.",
"license": "Apache-2.0",
"icon": "./public/icon.png",
"screenshots": ["./screenshots/main.png"],
"category": "productivity",
"modalities": [
"voice-to-voice"
],
"tooling": [
"filesystem.search",
"code-exec.python"
],
"profiles": {
"mid": { "min_ram_gb": 16 },
"high": { "min_ram_gb": 32 }
},
"storage": {
"schema": "./db/schema.sql"
}
}
That’s the entire manifest. The framework expands it into ~30 lines of capabilities + model declarations under the hood.
AI authoring with modalities + tooling
For agent-friendly authoring (Cursor, Claude, etc.), modalities + tooling are the right surface to expose. An LLM scaffolding a Locara app can produce a small, valid manifest from a natural-language prompt:
“Build me a meeting recorder that transcribes and lets me search past meetings.”
→ modalities: ["speech-to-text"], tooling: ["filesystem.search"], schema with a meetings table.
The LLM doesn’t have to know the entire capability schema; it just picks from the named modality/tool catalog. Because the catalog is closed, a hallucinated “fake capability” cannot slip through: any name outside the catalog is simply not a valid entry.
This is why modalities + tooling exist as a layer above capabilities. Capabilities are precise and machine-enforced; modalities + tooling are simple and human/AI-authored.
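A closed catalog makes checking an agent-produced manifest a plain membership test. The sketch below lists the v1 modality and tool names from the tables in this document and reports anything outside them. The `DraftManifest` type and `findUnknownEntries` helper are hypothetical, and custom tool objects are left out for brevity.

```typescript
// Names copied from the v1 modality and tool tables in this document.
const MODALITY_NAMES: string[] = [
  "text-to-text", "text-to-text-thinking", "speech-to-text", "text-to-speech",
  "voice-to-voice", "image-to-text", "text-to-image", "text-to-embedding",
  "audio-to-embedding",
];
const TOOL_NAMES: string[] = [
  "ocr", "filesystem.search", "filesystem.read", "filesystem.write",
  "code-exec.python", "code-exec.js", "bash.read-only", "image.resize",
  "image.format-convert", "pdf.split", "pdf.extract-text", "audio.transcribe",
  "web.fetch", "text.summarize", "text.translate",
];

// Hypothetical minimal view of a manifest drafted by an LLM (simple form only).
interface DraftManifest {
  modalities: string[];
  tooling: string[];
}

// Return every declared name that is not in the closed catalog,
// prefixed with the field it came from.
function findUnknownEntries(draft: DraftManifest): string[] {
  const unknownModalities = draft.modalities
    .filter((m) => MODALITY_NAMES.indexOf(m) === -1)
    .map((m) => `modalities: ${m}`);
  const unknownTools = draft.tooling
    .filter((t) => TOOL_NAMES.indexOf(t) === -1)
    .map((t) => `tooling: ${t}`);
  return unknownModalities.concat(unknownTools);
}

// The meeting-recorder draft from the example above passes.
const meetingRecorder: DraftManifest = {
  modalities: ["speech-to-text"],
  tooling: ["filesystem.search"],
};
const meetingRecorderIssues = findUnknownEntries(meetingRecorder);

// A draft with invented names is flagged instead of silently accepted.
const inventedIssues = findUnknownEntries({
  modalities: ["video-to-text"],
  tooling: ["filesystem.search", "email.send"],
});
```

An authoring agent could feed the returned list back into its next attempt, so invalid names are corrected before the manifest ever reaches review.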
Versioning
Modalities and tools are versioned independently of the app:
- text-to-text@1 is the v1 expansion. If we ever change what text-to-text includes, it becomes text-to-text@2.
- Apps pin which version they expect: "text-to-text@1" (default = latest at publish time, frozen in lockfile).
- Adding a new modality is a SemVer minor bump of the spec. Removing or changing one is major (rare, requires migration).
Same for tools — ocr@1.2 is a specific tool version with documented behavior and capabilities.
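Pin resolution can be sketched as a parse step followed by a lookup. The example below assumes the name@version syntax shown above and a table of latest versions known at publish time; `parsePin`, `resolvePin`, and the `latestAtPublish` table are hypothetical, and the lockfile format itself is not specified here.

```typescript
interface Pin {
  name: string;
  version?: string; // absent when the manifest entry is unpinned
}

// Split "text-to-text@1" into name and version. An entry without "@" is unpinned.
function parsePin(entry: string): Pin {
  const at = entry.lastIndexOf("@");
  if (at === -1) return { name: entry };
  if (at === 0 || at === entry.length - 1) {
    throw new Error(`Malformed pin: ${entry}`);
  }
  return { name: entry.slice(0, at), version: entry.slice(at + 1) };
}

// Resolve an entry to an explicit name@version string. Unpinned entries take the
// latest version known at publish time; the result is what a lockfile would freeze.
function resolvePin(entry: string, latest: Record<string, string>): string {
  const pin = parsePin(entry);
  const version = pin.version !== undefined ? pin.version : latest[pin.name];
  if (version === undefined) {
    throw new Error(`No known version for: ${pin.name}`);
  }
  return `${pin.name}@${version}`;
}

// Versions taken from the examples in this section.
const latestAtPublish: Record<string, string> = { "text-to-text": "1", "ocr": "1.2" };

const locked = ["text-to-text", "ocr@1.2"].map((entry) => resolvePin(entry, latestAtPublish));
```

Freezing the resolved strings means a later text-to-text@2 would not change an already published app; the app moves only when its author re-resolves the pins.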
Open questions
- (open) Should text-to-image be in v1, or punt to v2? Diffusion models are heavy and the legal landscape (training data, NSFW) is messy. Leaning punt.
- (open) Naming for text-to-text-thinking. Alternatives: reasoning, chain-of-thought. Bikeshed.
- (open) Should “voice-to-voice” auto-include interruption / barge-in handling, or leave to apps?
- (open) Custom modalities: can apps declare their own modality types, or only use the curated set? Leaning curated-only for v1 (fewer surprises in review).
Cross-references
- The capabilities modalities/tools expand into: 03-capabilities.md
- SDK API surface for each modality: 05-sdk.md
- Tool runtime details (Wasmtime + WASI): 10-tools.md
- Models that back modalities: 09-models.md
- Why this layering matters for AI authoring: v0 + community design systems