04 — Modalities & Tooling

The two first-class declarations in a Locara app answer two questions: which input/output transformations does the app perform (modalities), and which tools can the LLM or app code call (tooling)?

These are the two lenses developers think in when designing an app. Underneath, both expand into the lower-level capabilities, models, and SDK calls. But the developer-facing surface stays at this higher level.

Why two declarations, not one big capability bag

A flat capability list is fine for the runtime to enforce, but it’s terrible for the developer to author or for an AI agent to fill in. “I want to build a voice assistant” should not require knowing:

device.microphone: true
A whisper-class model in models[]
An LLM model in models[]
(Maybe) a TTS model
Audio recording capability
A way to play audio output

Instead, declare:

"modalities": ["voice-to-voice"]

…and the framework expands all of it. The developer can override specifics, but the default expansion produces a working app.

Same for tooling. “I want my chat app to do OCR on uploaded PDFs” should be:

"tooling": ["ocr"]

…not a hand-wired list of OCR model + filesystem capabilities + SDK access.

Modalities

A modality is an input-shape → output-shape transformation. Locara v1 defines a closed taxonomy:

Modality	Inputs	Outputs	Implies
`text-to-text`	text	text (streaming)	LLM model + `llm.chat` SDK
`text-to-text-thinking`	text	text + reasoning trace	LLM model with reasoning support + `llm.chat({ thinking: true })`
`speech-to-text`	audio file or live mic	text + segments	STT model + `transcribe.*` SDK + (live) `device.microphone`
`text-to-speech`	text	audio	TTS model + `tts.*` SDK + `audio.play`
`voice-to-voice`	live mic	audio	STT + LLM + TTS chain + `device.microphone` + `audio.play`
`image-to-text`	image	text description / OCR / structured	VLM or OCR model + `vlm.*` or `ocr.*` SDK + `fs.user-selected`
`text-to-image`	text	image	Image-gen model + `image.generate` SDK
`text-to-embedding`	text	float vector	Embedding model + `embed.*` SDK
`audio-to-embedding`	audio	float vector	Audio embedding model + `embed.audio` SDK

Each modality is a named bundle with a stable expansion. Adding a new modality to the framework = a one-time decision; apps that adopt it inherit the expansion.

Declaring a modality (simple form)

"modalities": ["speech-to-text", "text-to-text"]

Picks Locara-default models and capabilities. Works for ~80% of apps.

Declaring a modality (overridden form)

"modalities": [
  { 
    "type": "speech-to-text",
    "model": "whisper-large-v3-q4@sha256:abc...",
    "live": true
  },
  {
    "type": "text-to-text",
    "model": "qwen2.5-3b-instruct-q4@sha256:def..."
  }
]

Override picks specific models, enables/disables live mode, etc.
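The simple and overridden forms can be handled as a single shape internally. The following is a minimal sketch, assuming the string form is equivalent to an object with only `type` set. The field names `type`, `model`, and `live` come from the examples above; the type names and the `normalizeModality` helper are hypothetical.

```typescript
// Hypothetical types; only the field names `type`, `model`, and `live`
// come from the manifest examples in this document.
type ModalityDecl = string | { type: string; model?: string; live?: boolean };

interface NormalizedModality {
  type: string;
  model?: string; // undefined means "use the Locara default model"
  live?: boolean; // undefined means "use the modality's default"
}

// Turn the simple (string) form into the object form so that later stages
// only have to handle one shape.
function normalizeModality(decl: ModalityDecl): NormalizedModality {
  return typeof decl === "string" ? { type: decl } : { ...decl };
}

const decls: ModalityDecl[] = [
  "text-to-text",
  { type: "speech-to-text", model: "whisper-large-v3-q4@sha256:abc...", live: true },
];
const normalized = decls.map(normalizeModality);
```

After normalization, a missing `model` signals that the default should be filled in, which keeps default selection separate from parsing.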

Modality expansion

When locara dev or locara build reads the manifest, modalities expand into capabilities/models/SDK:

"modalities": ["speech-to-text"]
  ↓
"capabilities": {
  "device.microphone": true,
  "fs.user-selected": "read",   // for file-based transcription
  "models": ["whisper-large-v3-q4@sha256:..."]
},
"sdk_modules": ["transcribe", "audio"]

The expansion is deterministic and inspectable — locara verify --explain shows the full expansion, so developers can see exactly what was added.
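One way to picture a deterministic expansion is a pure lookup over a fixed table. This is a sketch under stated assumptions, not the framework's implementation: `MODALITY_TABLE`, `Expansion`, and `expandModalities` are hypothetical names, only the `speech-to-text` row is taken from the example above, and models are kept in their own field here purely to simplify the types.

```typescript
// Hypothetical shape of an expanded manifest fragment.
interface Expansion {
  capabilities: Record<string, boolean | string>;
  models: string[];
  sdkModules: string[];
}

// Illustrative table: only the `speech-to-text` row comes from the example
// in this document; a real table would cover the whole taxonomy.
const MODALITY_TABLE: Record<string, Expansion> = {
  "speech-to-text": {
    capabilities: { "device.microphone": true, "fs.user-selected": "read" },
    models: ["whisper-large-v3-q4@sha256:..."],
    sdkModules: ["transcribe", "audio"],
  },
};

function expandModalities(types: string[]): Expansion {
  const out: Expansion = { capabilities: {}, models: [], sdkModules: [] };
  // De-duplicate and sort so the same set of modalities always yields the
  // same output, regardless of declaration order.
  for (const type of Array.from(new Set(types)).sort()) {
    const entry = MODALITY_TABLE[type];
    if (!entry) throw new Error(`Unknown modality "${type}"`);
    Object.assign(out.capabilities, entry.capabilities);
    for (const m of entry.models) if (!out.models.includes(m)) out.models.push(m);
    for (const s of entry.sdkModules) if (!out.sdkModules.includes(s)) out.sdkModules.push(s);
  }
  return out;
}
```

Because the function has no inputs other than the declared modalities and the table, its output can be printed verbatim by an explain-style command.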

If a developer wants more granularity than a modality offers (e.g., they want STT but specifically NOT live mode), they use the overridden form or fall back to declaring capabilities directly.

Per-profile modality overrides

Modalities can vary by device profile (see 02-manifest.md profiles):

"modalities": [
  { "type": "text-to-text", "model": "qwen2.5-3b-q4" }
],
"profiles": {
  "low":  { "modalities_override": [{ "type": "text-to-text", "model": "qwen-1.5b-q4" }] },
  "high": { "modalities_override": [{ "type": "text-to-text", "model": "qwen-7b-q4" }] }
}

The runtime picks the right model size for the user’s hardware automatically. Developers think in modalities; users get the right model.
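The selection rule is not spelled out here, so the following sketch makes two loud assumptions: the runtime picks the profile with the largest `min_ram_gb` the device still satisfies, and an override entry replaces the base entry with the same `type`. The helper names and the `min_ram_gb` thresholds are illustrative only.

```typescript
interface ModalityEntry { type: string; model?: string }
interface Profile { min_ram_gb?: number; modalities_override?: ModalityEntry[] }

// Assumption: choose the profile with the largest `min_ram_gb` that the
// device satisfies; a profile without `min_ram_gb` always fits.
function pickProfile(profiles: Record<string, Profile>, ramGb: number): string | undefined {
  let best: string | undefined;
  let bestMin = -1;
  for (const [profileName, p] of Object.entries(profiles)) {
    const min = p.min_ram_gb ?? 0;
    if (min <= ramGb && min > bestMin) {
      best = profileName;
      bestMin = min;
    }
  }
  return best;
}

// Assumption: an override replaces the base entry with the same `type`;
// base entries without a matching override are kept unchanged.
function resolveModalities(base: ModalityEntry[], profile?: Profile): ModalityEntry[] {
  const overrides = profile?.modalities_override ?? [];
  return base.map((m) => overrides.find((o) => o.type === m.type) ?? m);
}

const baseModalities: ModalityEntry[] = [{ type: "text-to-text", model: "qwen2.5-3b-q4" }];
const deviceProfiles: Record<string, Profile> = {
  // Thresholds are made up for this sketch.
  low:  { min_ram_gb: 0,  modalities_override: [{ type: "text-to-text", model: "qwen-1.5b-q4" }] },
  high: { min_ram_gb: 32, modalities_override: [{ type: "text-to-text", model: "qwen-7b-q4" }] },
};
```

With these illustrative thresholds, a 16 GB machine resolves to the `low` profile and a 64 GB machine to `high`; when no profile applies, the base declaration is used as written.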

Tooling

A tool is something an LLM or the app’s code can call to act on the world. Locara treats tooling as first-class — there’s a curated registry of built-in tools, plus a path for app-specific custom tools.

Built-in tools (v1 catalog)

These are signed, audited, sandboxed wasm modules shipped as part of the Locara registry. Apps opt in by declaration:

Tool	Does	Capabilities required
`ocr`	Extract text + structure from image/PDF	model: GLM-OCR or RapidOCR
`filesystem.search`	Search user-selected directory by name/content	`fs.user-selected`
`filesystem.read`	Read user-selected file as text/bytes	`fs.user-selected`
`filesystem.write`	Write to user-selected location	`fs.user-selected: "read-write"`
`code-exec.python`	Execute Python in a wasm sandbox (Pyodide-style)	none (sandbox-isolated)
`code-exec.js`	Execute JS in a wasm sandbox	none
`bash.read-only`	Run safe read-only shell commands (ls, grep, find)	scoped fs
`image.resize`	Resize/crop images	none
`image.format-convert`	Convert image formats	none
`pdf.split`	Split PDFs by page	none
`pdf.extract-text`	Extract text from PDF	none
`audio.transcribe`	Transcribe audio (uses STT modality if declared)	inherits
`web.fetch`	Fetch a URL	`net: { allowed_hosts: [...] }` only
`text.summarize`	Summarize via LLM	inherits LLM modality
`text.translate`	Translate via LLM	inherits LLM modality

This catalog grows over time. Adding a new tool = a one-time PR with the wasm artifact, signed by Locara’s key, with documented capability requirements.

Declaring tooling

"tooling": ["ocr", "filesystem.search", "code-exec.python"]

Each string entry must name a tool in the curated registry; tools outside it use the object form described under Custom tools. Apps opt into specific tools.

The runtime exposes them via the SDK:

import { llm, tools } from '@locara/sdk'

// Direct invocation
const result = await tools.ocr({ source: pdfBlob })

// Or pass to an LLM as callable tools
const response = await llm.chat({
  model: 'qwen2.5-3b',
  messages: [...],
  tools: ['ocr', 'filesystem.search']  // LLM can choose to call
})

Tooling expansion

Just like modalities, tools expand into underlying capabilities + SDK:

"tooling": ["filesystem.search", "ocr"]
  ↓
"capabilities": {
  "fs.user-selected": "read",
  "tools": [
    "wasm.locara.filesystem-search@1.0",
    "wasm.locara.ocr@1.2"
  ],
  "models": ["glm-ocr-1.5-q8@sha256:..."]   // OCR's model dependency added automatically
},
"sdk_modules": ["tools", "fs"]

Capability inheritance: tools cannot exceed app

A tool can never grant capabilities the app didn’t declare. If web.fetch requires net, but the app declares net: false, then web.fetch is unavailable in that app. The framework refuses at locara verify:

✗ Tool "web.fetch" requires capability "net" but the app declares net: false.
   Either: declare "net: { allowed_hosts: [...] }" in capabilities, or remove web.fetch.

This composition rule is the backbone of the trust model — adding a tool can never sneakily expand the app’s reach.
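The check behind that error message could look roughly like the sketch below. It is not the actual `locara verify` logic: `TOOL_REQUIREMENTS` and `verifyTools` are hypothetical, and treating an absent capability the same as an explicit `false` is an assumption (the text only specifies the `net: false` case).

```typescript
// Simplified: a capability value is either missing/false or some grant.
type AppCapabilities = Record<string, unknown>;

// Capability names each tool needs. The rows mirror the built-in tool table;
// the data structure itself is made up for this sketch.
const TOOL_REQUIREMENTS: Record<string, string[]> = {
  "web.fetch": ["net"],
  "filesystem.read": ["fs.user-selected"],
  "image.resize": [],
};

function verifyTools(toolNames: string[], caps: AppCapabilities): string[] {
  const errors: string[] = [];
  for (const tool of toolNames) {
    // Tools missing from the table are treated as having no requirements;
    // catalog membership would be checked separately.
    for (const cap of TOOL_REQUIREMENTS[tool] ?? []) {
      const granted = caps[cap];
      // Assumption: missing, null, and false all mean "not granted".
      if (granted === undefined || granted === null || granted === false) {
        errors.push(`Tool "${tool}" requires capability "${cap}" but the app does not grant it.`);
      }
    }
  }
  return errors;
}
```

An app declaring `net: false` together with `web.fetch` yields one error, while the same app with `net: { allowed_hosts: [...] }` passes.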

Custom tools (app-specific)

For tools not in the curated registry, apps can ship their own wasm tools:

my-app/
├── locara.json
└── tools/
    └── my-custom-tool.wasm

"tooling": [
  "ocr",
  { "name": "my-custom-tool", "path": "./tools/my-custom-tool.wasm", "signature": "...", "capabilities": [...] }
]

Custom tools:

Must be signed by the publisher.
Declare their own capability requirements.
Run in the same Wasmtime sandbox as registry tools.
Get reviewed alongside the app at submission time (capability requirements must be a subset of the app’s).

See 10-tools.md for tool runtime details.

The “vibe” of declarations

The three layers, top to bottom:

Modalities + Tooling          ← what developers write
       ↓ (expansion)
Capabilities                  ← what gets enforced
       ↓ (mapping)
macOS entitlements + Tauri permissions + Locara runtime checks

Developers write in the top layer 90% of the time. The middle layer is for power users / unusual cases. The bottom layer is for the runtime.

This means a typical locara.json for a voice-assistant app looks like:

{
  "schema": "locara/v1",
  "name": "voice-assistant",
  "publisher": "kingtongchoo",
  "version": "0.1.0",

  "displayName": "Voice Assistant",
  "description": "Fully-local voice assistant.",
  "license": "Apache-2.0",
  "icon": "./public/icon.png",
  "screenshots": ["./screenshots/main.png"],
  "category": "productivity",

  "modalities": [
    "voice-to-voice"
  ],
  "tooling": [
    "filesystem.search",
    "code-exec.python"
  ],

  "profiles": {
    "mid":  { "min_ram_gb": 16 },
    "high": { "min_ram_gb": 32 }
  },

  "storage": {
    "schema": "./db/schema.sql"
  }
}

That’s the entire manifest. The framework expands it into ~30 lines of capabilities + model declarations under the hood.

AI authoring with modalities + tooling

For agent-friendly authoring (Cursor, Claude, etc.), modalities + tooling are the right surface to expose. An LLM scaffolding a Locara app can produce a small, valid manifest from a natural-language prompt:

“Build me a meeting recorder that transcribes and lets me search past meetings.”

→ modalities: ["speech-to-text"], tooling: ["filesystem.search"], schema with a meetings table.

The LLM doesn’t have to know the entire capability schema; it just picks from the named modality/tool catalog. Because the catalog is closed, a hallucinated “fake capability” cannot pass as valid: any name outside the catalog is simply not a legal entry.
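A closed catalog makes checking an agent-written manifest a simple membership test. The sketch below assumes a hypothetical `unknownNames` helper and a `Draft` shape with just the two declaration arrays; the modality names are the v1 taxonomy listed earlier, and the tool set is a subset of the built-in catalog.

```typescript
// Names copied from the v1 modality table.
const MODALITIES = new Set([
  "text-to-text", "text-to-text-thinking", "speech-to-text", "text-to-speech",
  "voice-to-voice", "image-to-text", "text-to-image", "text-to-embedding",
  "audio-to-embedding",
]);

// A subset of the built-in tool catalog, enough for the example.
const TOOLS = new Set([
  "ocr", "filesystem.search", "filesystem.read", "code-exec.python", "web.fetch",
]);

interface Draft { modalities: string[]; tooling: string[] }

// Returns every declared name that is not in the closed catalog.
function unknownNames(draft: Draft): string[] {
  return [
    ...draft.modalities.filter((m) => !MODALITIES.has(m)),
    ...draft.tooling.filter((t) => !TOOLS.has(t)),
  ];
}

const meetingRecorder: Draft = {
  modalities: ["speech-to-text"],
  tooling: ["filesystem.search"],
};
```

An empty result means the draft only uses catalog names; anything else can be reported back to the agent as a list of names to replace.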

This is why modalities + tooling exist as a layer above capabilities. Capabilities are precise and machine-enforced; modalities + tooling are simple and human/AI-authored.

Versioning

Modalities and tools are versioned independently of the app:

text-to-text@1 is the v1 expansion. If we ever change what text-to-text includes, it becomes text-to-text@2.
Apps pin which version they expect: "text-to-text@1" (default = latest at publish time, frozen in lockfile).
Adding a new modality is a SemVer minor bump of the spec. Removing or changing one is major (rare, requires migration).

Same for tools — ocr@1.2 is a specific tool version with documented behavior and capabilities.
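Resolving a versioned reference such as `text-to-text@1` or `ocr@1.2` can be sketched as follows. The precedence used here (explicit pin, then lockfile, then latest known version) is an assumption drawn from the "frozen in lockfile" wording; `parseRef`, `resolveVersion`, and the lockfile shape are hypothetical.

```typescript
interface VersionedRef { name: string; version?: string }

// Split "text-to-text@1" into name and version; a bare name has no version.
function parseRef(ref: string): VersionedRef {
  const at = ref.lastIndexOf("@");
  return at > 0 ? { name: ref.slice(0, at), version: ref.slice(at + 1) } : { name: ref };
}

// Assumed precedence: an explicit pin wins, then the lockfile entry, then
// the latest version known at publish time.
function resolveVersion(
  ref: string,
  lockfile: Record<string, string>,
  latest: Record<string, string>,
): string {
  const { name, version } = parseRef(ref);
  const resolved = version ?? lockfile[name] ?? latest[name];
  if (resolved === undefined) throw new Error(`No version known for "${name}"`);
  return `${name}@${resolved}`;
}
```

Under this rule an unpinned `text-to-text` keeps resolving to whatever the lockfile recorded, even after a newer expansion is published.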

Open questions

(open) Should text-to-image be in v1, or punt to v2? Diffusion models are heavy and the legal landscape (training data, NSFW) is messy. Leaning punt.
(open) Naming for text-to-text-thinking — alternatives: reasoning, chain-of-thought. Bikeshed.
(open) Should “voice-to-voice” auto-include interruption / barge-in handling, or leave to apps?
(open) Custom modalities: can apps declare their own modality types, or only use the curated set? Leaning curated-only for v1 (fewer surprises in review).

Cross-references

The capabilities modalities/tools expand into: 03-capabilities.md
SDK API surface for each modality: 05-sdk.md
Tool runtime details (Wasmtime + WASI): 10-tools.md
Models that back modalities: 09-models.md
Why this layering matters for AI authoring: v0 + community design systems