Locara

Modalities & Models — Locara survey

A research note for designing Locara’s first-class modality catalogue. The canonical (terse, normative) list lives in spec/04-modalities.md. This document is the research backing: an exhaustive map of input/output transformations a foundation model can perform, the tooling each needs, and an honest ledger of what Locara has built.

The point of writing this down is that voice-to-voice was painful to wire because we discovered the surface area piece-by-piece (streaming audio I/O, frame protocol, jitter buffer, picker UI, capability declaration, IPC commands). Doing it this way for each new modality is wasteful. The plan: lay the whole map out once, identify the cross-cutting infrastructure, and ship modalities as tightly-scoped extensions on shared rails.

Snapshot taken: 2026-05-03. The foundation-model landscape shifts within months, so re-walk the catalogue at every phase boundary.

Taxonomy source: this doc mirrors HuggingFace’s task taxonomy. Per-modality detail is split into individual files under ./modalities/. This file holds the overview, taxonomy table, cross-cutting infrastructure, cross-modality observations, and the BACKLOG punch-list.


How to use this document

  • Looking up a single modality → go to ./modalities/<name>.md directly. Every file follows the same shape: What it is / Models / Tooling / Locara today / Missing / See also.
  • Framework planning → read the cross-cutting infrastructure section + the BACKLOG punch-list at the end of this file.
  • Research orientation → the taxonomy table below lists every modality with status; the cross-modality observations section names the patterns that recur.

Status legend (used everywhere):

| Symbol | Meaning |
|---|---|
| ✅ | Built, tested, used by a reference app |
| 🟡 | Crate exists, partial implementation, gaps documented |
|  | Skeleton crate / spec only |
| ❌ | Not built |
| ⛔ | Out of scope for Locara v1 (recorded for completeness) |

Modality taxonomy (HF-grouped)

Every modality lives at ./modalities/<id>.md. Click through for models, tooling, status, and gaps.

Multimodal (cross-modal foundation models)

| ID | Out | Status |
|---|---|---|
| text-to-text | text |  |
| text-to-text-thinking | text + reasoning trace | 🟡 |
| text-to-code | code | 🟡 |
| audio-text-to-text | text |  |
| image-text-to-text | text | 🟡 (OCR ✅, VLM ❌) |
| image-text-to-image | image |  |
| image-text-to-video | video |  |
| document-question-answering | text answer |  |
| video-text-to-text | text |  |
| visual-document-retrieval | ranking |  |
| any-to-any (omni) | text + audio + image |  |

Computer Vision

| ID | Out | Status |
|---|---|---|
| text-to-image | image |  |
| image-to-image (no text) | image |  |
| image-to-video | video |  |
| text-to-video | video |  |
| text-to-3d | 3D mesh |  |
| image-to-3d | 3D mesh |  |
| image-classification | label |  |
| zero-shot-image-classification | label from runtime list |  |
| object-detection | bboxes |  |
| image-segmentation | masks |  |
| depth-estimation | depth map |  |
| image-feature-extraction | image embedding |  |
| keypoint-detection | pose |  |
| video-classification | label |  |
| video-to-video | video |  |

Natural Language Processing

| ID | Out | Status |
|---|---|---|
| text-to-embedding | f32 vector |  |
| text-ranking (reranker) | scores |  |
| translation | text | 🟡 |
| summarization | text | 🟡 |
| classical-nlp-tasks | various |  |

Audio

| ID | Out | Status |
|---|---|---|
| speech-to-text | text + segments |  |
| text-to-speech | audio | 🟡 |
| text-to-audio (SFX) | audio |  |
| text-to-music | audio |  |
| audio-to-audio | audio |  |
| audio-classification | label |  |
| voice-activity-detection | speech spans |  |
| audio-to-embedding | f32 vector |  |
| voice-to-voice | audio + text | 🟡 |

Tabular / Other

| ID | Out | Status |
|---|---|---|
| time-series-forecasting | future values | ❌ candidate v2+ |
| out-of-scope | n/a | ⛔ tabular ML, RL, robotics, graph ML, fill-mask, unconditional image gen |

Infrastructure pillars

A model is just weights — running it as a Locara modality requires six pillars of infrastructure around it. Every per-modality file in ./modalities/ breaks out its needs along these same six headings so it’s easy to spot which pillar a new modality reuses vs. introduces.

| Pillar | What it covers | Examples already in Locara |
|---|---|---|
| Inference | The runtime that actually executes the model | locara-llama (autoregressive), locara-whisper (ASR specialist), locara-moshi (subprocess) |
| Input | Capturing / loading what goes into the model | locara-microphone, locara-screencapture-audio, file picker via fs.pick |
| Output | Routing what comes out of the model to the user | Streaming token Channel, voice playback queue (in-app), sqlite-vec for vectors |
| Storage | Persisting weights, session state, and outputs | locara-models::Cache (content-addressed weights), per-app SQLite, sqlite-vec |
| Interaction (IPC + SDK) | The dotted command name + shape the WebView calls | llm.chat_stream, transcribe.stream_*, voice.session_* |
| Capabilities (manifest) | What the app must declare to be allowed to use the modality | models[], device.microphone, fs.user-folder, etc. |
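
The Capabilities row can be read as a simple gate: a modality is usable only if the manifest declares everything it needs. A minimal sketch of that check follows. The requirement table, the function names, and the `Declared` type are hypothetical and only illustrate the idea; the capability strings come from this document, but which modality needs which capability is an assumption here, not the spec.

```rust
use std::collections::BTreeSet;

/// Capabilities an app manifest declares, e.g. "device.microphone".
type Declared = BTreeSet<&'static str>;

/// Hypothetical requirement table (assumption, not taken from the spec):
/// which capabilities a modality needs before it may be used.
fn required_for(modality: &str) -> &'static [&'static str] {
    match modality {
        "speech-to-text" => &["device.microphone"],
        "voice-to-voice" => &["device.microphone", "device.speaker"],
        _ => &[],
    }
}

/// Returns the capabilities the app still has to declare,
/// in the order the requirement table lists them.
fn missing_capabilities(modality: &str, declared: &Declared) -> Vec<&'static str> {
    required_for(modality)
        .iter()
        .copied()
        .filter(|cap| !declared.contains(cap))
        .collect()
}
```

An app that declares only device.microphone would pass the check for speech-to-text under these assumptions, while voice-to-voice would report device.speaker as missing.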

Per-modality files state explicitly which pillars are already covered (✅ with crate name) and which are gaps (❌ with what would need to be built). The cross-cutting table below is the roll-up showing which pillars are well-furnished vs. thin.
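
One way to picture the per-modality ledger described above is as a list of (pillar, coverage) pairs, where coverage is either a crate name or a note on what is missing. The sketch below is illustrative only: the type names are hypothetical, and the example entry lists just two pillars whose status is stated elsewhere in this file (the shared weight cache and the missing diffusion runtime).

```rust
/// The six infrastructure pillars named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Pillar {
    Inference,
    Input,
    Output,
    Storage,
    Interaction,
    Capabilities,
}

/// Coverage of one pillar for one modality.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Coverage {
    /// Already covered; carries the crate (or component) name.
    Covered(&'static str),
    /// Not built; carries a note on what would need to be built.
    Gap(&'static str),
}

/// Hypothetical per-modality ledger (not an existing Locara type).
struct ModalityLedger {
    id: &'static str,
    pillars: Vec<(Pillar, Coverage)>,
}

impl ModalityLedger {
    /// Pillars this modality would have to introduce rather than reuse.
    fn gaps(&self) -> Vec<Pillar> {
        self.pillars
            .iter()
            .filter(|(_, c)| matches!(c, Coverage::Gap(_)))
            .map(|(p, _)| *p)
            .collect()
    }
}

/// Illustrative entry only; the authoritative status lives in the modality
/// file. The remaining four pillars are omitted on purpose.
fn example_ledger() -> ModalityLedger {
    ModalityLedger {
        id: "text-to-image",
        pillars: vec![
            (Pillar::Inference, Coverage::Gap("diffusion runtime")),
            (Pillar::Storage, Coverage::Covered("locara-models::Cache")),
        ],
    }
}
```

Calling `example_ledger().gaps()` yields only the Inference pillar, which is the kind of at-a-glance answer the roll-up table aims for.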


Cross-cutting infrastructure

Reading down the per-modality “Infrastructure required” sections, the recurring needs are:

| Capability | Shared by | Locara today |
|---|---|---|
| Content-addressed weight cache + resumable fetch | every modality | locara-models::Cache |
| Inference backend trait (autoregressive, streaming) | text-to-text, code, thinking, VLM, voice | locara-core::InferenceBackend |
| Encoder-only inference (BERT-style, embeddings, classifiers, rerankers) | text-to-embedding, audio-to-embedding, image-feature-extraction, text-classification, NER, text-ranking, depth, classification | 🟡 (text only; audio, vision not wired) |
| Encoder-decoder inference (BART/MBART, translation, summarization specialists, Donut, Pix2Struct) | translation, summarization, document-QA |  |
| Audio capture (mic) + system-audio | speech-to-text, voice-to-voice, VAD |  |
| Audio playback w/ jitter buffer | text-to-speech, voice-to-voice, text-to-audio, text-to-music | 🟡 (in-app) |
| Image input pipeline (file → tensor) | every CV modality |  |
| Video input pipeline | video-to-text, multimodal-omni, video-classification, video-to-video |  |
| PDF / document rasterizer | document-QA, visual-document-retrieval |  |
| Diffusion runtime | text-to-image, image-to-image, text-to-video, image-to-video, text-to-audio, text-to-music |  |
| Vector store | text-to-embedding, audio-to-embedding, image-feature-extraction, visual-document-retrieval | sqlite-vec (text only; extension to multi-vector for ColPali still TBD) |
| Output file router (image / video / audio / 3D save) | many |  |
| Long-running task IPC (progress, cancel, resume) | text-to-image, text-to-video, text-to-3d, image-to-3d | 🟡 (channels exist; semantics ad-hoc) |
| Picker UI per modality | every modality | 🟡 (voice picker exists; not a shared component) |
| Per-modality capability declaration | every modality | 🟡 (manifest grants models; modality expansion is partial) |
| Mask / box / depth / pose overlay UI | object-det, segmentation, depth, keypoints |  |

Biggest missing rails (rank-ordered by how many modalities they unlock):

  1. Image input pipeline — 9 modalities depend on it, none built.
  2. Encoder-only inference for non-text — 8 modalities (audio embedding, image embedding, classifiers, rerankers, depth, classification).
  3. Diffusion runtime — 6 modalities (image, video, audio, music families).
  4. Encoder-decoder inference — 3 modalities (translation, document-QA, specialist summarization).
  5. PDF rasterizer — unlocks document-QA and ColPali, big for DocVault.
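
The ordering above is a sort by "modalities unlocked". As a small worked example, the sketch below feeds the counts stated in the list into a stable descending sort. The `Rail` type and function names are hypothetical, and the count of 2 for the PDF rasterizer is an assumption read off the cross-cutting table (document-QA and visual-document-retrieval), since the list gives no number for it.

```rust
/// A missing piece of shared infrastructure and how many modalities it unlocks.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Rail {
    name: &'static str,
    unlocks: usize,
}

/// Counts as stated in the list above, deliberately given out of order.
/// The PDF rasterizer's 2 is an assumption taken from the table's
/// "Shared by" column.
fn missing_rails() -> Vec<Rail> {
    vec![
        Rail { name: "Encoder-decoder inference", unlocks: 3 },
        Rail { name: "PDF / document rasterizer", unlocks: 2 },
        Rail { name: "Image input pipeline", unlocks: 9 },
        Rail { name: "Diffusion runtime", unlocks: 6 },
        Rail { name: "Encoder-only inference for non-text", unlocks: 8 },
    ]
}

/// Sorts descending by modalities unlocked. `sort_by` is stable, so rails
/// with equal counts keep their input order.
fn rank(mut rails: Vec<Rail>) -> Vec<Rail> {
    rails.sort_by(|a, b| b.unlocks.cmp(&a.unlocks));
    rails
}
```

Re-running the ranking after each catalogue re-walk is cheap if the counts are kept as data rather than prose.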

Cross-modality observations

A few things that surfaced from compiling the survey:

  1. The fastest route to filling out the catalogue is the omni model. Qwen3.5-Omni single-handedly covers text-to-text, text-to-text-thinking, image-text-to-text, video-text-to-text, audio-text-to-text, voice-to-voice, and any-to-any. If the subprocess-style backend pattern works for Moshi, the same pattern applies here. One crate, seven modalities.

  2. Apple-Silicon-bound users get short-changed if we lean on PyTorch-only paths, because MLX-native or llama.cpp-via-Metal inference is dramatically faster there. Lock the framework to those two paths for v1, and keep PyTorch via subprocess as the fallback for models that have no MLX port yet.

  3. Cool-down semantics are missing for several modalities. device.speaker, fs.user-folder (writing generated images / videos), device.camera (for live VLM apps) all need the re-consent-on-update protection that the spec already requires for device.microphone. Pending.

  4. Streaming progress is hand-rolled per modality. Voice sessions, transcribe sessions, LLM streams, hypothetical diffusion progress — each has its own IPC shape. A shared progress.* channel pattern (start → progress events → final) would make adding new modalities cheaper.

  5. Picker UIs are duplicated. Voice has its own omni picker (apps/voice/src/App.tsx). Each future modality will want one. A shared <ModelPicker modality="text-to-image" /> in @locara/components would be the right factoring.

  6. HF lumps closely-related tasks under one name; Locara is simpler if we collapse the same way. image-to-text, image-text-to-text, and visual-question-answering are one modality from a tooling perspective — the model decides which sub-task it does well. Don’t proliferate manifest entries.

  7. Apple Vision is the cheapest first-implementation for many CV modalities — it covers OCR (already wired), object detection, classification, body/face landmarks at zero RAM cost. Worth prioritising the Swift-sidecar pattern (already used for OCR) before shipping Rust-side neural CV.
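
Observation 4 proposes a shared progress.* pattern (start → progress events → final). The sketch below shows one possible shape for it using only the Rust standard library. Everything here is an assumption for illustration: the `ProgressEvent` enum, its field names, the extra `Failed` variant, and `spawn_task` are hypothetical and are not Locara's existing IPC types.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical event shape: exactly one Start, zero or more Progress,
/// then exactly one terminal event (Final or Failed).
#[derive(Debug, Clone, PartialEq)]
enum ProgressEvent<T> {
    Start { task_id: u64, total_steps: u32 },
    Progress { task_id: u64, step: u32 },
    Final { task_id: u64, output: T },
    Failed { task_id: u64, reason: String },
}

/// Runs `work` on a background thread. `work` receives a `report` callback
/// for step updates and returns the final output or an error message.
fn spawn_task<T, F>(task_id: u64, total_steps: u32, work: F) -> Receiver<ProgressEvent<T>>
where
    T: Send + 'static,
    F: FnOnce(&dyn Fn(u32)) -> Result<T, String> + Send + 'static,
{
    let (tx, rx): (Sender<ProgressEvent<T>>, Receiver<ProgressEvent<T>>) = channel();
    thread::spawn(move || {
        // A send error only means the receiver was dropped, so it is ignored.
        let _ = tx.send(ProgressEvent::Start { task_id, total_steps });
        let progress_tx = tx.clone();
        let report = move |step: u32| {
            let _ = progress_tx.send(ProgressEvent::Progress { task_id, step });
        };
        let last = match work(&report) {
            Ok(output) => ProgressEvent::Final { task_id, output },
            Err(reason) => ProgressEvent::Failed { task_id, reason },
        };
        let _ = tx.send(last);
    });
    rx
}

/// Stand-in for a long-running job such as diffusion steps; it reports
/// three steps and returns a made-up output path.
fn fake_image_job(report: &dyn Fn(u32)) -> Result<String, String> {
    for step in 1..=3 {
        report(step);
    }
    Ok(String::from("out.png"))
}
```

Draining `spawn_task(7, 3, fake_image_job).iter()` would yield one Start, three Progress events, and one Final, in that order. Because the event type is generic over the output, the same envelope could in principle carry an image path, a transcript, or a token stream summary, which is the factoring the observation argues for.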


What this means for the BACKLOG

The existing BACKLOG entry “Modalities + capabilities + models catalogue” is correctly framed but underspecified. With this survey in hand, the concrete punch-list becomes:


References

(Selected. Per-modality references are inline in each modality file under ./modalities/.)