Modalities & Models — Locara survey
A research note for designing Locara’s first-class modality
catalogue. The canonical (terse, normative) list lives in
spec/04-modalities.md. This document
is the research backing: an exhaustive map of input/output
transformations a foundation model can perform, the tooling each
needs, and an honest ledger of what Locara has built.
The point of writing this down is that voice-to-voice was painful to wire because we discovered the surface area piece-by-piece (streaming audio I/O, frame protocol, jitter buffer, picker UI, capability declaration, IPC commands). Doing it this way for each new modality is wasteful. The plan: lay the whole map out once, identify the cross-cutting infrastructure, and ship modalities as tightly-scoped extensions on shared rails.
Snapshot taken: 2026-05-03. The foundation-model landscape shifts within months, so re-walk the catalogue at every phase boundary.
Taxonomy source: this doc mirrors HuggingFace’s task taxonomy. Per-modality detail is split into individual files under
./modalities/. This file holds the overview, taxonomy table, cross-cutting infrastructure, cross-modality observations, and the BACKLOG punch-list.
How to use this document
- Looking up a single modality → go to ./modalities/<name>.md directly. Each file has What it is / Models / Tooling / Locara today / Missing / See also in the same shape.
- Framework planning → read the cross-cutting infrastructure section + the BACKLOG punch-list at the end of this file.
- Research orientation → the taxonomy table below lists every modality with status; the cross-modality observations section names the patterns that recur.
Status legend (used everywhere):
| Symbol | Meaning |
|---|---|
| ✅ | Built, tested, used by a reference app |
| 🟡 | Crate exists, partial implementation, gaps documented |
| ⏳ | Skeleton crate / spec only |
| ❌ | Not built |
| ⛔ | Out of scope for Locara v1 (recorded for completeness) |
Modality taxonomy (HF-grouped)
Every modality lives at ./modalities/<id>.md.
Click through for models, tooling, status, and gaps.
Multimodal (cross-modal foundation models)
| ID | Out | Status |
|---|---|---|
| text-to-text | text | ✅ |
| text-to-text-thinking | text + reasoning trace | 🟡 |
| text-to-code | code | 🟡 |
| audio-text-to-text | text | ❌ |
| image-text-to-text | text | 🟡 (OCR ✅, VLM ❌) |
| image-text-to-image | image | ❌ |
| image-text-to-video | video | ❌ |
| document-question-answering | text answer | ❌ |
| video-text-to-text | text | ❌ |
| visual-document-retrieval | ranking | ❌ |
| any-to-any (omni) | text + audio + image | ⏳ |
Computer Vision
| ID | Out | Status |
|---|---|---|
| text-to-image | image | ❌ |
| image-to-image (no text) | image | ❌ |
| image-to-video | video | ❌ |
| text-to-video | video | ❌ |
| text-to-3d | 3D mesh | ❌ |
| image-to-3d | 3D mesh | ❌ |
| image-classification | label | ❌ |
| zero-shot-image-classification | label from runtime list | ❌ |
| object-detection | bboxes | ❌ |
| image-segmentation | masks | ❌ |
| depth-estimation | depth map | ❌ |
| image-feature-extraction | image embedding | ❌ |
| keypoint-detection | pose | ❌ |
| video-classification | label | ❌ |
| video-to-video | video | ❌ |
Natural Language Processing
| ID | Out | Status |
|---|---|---|
| text-to-embedding | f32 vector | ✅ |
| text-ranking (reranker) | scores | ❌ |
| translation | text | 🟡 |
| summarization | text | 🟡 |
| classical-nlp-tasks | various | ⛔ |
Audio
| ID | Out | Status |
|---|---|---|
| speech-to-text | text + segments | ✅ |
| text-to-speech | audio | 🟡 |
| text-to-audio (SFX) | audio | ❌ |
| text-to-music | audio | ❌ |
| audio-to-audio | audio | ❌ |
| audio-classification | label | ❌ |
| voice-activity-detection | speech spans | ❌ |
| audio-to-embedding | f32 vector | ❌ |
| voice-to-voice | audio + text | 🟡 |
Tabular / Other
| ID | Out | Status |
|---|---|---|
| time-series-forecasting | future values | ❌ candidate v2+ |
| out-of-scope | n/a | ⛔ tabular ML, RL, robotics, graph ML, fill-mask, unconditional image gen |
Infrastructure pillars
A model is just weights — running it as a Locara modality
requires six pillars of infrastructure around it. Every
per-modality file in ./modalities/ breaks
out its needs along these same six headings so it’s easy to
spot which pillar a new modality reuses vs. introduces.
| Pillar | What it covers | Examples already in Locara |
|---|---|---|
| Inference | The runtime that actually executes the model | locara-llama (autoregressive), locara-whisper (ASR specialist), locara-moshi (subprocess) |
| Input | Capturing / loading what goes into the model | locara-microphone, locara-screencapture-audio, file picker via fs.pick |
| Output | Routing what comes out of the model to the user | Streaming token Channel, voice playback queue (in-app), sqlite-vec for vectors |
| Storage | Persisting weights, session state, and outputs | locara-models::Cache (content-addressed weights), per-app SQLite, sqlite-vec |
| Interaction (IPC + SDK) | The dotted command name + shape the WebView calls | llm.chat_stream, transcribe.stream_*, voice.session_* |
| Capabilities (manifest) | What the app must declare to be allowed to use the modality | models[], device.microphone, fs.user-folder, etc. |
Per-modality files state explicitly which pillars are already covered (✅ with crate name) and which are gaps (❌ with what would need to be built). The cross-cutting table below is the roll-up showing which pillars are well-furnished vs. thin.
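The per-pillar breakdown could be made machine-checkable with a small descriptor. The following is a minimal sketch, not existing Locara code: the names `Pillar`, `Coverage`, `ModalityInfra`, `gaps`, and `example_text_to_image` are hypothetical, and the coverage values in the example entry are illustrative assumptions drawn loosely from the tables in this file.

```rust
/// The six infrastructure pillars from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Pillar {
    Inference,
    Input,
    Output,
    Storage,
    Interaction,
    Capabilities,
}

impl Pillar {
    /// All pillars, in the order the table lists them.
    pub const ALL: [Pillar; 6] = [
        Pillar::Inference,
        Pillar::Input,
        Pillar::Output,
        Pillar::Storage,
        Pillar::Interaction,
        Pillar::Capabilities,
    ];
}

/// Covered (with the crate or command providing it) or a gap
/// (with a note on what would need to be built).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Coverage {
    Covered(&'static str),
    Gap(&'static str),
}

/// Hypothetical per-modality record: one coverage entry per pillar.
#[derive(Debug, Clone)]
pub struct ModalityInfra {
    pub id: &'static str,
    pub pillars: Vec<(Pillar, Coverage)>,
}

impl ModalityInfra {
    /// Pillars that are explicit gaps, plus any pillar the record omits.
    pub fn gaps(&self) -> Vec<Pillar> {
        Pillar::ALL
            .iter()
            .copied()
            .filter(|p| {
                let entry = self.pillars.iter().find(|(q, _)| q == p);
                !matches!(entry, Some((_, Coverage::Covered(_))))
            })
            .collect()
    }
}

/// Illustrative entry only; the coverage values are assumptions.
pub fn example_text_to_image() -> ModalityInfra {
    ModalityInfra {
        id: "text-to-image",
        pillars: vec![
            (Pillar::Inference, Coverage::Gap("diffusion runtime")),
            (Pillar::Output, Coverage::Gap("output file router")),
            (Pillar::Storage, Coverage::Covered("locara-models::Cache")),
            (Pillar::Capabilities, Coverage::Covered("models[]")),
        ],
    }
}
```

Treating an omitted pillar as a gap is a deliberate choice in this sketch: a per-modality file that forgets a heading shows up in the roll-up instead of silently passing.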
Cross-cutting infrastructure
Reading down the per-modality “Infrastructure required” sections, the recurring needs are:
| Capability | Shared by | Locara today |
|---|---|---|
| Content-addressed weight cache + resumable fetch | every modality | ✅ locara-models::Cache |
| Inference backend trait (autoregressive, streaming) | text-to-text, code, thinking, VLM, voice | ✅ locara-core::InferenceBackend |
| Encoder-only inference (BERT-style, embeddings, classifiers, rerankers) | text-to-embedding, audio-to-embedding, image-feature-extraction, text-classification, NER, text-ranking, depth, classification | 🟡 (text only; audio, vision not wired) |
| Encoder-decoder inference (BART/MBART, translation, summarization specialists, Donut, Pix2Struct) | translation, summarization, document-QA | ❌ |
| Audio capture (mic) + system-audio | speech-to-text, voice-to-voice, VAD | ✅ |
| Audio playback w/ jitter buffer | text-to-speech, voice-to-voice, text-to-audio, text-to-music | 🟡 (in-app) |
| Image input pipeline (file → tensor) | every CV modality | ❌ |
| Video input pipeline | video-to-text, multimodal-omni, video-classification, video-to-video | ❌ |
| PDF / document rasterizer | document-QA, visual-document-retrieval | ❌ |
| Diffusion runtime | text-to-image, image-to-image, text-to-video, image-to-video, text-to-audio, text-to-music | ❌ |
| Vector store | text-to-embedding, audio-to-embedding, image-feature-extraction, visual-document-retrieval | ✅ sqlite-vec (text only — extension to multi-vector for ColPali still TBD) |
| Output file router (image / video / audio / 3D save) | many | ❌ |
| Long-running task IPC (progress, cancel, resume) | text-to-image, text-to-video, text-to-3d, image-to-3d | 🟡 (channels exist; semantics ad-hoc) |
| Picker UI per modality | every modality | 🟡 (voice picker exists; not a shared component) |
| Per-modality capability declaration | every modality | 🟡 (manifest grants models; modality expansion is partial) |
| Mask / box / depth / pose overlay UI | object-det, segmentation, depth, keypoints | ❌ |
Biggest missing rails (rank-ordered by how many modalities they unlock):
- Image input pipeline — 9 modalities depend on it, none built.
- Encoder-only inference for non-text — 8 modalities (audio embedding, image embedding, classifiers, rerankers, depth, classification).
- Diffusion runtime — 6 modalities (image, video, audio, music families).
- Encoder-decoder inference — 3 modalities (translation, document-QA, specialist summarization).
- PDF rasterizer — unlocks document-QA and ColPali, big for DocVault.
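The ranking above is a count of dependents per missing rail. As a rough sketch of how it could be recomputed when the catalogue is re-walked, the snippet below sorts rails by how many modalities list them. `Rail`, `rank_rails`, and `sample_rails` are hypothetical names; the sample data copies three rows from the "Shared by" column of the cross-cutting table and is not the full set.

```rust
use std::cmp::Reverse;

/// A missing rail and the modalities that depend on it.
#[derive(Debug, Clone)]
pub struct Rail {
    pub name: &'static str,
    pub unlocks: Vec<&'static str>,
}

/// Sort rails by how many modalities each unlocks, largest first.
/// `sort_by_key` is stable, so ties keep their input order.
pub fn rank_rails(mut rails: Vec<Rail>) -> Vec<Rail> {
    rails.sort_by_key(|r| Reverse(r.unlocks.len()));
    rails
}

/// Three rows from the cross-cutting table, deliberately out of order.
pub fn sample_rails() -> Vec<Rail> {
    vec![
        Rail {
            name: "PDF / document rasterizer",
            unlocks: vec!["document-QA", "visual-document-retrieval"],
        },
        Rail {
            name: "Diffusion runtime",
            unlocks: vec![
                "text-to-image",
                "image-to-image",
                "text-to-video",
                "image-to-video",
                "text-to-audio",
                "text-to-music",
            ],
        },
        Rail {
            name: "Encoder-decoder inference",
            unlocks: vec!["translation", "summarization", "document-QA"],
        },
    ]
}
```

A raw count ignores effort and product priority, so the output is a starting point for the punch-list rather than a schedule.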
Cross-modality observations
A few things that surfaced from compiling the survey:
- The fastest route to filling out the catalogue is the omni model. Qwen3.5-Omni single-handedly covers text-to-text, text-to-text-thinking, image-text-to-text, video-text-to-text, audio-text-to-text, voice-to-voice, and any-to-any. If the subprocess-style backend pattern works for Moshi, the same pattern applies here. One crate, seven modalities.
- Apple-Silicon-bound users get short-changed if we lean on PyTorch-only paths. MLX-native or llama.cpp-via-Metal is dramatically faster. Lock the framework to that for v1; PyTorch via subprocess as the fallback for models without an MLX port yet.
- Cool-down semantics are missing for several modalities. device.speaker, fs.user-folder (writing generated images / videos), device.camera (for live VLM apps) all need the re-consent-on-update protection that the spec already requires for device.microphone. Pending.
- Streaming progress is hand-rolled per modality. Voice sessions, transcribe sessions, LLM streams, hypothetical diffusion progress: each has its own IPC shape. A shared progress.* channel pattern (start → progress events → final) would make adding new modalities cheaper.
- Picker UIs are duplicated. Voice has its own omni picker (apps/voice/src/App.tsx). Each future modality will want one. A shared <ModelPicker modality="text-to-image" /> in @locara/components would be the right factoring.
- HF lumps closely-related tasks under one name; Locara is simpler if we collapse the same way. image-to-text, image-text-to-text, and visual-question-answering are one modality from a tooling perspective, since the model decides which sub-task it does well. Don’t proliferate manifest entries.
- Apple Vision is the cheapest first-implementation for many CV modalities: it covers OCR (already wired), object detection, classification, body/face landmarks at zero RAM cost. Worth prioritising the Swift-sidecar pattern (already used for OCR) before shipping Rust-side neural CV.
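The shared progress.* pattern mentioned in the observations (start → progress events → final) can be pinned down as a small state machine. This is a sketch under assumptions, not the existing IPC shape: the event fields (`task_id`, `fraction`, `ok`), the type names, and the rule that the fraction never decreases are all hypothetical choices made for illustration.

```rust
/// Events on a hypothetical shared `progress.*` channel.
#[derive(Debug, Clone, PartialEq)]
pub enum ProgressEvent {
    Start { task_id: u64 },
    Progress { task_id: u64, fraction: f32 },
    Final { task_id: u64, ok: bool },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Idle,
    Running,
    Done,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ProtocolError {
    NotStarted,
    AlreadyStarted,
    AlreadyFinished,
    WrongTask,
    BadFraction,
}

/// Tracks one task and rejects events that arrive out of order.
pub struct ProgressTracker {
    task_id: Option<u64>,
    state: TaskState,
    last_fraction: f32,
}

impl ProgressTracker {
    pub fn new() -> Self {
        ProgressTracker { task_id: None, state: TaskState::Idle, last_fraction: 0.0 }
    }

    /// Apply one event; returns the new state or a protocol error.
    pub fn apply(&mut self, ev: &ProgressEvent) -> Result<TaskState, ProtocolError> {
        match (self.state, ev) {
            (TaskState::Done, _) => Err(ProtocolError::AlreadyFinished),
            (TaskState::Idle, ProgressEvent::Start { task_id }) => {
                self.task_id = Some(*task_id);
                self.state = TaskState::Running;
                Ok(self.state)
            }
            (TaskState::Idle, _) => Err(ProtocolError::NotStarted),
            (TaskState::Running, ProgressEvent::Start { .. }) => {
                Err(ProtocolError::AlreadyStarted)
            }
            (TaskState::Running, ProgressEvent::Progress { task_id, fraction }) => {
                if Some(*task_id) != self.task_id {
                    return Err(ProtocolError::WrongTask);
                }
                // Assumed rule: fraction stays in [0, 1] and never decreases.
                if !(0.0..=1.0).contains(fraction) || *fraction < self.last_fraction {
                    return Err(ProtocolError::BadFraction);
                }
                self.last_fraction = *fraction;
                Ok(self.state)
            }
            (TaskState::Running, ProgressEvent::Final { task_id, .. }) => {
                if Some(*task_id) != self.task_id {
                    return Err(ProtocolError::WrongTask);
                }
                self.state = TaskState::Done;
                Ok(self.state)
            }
        }
    }
}
```

Rejected events leave the tracker unchanged, so a consumer can log the error and keep listening. Cancel and resume, which the long-running task row in the cross-cutting table also calls for, are left out of this sketch.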
What this means for the BACKLOG
The existing BACKLOG entry “Modalities + capabilities + models catalogue” is correctly framed but underspecified. With this survey in hand, the concrete punch-list becomes:
- Tier 1: close gaps in modalities Locara already partially ships (low effort, high user-visible win):
  - Apple SpeechAnalyzer integration for speech-to-text
  - Stream-splitter for text-to-text-thinking
  - VLM crate (locara-vlm) for image-text-to-text
  - Audio embedding crate for audio-to-embedding
  - Spec entry for text-to-code + curated coder list
  - Helper persistence + AudioWorklet time-stretcher for voice-to-voice (separate items)
  - Silero VAD for voice-activity-detection (small model, large UX win for voice apps)
  - Reranker (BGE-Reranker-V2-M3) for text-ranking: biggest single retrieval-quality multiplier
- Tier 2: new modalities, ride on existing rails:
  - Qwen3.5-Omni via subprocess for any-to-any (mirrors Moshi shape; highest-leverage, covers 7 modalities at once)
  - SAM 2 for image-segmentation: real-time, Apache-2.0, unlocks background-removal-class apps
  - Depth-Anything-V2 for depth-estimation: tiny model, broad creative use cases
  - DINOv2 + SigLIP wired for image-feature-extraction / zero-shot-image-classification: visual search apps
- Tier 3: new infrastructure required:
  - Image input pipeline in the SDK: biggest single rail enabling 9+ modalities
  - Encoder-only inference for non-text models (audio embed, vision embed, classifiers)
  - Diffusion runtime crate (locara-diffusion): unlocks text-to-image, image-text-to-image, text-to-audio, text-to-music, eventually text-to-video and image-to-video
  - PDF rasterizer + DocVQA model: unlocks DocVault as a real product
  - ColPali integration for visual-document-retrieval: follows directly from PDF rasterizer + multi-vector storage
- Tier 4: defer, hardware-bound or niche:
  - text-to-video / image-to-video: VRAM costs don’t fit Locara’s 16 GB-laptop promise yet
  - text-to-3d / image-to-3d: niche; depends on a 3D viewer component existing first
  - time-series-forecasting: interesting but no reference app motivates v1
  - Classical NLP tasks (classical-nlp-tasks): chat LLM covers; specialize only if a reference app needs the speed/determinism
- Out of scope (explicitly), see out-of-scope.md:
  - Tabular classification/regression
  - Reinforcement learning, robotics, graph ML
  - Unconditional image generation
  - Fill-mask as an app-facing modality
References
(Selected. Per-modality references are inline in each modality
file under ./modalities/.)
- HF taxonomy: HuggingFace Tasks
- General catalogues: BentoML: open-source LLMs, Will It Run AI: 2026 reasoning VRAM guide
- Speech: DigitalOcean: best TTS models 2026, BentoML: TTS
- Vision-Language: BentoML: VLMs, Labellerr: open-source VLMs 2026
- Image: BentoML: image generation, SiliconFlow: lightweight image-gen 2026
- Video: Hyperstack: video gen 2026, Spheron: image-to-video on GPU cloud 2026
- Audio / music: SiliconFlow: music generation, Tutorialsdojo: audio AI 2026
- Embeddings: BentoML: embeddings, Tiger Data: best for RAG
- Reasoning: Clarifai: top reasoning 2026, TokenMix: QwQ-32B vs R1
- Coding: MindStudio: agentic coding LLMs
- Document retrieval: ColPali paper (arXiv)
- Reranking: ZeroEntropy: open-source rerank alternatives
- VAD: Silero VAD GitHub, Picovoice: VAD comparison
- Translation: SiliconFlow: translation 2026, Meta Omnilingual MT
- Segmentation: SAM 2 GitHub
- Depth: Depth-Anything-V2 GitHub
- Image features: DINOv2 GitHub
- Time series: Chronos GitHub
- Omni: Qwen3-Omni release
- Existing Locara work: spec/04-modalities.md, notes/voice-to-voice-slms.md, notes/whisper-and-stt-landscape.md, notes/mlx.md