Modalities & Models — Locara survey

A research note for designing Locara’s first-class modality catalogue. The canonical (terse, normative) list lives in spec/04-modalities.md. This document is the research backing: an exhaustive map of input/output transformations a foundation model can perform, the tooling each needs, and an honest ledger of what Locara has built.

The point of writing this down is that voice-to-voice was painful to wire because we discovered its surface area piece by piece (streaming audio I/O, frame protocol, jitter buffer, picker UI, capability declaration, IPC commands). Repeating that discovery for each new modality is wasteful. The plan: lay the whole map out once, identify the cross-cutting infrastructure, and ship modalities as tightly scoped extensions on shared rails.

Snapshot taken: 2026-05-03. The foundation-model landscape shifts within months, so re-walk the catalogue at every phase boundary.

Taxonomy source: this doc mirrors HuggingFace’s task taxonomy. Per-modality detail is split into individual files under ./modalities/. This file holds the overview, taxonomy table, cross-cutting infrastructure, cross-modality observations, and the BACKLOG punch-list.

How to use this document

- Looking up a single modality → go to ./modalities/<name>.md directly. Every file follows the same shape: What it is / Models / Tooling / Locara today / Missing / See also.
- Framework planning → read the cross-cutting infrastructure section and the BACKLOG punch-list at the end of this file.
- Research orientation → the taxonomy table below lists every modality with its status; the cross-modality observations section names the recurring patterns.

Status legend (used everywhere):

Symbol	Meaning
✅	Built, tested, used by a reference app
🟡	Crate exists, partial implementation, gaps documented
⏳	Skeleton crate / spec only
❌	Not built
⛔	Out of scope for Locara v1 (recorded for completeness)

Modality taxonomy (HF-grouped)

Every modality lives at ./modalities/<id>.md. Click through for models, tooling, status, and gaps.

ID	Out	Status
`text-to-text`	text	✅
`text-to-text-thinking`	text + reasoning trace	🟡
`text-to-code`	code	🟡
`audio-text-to-text`	text	❌
`image-text-to-text`	text	🟡 (OCR ✅, VLM ❌)
`image-text-to-image`	image	❌
`image-text-to-video`	video	❌
`document-question-answering`	text answer	❌
`video-text-to-text`	text	❌
`visual-document-retrieval`	ranking	❌
`any-to-any` (omni)	text + audio + image	⏳

Computer Vision

ID	Out	Status
`text-to-image`	image	❌
`image-to-image` (no text)	image	❌
`image-to-video`	video	❌
`text-to-video`	video	❌
`text-to-3d`	3D mesh	❌
`image-to-3d`	3D mesh	❌
`image-classification`	label	❌
`zero-shot-image-classification`	label from runtime list	❌
`object-detection`	bboxes	❌
`image-segmentation`	masks	❌
`depth-estimation`	depth map	❌
`image-feature-extraction`	image embedding	❌
`keypoint-detection`	pose	❌
`video-classification`	label	❌
`video-to-video`	video	❌

Natural Language Processing

ID	Out	Status
`text-to-embedding`	f32 vector	✅
`text-ranking` (reranker)	scores	❌
`translation`	text	🟡
`summarization`	text	🟡
`classical-nlp-tasks`	various	⛔

Audio

ID	Out	Status
`speech-to-text`	text + segments	✅
`text-to-speech`	audio	🟡
`text-to-audio` (SFX)	audio	❌
`text-to-music`	audio	❌
`audio-to-audio`	audio	❌
`audio-classification`	label	❌
`voice-activity-detection`	speech spans	❌
`audio-to-embedding`	f32 vector	❌
`voice-to-voice`	audio + text	🟡

Tabular / Other

ID	Out	Status
`time-series-forecasting`	future values	❌ candidate v2+
`out-of-scope`	n/a	⛔ tabular ML, RL, robotics, graph ML, fill-mask, unconditional image gen

Infrastructure pillars

A model is just weights — running it as a Locara modality requires six pillars of infrastructure around it. Every per-modality file in ./modalities/ breaks out its needs along these same six headings so it’s easy to spot which pillar a new modality reuses vs. introduces.

Pillar	What it covers	Examples already in Locara
Inference	The runtime that actually executes the model	`locara-llama` (autoregressive), `locara-whisper` (ASR specialist), `locara-moshi` (subprocess)
Input	Capturing / loading what goes into the model	`locara-microphone`, `locara-screencapture-audio`, file picker via `fs.pick`
Output	Routing what comes out of the model to the user	Streaming token Channel, voice playback queue (in-app), `sqlite-vec` for vectors
Storage	Persisting weights, session state, and outputs	`locara-models::Cache` (content-addressed weights), per-app SQLite, `sqlite-vec`
Interaction (IPC + SDK)	The dotted command name + shape the WebView calls	`llm.chat_stream`, `transcribe.stream_*`, `voice.session_*`
Capabilities (manifest)	What the app must declare to be allowed to use the modality	`models[]`, `device.microphone`, `fs.user-folder`, etc.

Per-modality files state explicitly which pillars are already covered (✅ with crate name) and which are gaps (❌ with what would need to be built). The cross-cutting table below is the roll-up showing which pillars are well-furnished vs. thin.
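
As a rough illustration of that per-pillar ledger, the sketch below models one modality's coverage as data and lists its gaps. The type and function names (`Pillar`, `PillarLedger`, `pillarGaps`) are hypothetical and not part of any Locara API, and both sample ledgers are illustrative readings of the tables in this file, not copies of the real per-modality files.

```typescript
// Hypothetical types: none of these names come from the Locara codebase.
type Pillar =
  | "inference"
  | "input"
  | "output"
  | "storage"
  | "interaction"
  | "capabilities";

// "covered" names what already does the job; "gap" says what is missing.
type PillarEntry =
  | { status: "covered"; by: string }
  | { status: "gap"; needs: string };

type PillarLedger = Record<Pillar, PillarEntry>;

// Illustrative ledger for speech-to-text. The pairing of names to pillars
// is an assumption based on the pillar table.
const speechToTextLedger: PillarLedger = {
  inference: { status: "covered", by: "locara-whisper" },
  input: { status: "covered", by: "locara-microphone" },
  output: { status: "covered", by: "streaming Channel (assumed)" },
  storage: { status: "covered", by: "locara-models::Cache" },
  interaction: { status: "covered", by: "streaming transcribe commands" },
  capabilities: { status: "covered", by: "device.microphone" },
};

// Illustrative ledger for a modality that needs new rails.
const textToImageLedger: PillarLedger = {
  inference: { status: "gap", needs: "diffusion runtime" },
  input: { status: "covered", by: "text prompt over IPC (assumed)" },
  output: { status: "gap", needs: "output file router" },
  storage: { status: "covered", by: "locara-models::Cache" },
  interaction: { status: "gap", needs: "long-running task IPC semantics" },
  capabilities: { status: "gap", needs: "per-modality capability declaration" },
};

// Returns the pillars that still need building, in declaration order.
function pillarGaps(ledger: PillarLedger): Pillar[] {
  return (Object.keys(ledger) as Pillar[]).filter(
    (pillar) => ledger[pillar].status === "gap",
  );
}

const speechGaps = pillarGaps(speechToTextLedger);
const imageGaps = pillarGaps(textToImageLedger);
```

With these sample ledgers, `speechGaps` is empty and `imageGaps` names the four pillars a text-to-image modality would have to introduce, which is the "reuses vs. introduces" question each modality file is meant to answer.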

Cross-cutting infrastructure

Reading down the per-modality “Infrastructure required” sections, the recurring needs are:

Capability	Shared by	Locara today
Content-addressed weight cache + resumable fetch	every modality	✅ `locara-models::Cache`
Inference backend trait (autoregressive, streaming)	text-to-text, code, thinking, VLM, voice	✅ `locara-core::InferenceBackend`
Encoder-only inference (BERT-style, embeddings, classifiers, rerankers)	text-to-embedding, audio-to-embedding, image-feature-extraction, text-classification, NER, text-ranking, depth, classification	🟡 (text only; audio, vision not wired)
Encoder-decoder inference (BART/MBART, translation, summarization specialists, Donut, Pix2Struct)	translation, summarization, document-QA	❌
Audio capture (mic) + system-audio	speech-to-text, voice-to-voice, VAD	✅
Audio playback w/ jitter buffer	text-to-speech, voice-to-voice, text-to-audio, text-to-music	🟡 (in-app)
Image input pipeline (file → tensor)	every CV modality	❌
Video input pipeline	video-text-to-text, any-to-any, video-classification, video-to-video	❌
PDF / document rasterizer	document-QA, visual-document-retrieval	❌
Diffusion runtime	text-to-image, image-to-image, text-to-video, image-to-video, text-to-audio, text-to-music	❌
Vector store	text-to-embedding, audio-to-embedding, image-feature-extraction, visual-document-retrieval	✅ `sqlite-vec` (text only — extension to multi-vector for ColPali still TBD)
Output file router (image / video / audio / 3D save)	many	❌
Long-running task IPC (progress, cancel, resume)	text-to-image, text-to-video, text-to-3d, image-to-3d	🟡 (channels exist; semantics ad-hoc)
Picker UI per modality	every modality	🟡 (voice picker exists; not a shared component)
Per-modality capability declaration	every modality	🟡 (manifest grants models; modality expansion is partial)
Mask / box / depth / pose overlay UI	object-det, segmentation, depth, keypoints	❌

Biggest missing rails (rank-ordered by how many modalities they unlock):

1. Image input pipeline: 9+ modalities depend on it, none built.
2. Encoder-only inference for non-text: 8 modalities (audio embedding, image embedding, classifiers, rerankers, depth, classification).
3. Diffusion runtime: 6 modalities (image, video, audio, music families).
4. Encoder-decoder inference: 3 modalities (translation, document-QA, specialist summarization).
5. PDF rasterizer: unlocks document-QA and visual-document-retrieval (ColPali), which matters most for DocVault.
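
The ranking above is a sort by unlock count, so it can be recomputed whenever the catalogue is re-walked. The sketch below is a minimal version of that calculation. The `Rail` shape and `rankRails` function are hypothetical; the counts are the ones quoted in the list, with the image pipeline's figure treated as a lower bound and the PDF rasterizer's count inferred from the two modalities named for it.

```typescript
// Hypothetical data shape: a missing rail and how many modalities it unlocks.
interface Rail {
  name: string;
  unlocks: number;
}

// Deliberately listed out of order to show that the sort does the ranking.
const missingRails: Rail[] = [
  { name: "PDF rasterizer", unlocks: 2 },
  { name: "Diffusion runtime", unlocks: 6 },
  { name: "Image input pipeline", unlocks: 9 },
  { name: "Encoder-decoder inference", unlocks: 3 },
  { name: "Encoder-only inference (non-text)", unlocks: 8 },
];

// Highest unlock count first. Copies the array so the input is not mutated.
function rankRails(rails: Rail[]): Rail[] {
  return [...rails].sort((a, b) => b.unlocks - a.unlocks);
}

const rankedRailNames = rankRails(missingRails).map((rail) => rail.name);
```

Unlock count is only one possible weighting. Effort per rail, or whether a reference app is waiting on it, could be folded into the comparator without changing the shape of the data.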

Cross-modality observations

A few things that surfaced from compiling the survey:

- The fastest route to filling out the catalogue is the omni model. Qwen3.5-Omni alone covers text-to-text, text-to-text-thinking, image-text-to-text, video-text-to-text, audio-text-to-text, voice-to-voice, and any-to-any. If the subprocess-style backend pattern works for Moshi, the same pattern applies here: one crate, seven modalities.
- Apple-Silicon-bound users get short-changed if we lean on PyTorch-only paths. MLX-native or llama.cpp-via-Metal inference is dramatically faster on that hardware. Lock the framework to those two paths for v1, and keep PyTorch via subprocess as the fallback for models that have no MLX port yet.
- Cool-down semantics are missing for several capabilities. device.speaker, fs.user-folder (writing generated images / videos), and device.camera (for live VLM apps) all need the re-consent-on-update protection that the spec already requires for device.microphone. This work is pending.
- Streaming progress is hand-rolled per modality. Voice sessions, transcribe sessions, LLM streams, and hypothetical diffusion progress each have their own IPC shape. A shared progress.* channel pattern (start → progress events → final) would make adding new modalities cheaper.
- Picker UIs are duplicated. Voice has its own omni picker (apps/voice/src/App.tsx), and each future modality will want one. A shared <ModelPicker modality="text-to-image" /> in @locara/components would be the right factoring.
- HF lumps closely related tasks under one name, and Locara is simpler if we collapse the same way. image-to-text, image-text-to-text, and visual-question-answering are one modality from a tooling perspective, since the model decides which sub-task it does well. Don’t proliferate manifest entries.
- Apple Vision is the cheapest first implementation for many CV modalities: it covers OCR (already wired), object detection, classification, and body/face landmarks at zero RAM cost. It is worth prioritising the Swift-sidecar pattern (already used for OCR) before shipping Rust-side neural CV.
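
The shared progress.* pattern mentioned above (start → progress events → final) could look roughly like the following. This is a sketch under assumptions: the event fields (`kind`, `taskId`, `fraction`, `result`) and the reducer are hypothetical, and nothing here reflects an existing Locara IPC shape.

```typescript
// Hypothetical event shape for a shared progress.* channel.
type ProgressEvent<T> =
  | { kind: "start"; taskId: string; modality: string }
  | { kind: "progress"; taskId: string; fraction: number }
  | { kind: "final"; taskId: string; result: T };

// What a UI would keep per task while events arrive.
interface TaskState<T> {
  started: boolean;
  fraction: number;
  result?: T;
}

// Folds one event into the task state without mutating the previous state.
function reduceProgress<T>(
  state: TaskState<T>,
  event: ProgressEvent<T>,
): TaskState<T> {
  switch (event.kind) {
    case "start":
      return { started: true, fraction: 0 };
    case "progress": {
      // Clamp to [0, 1] and never let the bar move backwards.
      const clamped = Math.min(1, Math.max(0, event.fraction));
      return { ...state, fraction: Math.max(state.fraction, clamped) };
    }
    case "final":
      return { ...state, fraction: 1, result: event.result };
  }
}

// Usage: the same reducer could serve diffusion, transcription, or any
// other long-running task, because only `result` differs per modality.
const sampleEvents: ProgressEvent<string>[] = [
  { kind: "start", taskId: "t1", modality: "text-to-image" },
  { kind: "progress", taskId: "t1", fraction: 0.4 },
  { kind: "final", taskId: "t1", result: "out.png" },
];

let taskState: TaskState<string> = { started: false, fraction: 0 };
for (const event of sampleEvents) {
  taskState = reduceProgress(taskState, event);
}
```

Keeping the payload generic (`T`) is the design choice that matters here: the envelope stays identical across modalities, so a new modality only has to define what its final result looks like.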

What this means for the BACKLOG

The existing BACKLOG entry “Modalities + capabilities + models catalogue” is correctly framed but underspecified. With this survey in hand, the concrete punch-list becomes:

Tier 1 — close gaps in modalities Locara already partially ships (low effort, high user-visible win):
- Apple SpeechAnalyzer integration for speech-to-text
- Stream-splitter for text-to-text-thinking
- VLM crate (locara-vlm) for image-text-to-text
- Audio embedding crate for audio-to-embedding
- Spec entry for text-to-code + curated coder list
- Helper persistence + AudioWorklet time-stretcher for voice-to-voice (separate items)
- Silero VAD for voice-activity-detection (small model, large UX win for voice apps)
- Reranker (BGE-Reranker-V2-M3) for text-ranking — biggest single retrieval-quality multiplier
Tier 2 — new modalities, ride on existing rails:
- Qwen3.5-Omni via subprocess for any-to-any (mirrors Moshi shape; highest-leverage — covers 7 modalities at once)
- SAM 2 for image-segmentation — real-time, Apache-2.0, unlocks background-removal-class apps
- Depth-Anything-V2 for depth-estimation — tiny model, broad creative use cases
- DINOv2 + SigLIP wired for image-feature-extraction / zero-shot-image-classification — visual search apps
Tier 3 — new infrastructure required:
- Image input pipeline in the SDK — biggest single rail enabling 9+ modalities
- Encoder-only inference for non-text models (audio embed, vision embed, classifiers)
- Diffusion runtime crate (locara-diffusion) — unlocks text-to-image, image-text-to-image, text-to-audio, text-to-music, eventually text-to-video and image-to-video
- PDF rasterizer + DocVQA model — unlocks DocVault as a real product
- ColPali integration for visual-document-retrieval — follows directly from PDF rasterizer + multi-vector storage
Tier 4 — defer, hardware-bound or niche:
- text-to-video / image-to-video — VRAM costs don’t fit Locara’s 16 GB-laptop promise yet
- text-to-3d / image-to-3d — niche; depends on a 3D viewer component existing first
- time-series-forecasting — interesting but no reference app motivates v1
- Classical NLP tasks (classical-nlp-tasks) — chat LLM covers; specialize only if a reference app needs the speed/determinism
Out of scope (explicitly) — see out-of-scope.md:
- Tabular classification/regression
- Reinforcement learning, robotics, graph ML
- Unconditional image generation
- Fill-mask as an app-facing modality
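
For the Tier 1 stream-splitter item, one possible shape is a small stateful splitter that routes streamed text to a reasoning channel or an answer channel. The sketch assumes the model wraps its reasoning in `<think>` and `</think>` tags, which some open reasoning models do but which this survey does not specify. The names `createThinkingSplitter` and `Segment` are hypothetical.

```typescript
// Hypothetical names throughout; the <think> tag convention is an assumption.
type Segment = { channel: "reasoning" | "answer"; text: string };

// Length of the longest suffix of `text` that is a proper prefix of `tag`.
function partialTagSuffix(text: string, tag: string): number {
  const max = Math.min(text.length, tag.length - 1);
  for (let n = max; n > 0; n--) {
    if (text.endsWith(tag.slice(0, n))) return n;
  }
  return 0;
}

function createThinkingSplitter(open = "<think>", close = "</think>") {
  let buffer = "";
  let inThink = false;

  // Appends non-empty text to `out`, tagged with the current channel.
  const emit = (out: Segment[], text: string): void => {
    if (text.length > 0) {
      out.push({ channel: inThink ? "reasoning" : "answer", text });
    }
  };

  return {
    // Feed one streamed chunk; returns the segments that are safe to emit.
    push(chunk: string): Segment[] {
      buffer += chunk;
      const out: Segment[] = [];
      while (true) {
        const tag = inThink ? close : open;
        const idx = buffer.indexOf(tag);
        if (idx === -1) break;
        emit(out, buffer.slice(0, idx));
        buffer = buffer.slice(idx + tag.length);
        inThink = !inThink;
      }
      // Hold back a tail that might be the start of the next tag.
      const keep = partialTagSuffix(buffer, inThink ? close : open);
      emit(out, buffer.slice(0, buffer.length - keep));
      buffer = buffer.slice(buffer.length - keep);
      return out;
    },
    // Call once at end of stream to release any held-back text.
    flush(): Segment[] {
      const out: Segment[] = [];
      emit(out, buffer);
      buffer = "";
      return out;
    },
  };
}

// Usage: tags split across chunk boundaries are still recognised.
const splitter = createThinkingSplitter();
const segments: Segment[] = [];
for (const chunk of ["<thi", "nk>plan", " it</th", "ink>Hello", " world"]) {
  segments.push(...splitter.push(chunk));
}
segments.push(...splitter.flush());

const reasoningText = segments
  .filter((s) => s.channel === "reasoning")
  .map((s) => s.text)
  .join("");
const answerText = segments
  .filter((s) => s.channel === "answer")
  .map((s) => s.text)
  .join("");
```

Holding back a possible partial tag costs at most a few characters of latency per chunk and avoids leaking tag fragments into either channel. A model that signals reasoning some other way (for example, a separate field in the backend's output) would not need this splitter at all.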

References

(Selected. Per-modality references are inline in each modality file under ./modalities/.)

HF taxonomy: HuggingFace Tasks
General catalogues: BentoML: open-source LLMs, Will It Run AI: 2026 reasoning VRAM guide
Speech: DigitalOcean: best TTS models 2026, BentoML: TTS
Vision-Language: BentoML: VLMs, Labellerr: open-source VLMs 2026
Image: BentoML: image generation, SiliconFlow: lightweight image-gen 2026
Video: Hyperstack: video gen 2026, Spheron: image-to-video on GPU cloud 2026
Audio / music: SiliconFlow: music generation, Tutorialsdojo: audio AI 2026
Embeddings: BentoML: embeddings, Tiger Data: best for RAG
Reasoning: Clarifai: top reasoning 2026, TokenMix: QwQ-32B vs R1
Coding: MindStudio: agentic coding LLMs
Document retrieval: ColPali paper (arXiv)
Reranking: ZeroEntropy: open-source rerank alternatives
VAD: Silero VAD GitHub, Picovoice: VAD comparison
Translation: SiliconFlow: translation 2026, Meta Omnilingual MT
Segmentation: SAM 2 GitHub
Depth: Depth-Anything-V2 GitHub
Image features: DINOv2 GitHub
Time series: Chronos GitHub
Omni: Qwen3-Omni release
Existing Locara work: spec/04-modalities.md, notes/voice-to-voice-slms.md, notes/whisper-and-stt-landscape.md, notes/mlx.md