`image-text-to-text`

HF group: Multimodal · Status: 🟡 partial (OCR ✅, VLM ❌)

HF aliases: image-to-text, visual-question-answering. Hugging Face lists these separately because models were historically specialized (OCR-only, caption-only, or VQA-only), but from a tooling perspective they are one modality: the model that is picked determines which sub-tasks it handles best.

What it is

Image + (optional) text prompt → text. Covers:

Captioning — “describe this”.
VQA — “how many cats are in this picture?”.
OCR — “transcribe verbatim”.
Document parsing — “extract the table” (overlaps with document-question-answering).

Open-weight models

Model	Params	Released	License	Quality	Notes
MiniCPM-V 3 B	3 B	2024	Apache-2.0	Lightweight VLM	64-token visual rep — great for laptops.
Qwen2.5-VL-7B	7 B	2025	Apache-2.0	Strong	Video input also.
Qwen2.5-VL-72B	72 B	2025	Qwen license	Top open VLM	Heavy, M-series Pro/Max only.
Llama-3.2-Vision-11B	11 B	2024	Llama community	Strong OCR + DocVQA	128k context.
Gemma-3-Vision (variants)	4-27 B	2025	Gemma	Strong	Multilingual.
Florence-2-large	770 M	2024	MIT	Specialist (OCR, detection, caption)	Very fast for narrow tasks.
Apple Vision (`VNRecognizeTextRequest`)	n/a	macOS	Apple	Solid OCR	What `locara-vision-ocr` wraps today.

Infrastructure required

Inference

✅ OCR specialist: locara-vision-ocr (macOS Vision native — zero RAM, no model download).
❌ General VLM crate (locara-vlm) for caption / VQA / document parsing. It would likely be backed by llama.cpp’s mtmd path on Apple Silicon.

Input

❌ Image input pipeline in the SDK (file → bytes → decoded tensor). Biggest single missing rail — 9 modalities depend on it.
File picker via fs.user-selected.
Optional text prompt.
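
The file → bytes → decoded tensor path could look roughly like the sketch below. It is not the SDK's API: `read_image_bytes`, `sniff_format`, and `rgb8_to_chw` are hypothetical names, the sketch uses only the standard library, and the actual decode step (compressed bytes → RGB8 pixels) is left out because it would need a real decoder.

```rust
use std::{fs, io, path::Path};

/// Container formats recognised by magic-byte sniffing (illustrative subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFormat {
    Png,
    Jpeg,
    Gif,
    WebP,
}

/// Step 1: file -> bytes. The path is assumed to come from a user-driven file picker.
pub fn read_image_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Step 2: identify the container from its leading bytes before decoding.
pub fn sniff_format(bytes: &[u8]) -> Option<ImageFormat> {
    const PNG: &[u8] = &[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if bytes.starts_with(PNG) {
        Some(ImageFormat::Png)
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else if bytes.starts_with(b"GIF87a") || bytes.starts_with(b"GIF89a") {
        Some(ImageFormat::Gif)
    } else if bytes.len() >= 12 && &bytes[0..4] == b"RIFF" && &bytes[8..12] == b"WEBP" {
        Some(ImageFormat::WebP)
    } else {
        None
    }
}

/// Step 3: decoded, interleaved RGB8 pixels (HWC) -> planar f32 tensor (CHW) in [0, 1].
/// Returns None when the buffer length does not match width * height * 3.
pub fn rgb8_to_chw(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    if pixels.len() != width * height * 3 {
        return None;
    }
    let plane = width * height;
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            out[c * plane + i] = px[c] as f32 / 255.0;
        }
    }
    Some(out)
}
```

Sniffing before decoding lets the pipeline reject unsupported files early. The CHW layout and the [0, 1] scaling are assumptions; a given model may expect a different layout or normalization.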

Output

✅ For OCR: structured text + bounding boxes via ocr.from_bytes.
❌ For VLM: streaming token Channel (same shape as text-to-text).
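
To make the two output shapes concrete, here is a minimal sketch using only the standard library. The type names (`OcrLine`, `BoundingBox`, `StreamEvent`) are hypothetical and do not reflect the real `ocr.from_bytes` payload or the text-to-text Channel type. The reading-order sort assumes a top-left coordinate origin, which may not match what the OCR backend reports.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical bounding box; a top-left origin is assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct BoundingBox {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Hypothetical OCR result item: recognised text plus its location.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrLine {
    pub text: String,
    pub bbox: BoundingBox,
}

/// Join OCR lines in reading order: top to bottom, then left to right.
pub fn plain_text(lines: &[OcrLine]) -> String {
    let mut sorted: Vec<&OcrLine> = lines.iter().collect();
    sorted.sort_by(|a, b| {
        a.bbox.y
            .total_cmp(&b.bbox.y)
            .then(a.bbox.x.total_cmp(&b.bbox.x))
    });
    sorted.iter().map(|l| l.text.as_str()).collect::<Vec<_>>().join("\n")
}

/// Events on a hypothetical VLM token stream.
#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    Token(String),
    Done,
}

/// Stand-in producer: emits canned tokens from a worker thread, then Done.
pub fn stream_tokens(tokens: Vec<String>) -> mpsc::Receiver<StreamEvent> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for t in tokens {
            if tx.send(StreamEvent::Token(t)).is_err() {
                return; // receiver dropped, stop generating
            }
        }
        let _ = tx.send(StreamEvent::Done);
    });
    rx
}

/// Drain the stream into one string, stopping at Done or on disconnect.
pub fn collect_text(rx: mpsc::Receiver<StreamEvent>) -> String {
    let mut out = String::new();
    for ev in rx {
        match ev {
            StreamEvent::Token(t) => out.push_str(&t),
            StreamEvent::Done => break,
        }
    }
    out
}
```

A real consumer would render each `Token` as it arrives instead of collecting. The explicit `Done` event is one possible way to signal the end of a stream; closing the channel is another.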

Storage

✅ Weights via locara-models::Cache.
App-side: extracted OCR results / captions persisted via locara-storage.

Interaction (IPC + SDK)

✅ IPC: ocr.from_bytes (built on locara-vision-ocr).
❌ IPC stub: vlm.describe, vlm.ask reserved in spec, not implemented.
Picker UI for “OCR vs caption vs Q&A” intent — not yet a shared component.
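
A shared intent picker could route each choice to one of the IPC methods listed above. In the sketch below the method names come from this page, but the `Intent` and `Route` types are hypothetical. Mapping captions to `vlm.describe` and Q&A to `vlm.ask` is an assumption based on the method names.

```rust
/// What the user wants from the image (hypothetical picker choices).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Intent {
    Ocr,
    Caption,
    Ask,
}

/// IPC method for an intent, plus whether it is implemented today.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Route {
    pub method: &'static str,
    pub implemented: bool,
}

/// Map an intent to its IPC method, using the status recorded on this page.
pub fn route(intent: Intent) -> Route {
    match intent {
        Intent::Ocr => Route { method: "ocr.from_bytes", implemented: true },
        // Reserved in the spec, not implemented yet.
        Intent::Caption => Route { method: "vlm.describe", implemented: false },
        Intent::Ask => Route { method: "vlm.ask", implemented: false },
    }
}

/// Intents a picker could safely offer right now.
pub fn available_intents() -> Vec<Intent> {
    [Intent::Ocr, Intent::Caption, Intent::Ask]
        .iter()
        .copied()
        .filter(|i| route(*i).implemented)
        .collect()
}
```

Keeping the implemented flag next to the method name lets a picker hide or grey out the caption and Q&A options until the VLM path exists.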

Capabilities (manifest)

✅ capabilities.fs.user-selected: true for file-input apps.
❌ capabilities.device.camera for live VLM apps (cool-down semantics not enforced).
capabilities.models[] for the VLM model.
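
As an illustration of how these manifest entries might gate a request, the sketch below checks a declared capability set against an image source and an optional model. The `Capabilities` struct, the `Source` enum, and `missing_capabilities` are hypothetical; only the capability key names come from this page. Camera cool-down semantics are not modelled.

```rust
/// In-memory view of the manifest keys named above (hypothetical shape).
#[derive(Debug, Default, Clone)]
pub struct Capabilities {
    pub fs_user_selected: bool, // capabilities.fs.user-selected
    pub device_camera: bool,    // capabilities.device.camera
    pub models: Vec<String>,    // capabilities.models[]
}

/// Where the image comes from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    PickedFile,
    LiveCamera,
}

/// List the capability entries a request needs but the manifest lacks.
/// An empty result means the request is permitted under this sketch.
pub fn missing_capabilities(
    caps: &Capabilities,
    source: Source,
    model: Option<&str>,
) -> Vec<String> {
    let mut missing = Vec::new();
    match source {
        Source::PickedFile if !caps.fs_user_selected => {
            missing.push("capabilities.fs.user-selected".to_string());
        }
        Source::LiveCamera if !caps.device_camera => {
            missing.push("capabilities.device.camera".to_string());
        }
        _ => {}
    }
    // A VLM request names a model; plain OCR passes None.
    if let Some(m) = model {
        if !caps.models.iter().any(|declared| declared == m) {
            missing.push(format!("capabilities.models[] entry for {}", m));
        }
    }
    missing
}
```

Returning the list of missing entries, instead of a plain boolean, gives the app author an actionable error message.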

Gaps

General VLM crate (locara-vlm), probably best backed by llama.cpp’s mtmd path on Apple Silicon.
vlm.describe and vlm.ask IPC stubs are reserved in the spec but not implemented.
Cross-platform OCR fallback (current crate is macOS-only).
Image input pipeline in the SDK.
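
One way to approach the cross-platform OCR gap is a backend trait with an ordered fallback chain. This is only a sketch under stated assumptions: `OcrBackend`, `AppleVision`, `PortableFallback`, and `pick_backend` are hypothetical names, and the portable backend is a placeholder, not a specific engine.

```rust
/// Minimal interface a backend would expose for selection purposes.
pub trait OcrBackend {
    fn name(&self) -> &'static str;
    fn is_available(&self) -> bool;
}

/// Stand-in for the macOS Vision path that locara-vision-ocr wraps.
pub struct AppleVision;

impl OcrBackend for AppleVision {
    fn name(&self) -> &'static str {
        "apple-vision"
    }
    fn is_available(&self) -> bool {
        // Resolved at compile time for the target platform.
        cfg!(target_os = "macos")
    }
}

/// Placeholder for a portable engine (hypothetical, none exists today).
pub struct PortableFallback;

impl OcrBackend for PortableFallback {
    fn name(&self) -> &'static str {
        "portable-fallback"
    }
    fn is_available(&self) -> bool {
        true
    }
}

/// Return the first available backend, in preference order.
pub fn pick_backend(candidates: &[Box<dyn OcrBackend>]) -> Option<&dyn OcrBackend> {
    for b in candidates {
        if b.is_available() {
            return Some(&**b);
        }
    }
    None
}
```

Listing the native backend first keeps the zero-download path on macOS, while other platforms fall through to the portable engine.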
