image-text-to-text
HF group: Multimodal · Status: 🟡 partial (OCR ✅, VLM ❌)
HF aliases: image-to-text, visual-question-answering.
Hugging Face lists these separately because of historical model
specialization (OCR-only vs caption-only vs VQA), but from a
tooling perspective they’re one modality: the model decides
which sub-task fits a given request.
What it is
Image + (optional) text prompt → text. Covers:
- Captioning — “describe this”.
- VQA — “how many cats are in this picture?”.
- OCR — “transcribe verbatim”.
- Document parsing — “extract the table” (overlaps with
document-question-answering).
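The four sub-tasks above differ only in the text prompt that accompanies the image, which is why they can be treated as one modality. A minimal sketch of that idea follows; the `Intent` type, its variants, and the prompt strings are hypothetical and not part of any existing crate.

```rust
/// Hypothetical intent type: one variant per sub-task listed above.
/// None of these names exist in the current crates.
#[derive(Debug, Clone, PartialEq)]
pub enum Intent {
    /// "describe this"
    Caption,
    /// Free-form question about the image.
    Vqa { question: String },
    /// "transcribe verbatim"
    Ocr,
    /// Layout-aware extraction, e.g. "extract the table".
    DocumentParse { instruction: String },
}

impl Intent {
    /// Text prompt to send alongside the image (illustrative wording only).
    pub fn prompt(&self) -> String {
        match self {
            Intent::Caption => "Describe this image.".to_string(),
            Intent::Vqa { question } => question.clone(),
            Intent::Ocr => "Transcribe the text in this image verbatim.".to_string(),
            Intent::DocumentParse { instruction } => instruction.clone(),
        }
    }

    /// True when an OCR specialist can serve the request without a general VLM.
    pub fn ocr_only(&self) -> bool {
        matches!(self, Intent::Ocr)
    }
}
```

With a shape like this, a caller builds one request type and only the `ocr_only` case needs a different backend.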
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| MiniCPM-V 3 B | 3 B | 2024 | Apache-2.0 | Lightweight VLM | 64-token visual rep — great for laptops. |
| Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Strong | Video input also. |
| Qwen2.5-VL-72B | 72 B | 2025 | Qwen license | Top open VLM | Heavy, M-series Pro/Max only. |
| Llama-3.2-Vision-11B | 11 B | 2024 | Llama community | Strong OCR + DocVQA | 128k context. |
| Gemma-3-Vision (variants) | 4-27 B | 2025 | Gemma | Strong | Multilingual. |
| Florence-2-large | 770 M | 2024 | MIT | Specialist (OCR, detection, caption) | Very fast for narrow tasks. |
| Apple Vision (`VNRecognizeTextRequest`) | n/a | macOS | Apple | Solid OCR | What `locara-vision-ocr` wraps today. |
Infrastructure required
Inference
- ✅ OCR specialist: `locara-vision-ocr` (macOS Vision native; zero RAM, no model download).
- ❌ General VLM crate (`locara-vlm`) for caption / VQA / document parsing; would be backed by llama.cpp’s `mtmd` path on Apple Silicon.
Input
- ❌ Image input pipeline in the SDK (file → bytes → decoded tensor). Biggest single missing rail — 9 modalities depend on it.
- File picker via `fs.user-selected`.
- Optional text prompt.
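The missing file → bytes → decoded tensor rail can be illustrated at its two ends. The sketch below is an assumption-laden outline, not the SDK's design: `sniff_format` and `rgb8_to_chw_f32` are hypothetical names, the actual PNG/JPEG decode step in between is left to a decoder that is not shown, and real models typically also require resizing and model-specific normalization.

```rust
/// Container formats this sketch recognizes from their leading bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ImageFormat {
    Png,
    Jpeg,
    Unknown,
}

/// Step 1 (bytes): identify the format from its magic number.
pub fn sniff_format(bytes: &[u8]) -> ImageFormat {
    const PNG_MAGIC: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    const JPEG_MAGIC: [u8; 3] = [0xFF, 0xD8, 0xFF];
    if bytes.starts_with(&PNG_MAGIC) {
        ImageFormat::Png
    } else if bytes.starts_with(&JPEG_MAGIC) {
        ImageFormat::Jpeg
    } else {
        ImageFormat::Unknown
    }
}

/// Step 3 (tensor): convert already-decoded interleaved RGB8 pixels (HWC)
/// into a planar CHW f32 buffer scaled to [0, 1].
/// Returns None if the buffer length does not match width * height * 3.
pub fn rgb8_to_chw_f32(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    let plane = width.checked_mul(height)?;
    if pixels.len() != plane.checked_mul(3)? {
        return None;
    }
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Channel c of pixel i lands in plane c at offset i.
            out[c * plane + i] = f32::from(px[c]) / 255.0;
        }
    }
    Some(out)
}
```

Keeping format detection and tensor layout as separate pure functions would let every image-consuming modality share them regardless of which decoder is eventually chosen.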
Output
- ✅ For OCR: structured text + bounding boxes via `ocr.from_bytes`.
- ❌ For VLM: streaming token Channel (same shape as `text-to-text`).
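To make the streaming shape concrete, here is a rough sketch of a token stream that a VLM backend could emit. It uses `std::sync::mpsc` as a stand-in for the real Channel type, which this document does not define; `StreamEvent`, `stream_tokens`, and `collect_text` are hypothetical names, and the token list stands in for actual model output.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical event type for a streamed response.
#[derive(Debug, PartialEq)]
pub enum StreamEvent {
    /// One decoded piece of text.
    Token(String),
    /// The model finished generating.
    Done,
}

/// Emit pre-computed tokens from a worker thread, followed by Done.
/// A real backend would send each token as the model produces it.
pub fn stream_tokens(tokens: Vec<String>) -> Receiver<StreamEvent> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for token in tokens {
            // Stop early if the receiver was dropped (e.g. the UI cancelled).
            if tx.send(StreamEvent::Token(token)).is_err() {
                return;
            }
        }
        let _ = tx.send(StreamEvent::Done);
    });
    rx
}

/// Consumer side: concatenate tokens until Done or until the sender closes.
pub fn collect_text(rx: Receiver<StreamEvent>) -> String {
    let mut text = String::new();
    for event in rx {
        match event {
            StreamEvent::Token(t) => text.push_str(&t),
            StreamEvent::Done => break,
        }
    }
    text
}
```

A UI would normally render each `Token` as it arrives rather than collecting the full string; `collect_text` is only there to show the consumer loop.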
Storage
- ✅ Weights via `locara-models::Cache`.
- App-side: extracted OCR results / captions persisted via `locara-storage`.
Interaction (IPC + SDK)
- ✅ IPC: `ocr.from_bytes` (built on `locara-vision-ocr`).
- ❌ IPC stub: `vlm.describe`, `vlm.ask` reserved in spec, not implemented.
- Picker UI for “OCR vs caption vs Q&A” intent: not yet a shared component.
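The current IPC status can be summarized as a small routing table: one method is live and two are reserved. The sketch below only encodes what this section states; the `Task` and `RouteError` types and the `route` function are hypothetical, while the method names `ocr.from_bytes`, `vlm.describe`, and `vlm.ask` come from the list above.

```rust
/// Hypothetical user intent as chosen in an "OCR vs caption vs Q&A" picker.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Task {
    Ocr,
    Describe,
    Ask,
}

/// Hypothetical error for methods that are reserved in the spec only.
#[derive(Debug, PartialEq)]
pub enum RouteError {
    NotImplemented(&'static str),
}

/// IPC method name associated with each task.
pub fn ipc_method(task: Task) -> &'static str {
    match task {
        Task::Ocr => "ocr.from_bytes",
        Task::Describe => "vlm.describe",
        Task::Ask => "vlm.ask",
    }
}

/// Return the method to call, or an error while the VLM methods are still stubs.
pub fn route(task: Task) -> Result<&'static str, RouteError> {
    match task {
        Task::Ocr => Ok(ipc_method(task)),
        Task::Describe | Task::Ask => Err(RouteError::NotImplemented(ipc_method(task))),
    }
}
```

A shared picker component could call something like `route` to disable the caption and Q&A options until the VLM methods are implemented.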
Capabilities (manifest)
- ✅ `capabilities.fs.user-selected: true` for file-input apps.
- ❌ `capabilities.device.camera` for live VLM apps (cool-down semantics not enforced).
- `capabilities.models[]` for the VLM model.
Gaps
- General VLM crate (`locara-vlm`), probably best backed by llama.cpp’s `mtmd` path for Apple Silicon.
- `vlm.describe` IPC stub exists in spec, not implemented.
- Cross-platform OCR fallback (current crate is macOS-only).
- Image input pipeline in the SDK.
See also
- `document-question-answering`: when layout matters
- `visual-document-retrieval`: VLM-based document search
- `video-text-to-text`: same models, video input
- `any-to-any`
- Crates: `locara-vision-ocr`
- Index: `../modalities-and-models-survey.md`