Locara

image-text-to-text

HF group: Multimodal · Status: 🟡 partial (OCR ✅, VLM ❌)

HF aliases: image-to-text, visual-question-answering. Hugging Face lists these separately because of historical model specialization (OCR-only vs caption-only vs VQA), but from a tooling perspective they’re one modality: the model you pick determines which sub-tasks it handles best.

What it is

Image + (optional) text prompt → text. Covers:

  • Captioning — “describe this”.
  • VQA — “how many cats are in this picture?”.
  • OCR — “transcribe verbatim”.
  • Document parsing — “extract the table” (overlaps with document-question-answering).
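
Since the four sub-tasks differ mainly in the prompt that accompanies the image, they can be modeled as one type. The following is a minimal sketch, not part of any Locara crate: the `Task` enum, its method names, and the prompt wording are all hypothetical and only illustrate the "one modality, several sub-tasks" framing.

```rust
/// Sub-tasks covered by the image-text-to-text modality.
/// Hypothetical type for illustration; not a Locara API.
#[derive(Debug, Clone, PartialEq)]
pub enum Task {
    Caption,
    Vqa { question: String },
    Ocr,
    DocumentParse { instruction: String },
}

impl Task {
    /// Text prompt sent alongside the image (wording is illustrative only).
    pub fn prompt(&self) -> String {
        match self {
            Task::Caption => "Describe this image.".to_string(),
            Task::Vqa { question } => question.clone(),
            Task::Ocr => "Transcribe all text in this image verbatim.".to_string(),
            Task::DocumentParse { instruction } => instruction.clone(),
        }
    }

    /// Whether an OCR specialist could serve the task without a general VLM.
    pub fn ocr_specialist_suffices(&self) -> bool {
        matches!(self, Task::Ocr)
    }
}
```

Only `Task::Ocr` can be routed to an OCR specialist; the other three variants need a general VLM.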

Open-weight models

Model | Params | Released | License | Quality | Notes
MiniCPM-V 3 B | 3 B | 2024 | Apache-2.0 | Lightweight VLM | 64-token visual rep; great for laptops.
Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Strong | Video input also.
Qwen2.5-VL-72B | 72 B | 2025 | Apache-2.0 | Top open VLM | Heavy, M-series Pro/Max only.
Llama-3.2-Vision-11B | 11 B | 2024 | Llama community | Strong OCR + DocVQA | 128k context.
Gemma-3-Vision (variants) | 4-26 B | 2025 | Gemma | Strong | Multilingual.
Florence-2-large | 770 M | 2024 | MIT | Specialist (OCR, detection, caption) | Very fast for narrow tasks.
Apple Vision (VNRecognizeTextRequest) | n/a | macOS | Apple | Solid OCR | What locara-vision-ocr wraps today.
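
The "great for laptops" and "Heavy" notes come down to how much memory the weights need. As a rough sketch, weight size is parameter count times bits per weight. The function names and the 4-bit quantization level used below are assumptions for illustration, and the estimate ignores the KV cache, activations, and the vision encoder's working memory.

```rust
/// Approximate size of the weights alone, in GiB, for a model with
/// `params_billions` parameters stored at `bits_per_weight` bits each.
/// Excludes KV cache, activations, and vision-encoder working memory.
pub fn weight_gib(params_billions: f64, bits_per_weight: f64) -> f64 {
    let bytes = params_billions * 1e9 * bits_per_weight / 8.0;
    bytes / (1024.0 * 1024.0 * 1024.0)
}

/// True if the weights fit in `ram_gib` while leaving `headroom_gib` free
/// for the OS and other apps (the headroom value is the caller's guess).
pub fn fits(params_billions: f64, bits_per_weight: f64, ram_gib: f64, headroom_gib: f64) -> bool {
    weight_gib(params_billions, bits_per_weight) + headroom_gib <= ram_gib
}
```

Under these assumptions, a 7 B model at 4 bits per weight needs about 3.3 GiB for weights, while a 72 B model at the same precision needs about 33.5 GiB, which is consistent with the table restricting it to higher-memory machines.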

Infrastructure required

Inference

  • ✅ OCR specialist: locara-vision-ocr (macOS Vision native — zero RAM, no model download).
  • ❌ General VLM crate (locara-vlm) for caption / VQA / document parsing; would likely be backed by llama.cpp’s mtmd path on Apple Silicon.

Input

  • Image input pipeline in the SDK (file → bytes → decoded tensor). Biggest single missing rail — 9 modalities depend on it.
  • File picker via fs.user-selected.
  • Optional text prompt.
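
The file → bytes → decoded tensor path can be sketched with the standard library alone. This is an illustration under stated assumptions, not the SDK's pipeline (which does not exist yet): all names are hypothetical, actual PNG/JPEG decoding is left to a decoder library and omitted, and the tensor layout (planar CHW, `f32` in [0, 1]) is one common choice rather than a requirement of any particular model.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ImageFormat {
    Png,
    Jpeg,
    Unknown,
}

/// Step 1: file -> bytes.
pub fn read_image_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Step 2: identify the container from its magic number before decoding.
pub fn sniff_format(bytes: &[u8]) -> ImageFormat {
    const PNG: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if bytes.starts_with(&PNG) {
        ImageFormat::Png
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        ImageFormat::Jpeg
    } else {
        ImageFormat::Unknown
    }
}

/// Step 3: decoded interleaved RGB8 (HWC) -> planar f32 (CHW) in [0, 1].
/// Returns None if the buffer length does not equal width * height * 3.
pub fn rgb8_to_chw_f32(pixels: &[u8], width: usize, height: usize) -> Option<Vec<f32>> {
    let plane = width.checked_mul(height)?;
    if pixels.len() != plane.checked_mul(3)? {
        return None;
    }
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in pixels.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Channel c of pixel i lands in plane c at offset i.
            out[c * plane + i] = px[c] as f32 / 255.0;
        }
    }
    Some(out)
}
```

Sniffing the format before decoding lets the pipeline reject unsupported files early, and the length check in step 3 guards against a decoder reporting dimensions that do not match its buffer.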

Output

  • ✅ For OCR: structured text + bounding boxes via ocr.from_bytes.
  • ❌ For VLM: streaming token Channel (same shape as text-to-text).
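
The two output shapes differ: OCR returns a finished, structured result, while a VLM would emit tokens incrementally. The sketch below illustrates both with hypothetical types. The `OcrLine` and `BoundingBox` fields are assumptions about what "structured text + bounding boxes" might contain, not the actual `ocr.from_bytes` payload, and `std::sync::mpsc` stands in for whatever Channel type the SDK uses. The sort assumes a top-left coordinate origin.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical bounding box; assumes a top-left origin.
#[derive(Debug, Clone, PartialEq)]
pub struct BoundingBox {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Hypothetical OCR result item: one recognized line and where it was found.
#[derive(Debug, Clone, PartialEq)]
pub struct OcrLine {
    pub text: String,
    pub bbox: BoundingBox,
}

/// Join recognized lines into plain text, top to bottom and then left to right.
pub fn plain_text(lines: &[OcrLine]) -> String {
    let mut sorted: Vec<&OcrLine> = lines.iter().collect();
    sorted.sort_by(|a, b| {
        a.bbox.y
            .total_cmp(&b.bbox.y)
            .then(a.bbox.x.total_cmp(&b.bbox.x))
    });
    sorted.iter().map(|l| l.text.as_str()).collect::<Vec<_>>().join("\n")
}

/// Stand-in for a VLM decode loop: sends tokens one at a time over a channel.
pub fn stream_tokens(tokens: Vec<String>) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for tok in tokens {
            // A send error means the receiver was dropped, so stop generating.
            if tx.send(tok).is_err() {
                break;
            }
        }
    });
    rx
}
```

A consumer can iterate over the receiver and render tokens as they arrive; iteration ends when the producing thread finishes and drops the sender.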

Storage

  • ✅ Weights via locara-models::Cache.
  • App-side: extracted OCR results / captions persisted via locara-storage.

Interaction (IPC + SDK)

  • ✅ IPC: ocr.from_bytes (built on locara-vision-ocr).
  • ❌ IPC stub: vlm.describe, vlm.ask reserved in spec, not implemented.
  • Picker UI for “OCR vs caption vs Q&A” intent — not yet a shared component.
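
The current IPC surface can be summarized as a small dispatch table. This is a sketch only: the method names `ocr.from_bytes`, `vlm.describe`, and `vlm.ask` come from this page, but the `Intent` and `IpcError` types and both functions are hypothetical and do not reflect how Locara's IPC layer is actually implemented.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum IpcError {
    NotImplemented(&'static str),
    UnknownMethod(String),
}

impl fmt::Display for IpcError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            IpcError::NotImplemented(m) => {
                write!(f, "{m} is reserved in the spec but not implemented")
            }
            IpcError::UnknownMethod(m) => write!(f, "unknown IPC method: {m}"),
        }
    }
}

/// What a shared "OCR vs caption vs Q&A" picker might emit.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Intent {
    Ocr,
    Caption,
    Question,
}

/// Map a picker intent to the IPC method named on this page.
pub fn method_for(intent: Intent) -> &'static str {
    match intent {
        Intent::Ocr => "ocr.from_bytes",
        Intent::Caption => "vlm.describe",
        Intent::Question => "vlm.ask",
    }
}

/// Resolve a method to the crate that serves it, per the status listed above.
pub fn backend_for(method: &str) -> Result<&'static str, IpcError> {
    match method {
        "ocr.from_bytes" => Ok("locara-vision-ocr"),
        "vlm.describe" => Err(IpcError::NotImplemented("vlm.describe")),
        "vlm.ask" => Err(IpcError::NotImplemented("vlm.ask")),
        other => Err(IpcError::UnknownMethod(other.to_string())),
    }
}
```

Separating "reserved but not implemented" from "unknown" lets an app show a more useful message for the VLM stubs than for a mistyped method name.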

Capabilities (manifest)

  • capabilities.fs.user-selected: true for file-input apps.
  • capabilities.device.camera for live VLM apps (cool-down semantics not enforced).
  • capabilities.models[] for the VLM model.
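
One way to reason about these three manifest keys is as a pre-flight check: given where the image comes from and whether a VLM is needed, which capabilities are missing? The sketch below is hypothetical throughout. The manifest's real schema and file format are not shown on this page, so the struct simply mirrors the key names, and `capabilities.device.camera` is modeled as a plain boolean as an assumption.

```rust
/// Hypothetical in-memory view of the manifest's capabilities section.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Capabilities {
    pub fs_user_selected: bool, // capabilities.fs.user-selected
    pub device_camera: bool,    // capabilities.device.camera (assumed boolean)
    pub models: Vec<String>,    // capabilities.models[]
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Source {
    UserSelectedFile,
    LiveCamera,
}

/// List the manifest keys an app would still need to declare.
/// `model` is Some(name) when a VLM is required, None for OCR-only use.
pub fn missing_capabilities(
    caps: &Capabilities,
    source: Source,
    model: Option<&str>,
) -> Vec<&'static str> {
    let mut missing = Vec::new();
    match source {
        Source::UserSelectedFile if !caps.fs_user_selected => {
            missing.push("capabilities.fs.user-selected");
        }
        Source::LiveCamera if !caps.device_camera => {
            missing.push("capabilities.device.camera");
        }
        _ => {}
    }
    if let Some(name) = model {
        if !caps.models.iter().any(|m| m == name) {
            missing.push("capabilities.models[]");
        }
    }
    missing
}
```

An OCR-only app that reads user-selected files needs just the first key; a live-camera VLM app needs the camera capability plus a model entry.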

Gaps

  • General VLM crate (locara-vlm), probably best backed by llama.cpp’s mtmd path on Apple Silicon.
  • vlm.describe IPC stub exists in spec, not implemented.
  • Cross-platform OCR fallback (current crate is macOS-only).
  • Image input pipeline in the SDK.
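
For the cross-platform OCR gap, one possible shape is compile-time backend selection. The sketch below is an assumption about how such a fallback could be wired, not a description of locara-vision-ocr: the `OcrBackend` enum and `select_ocr_backend` function are hypothetical, and no specific fallback engine is implied.

```rust
/// Hypothetical set of OCR backends.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum OcrBackend {
    /// macOS Vision framework, as wrapped today.
    AppleVision,
    /// Placeholder for a not-yet-chosen engine on other platforms.
    Fallback,
}

/// Pick a backend based on the compilation target.
pub fn select_ocr_backend() -> OcrBackend {
    if cfg!(target_os = "macos") {
        OcrBackend::AppleVision
    } else {
        OcrBackend::Fallback
    }
}
```

Because `cfg!` is evaluated at compile time, the macOS build keeps using the Vision path and other targets get the fallback variant without a runtime check.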

See also