Locara

document-question-answering

HF group: Multimodal · Status: ❌ not built

What it is

PDF / scanned page + question → answer. Distinct from generic image-text-to-text because layout, tables, and stamps matter. Highly relevant for the DocVault reference app.

Open-weight models

| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| Donut | 200 M | 2022 | MIT | OCR-free, fine-tunable | Encoder-decoder; produces structured JSON. |
| LayoutLMv3 | 130-340 M | 2022 | CC BY-NC-SA 4.0 (released weights; non-commercial) | DocVQA SOTA among open small models (ANLS 83.4) | OCR + layout multimodal pipeline. |
| Pix2Struct | 280 M | 2023 | Apache-2.0 | Strong on infographics | High-res image-to-text. |
| Idefics2 | 8 B | 2024 | Apache-2.0 | End-to-end, no OCR pipeline | VLM-class; better than LayoutLM on free-form questions. |
| Qwen2.5-VL-7B | 7 B | 2025 | Apache-2.0 | Top end-to-end open | Same model as image-text-to-text. |

Infrastructure required

Inference

  • ❌ Encoder-decoder runtime (Donut, Pix2Struct) OR VLM runtime (Idefics2, Qwen2.5-VL). Neither built.
  • 🟡 Compose-with-OCR fallback works today: locara-vision-ocr → text-to-text LLM.
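
A minimal sketch of the compose-with-OCR fallback, assuming the OCR step yields plain text per page and the LLM accepts a single text prompt. The OcrPage shape, buildOcrPrompt, and the maxChars budget are hypothetical names for illustration, not part of any existing Locara API. Note that this path flattens the page to text, so layout, tables, and stamps are lost.

```typescript
// Hypothetical shape for one OCR'd page; not a published Locara type.
interface OcrPage {
  pageNumber: number; // 1-based page index
  text: string;       // plain text recovered by the OCR step
}

// Build one text prompt from OCR output plus the user's question.
// maxChars is an assumed, crude character budget so that long documents
// do not overflow the downstream LLM's context window.
function buildOcrPrompt(pages: readonly OcrPage[], question: string, maxChars = 8000): string {
  const parts: string[] = [];
  let used = 0;
  for (const page of pages) {
    const block = `[Page ${page.pageNumber}]\n${page.text.trim()}`;
    if (used + block.length > maxChars) break; // stop before exceeding the budget
    parts.push(block);
    used += block.length;
  }
  return [
    "Answer the question using only the document text below.",
    "If the answer is not in the text, say so.",
    "",
    parts.join("\n\n"),
    "",
    `Question: ${question}`,
    "Answer:",
  ].join("\n");
}
```

The page markers let the LLM cite a page number in its answer; pages past the budget are dropped, which is one reason long documents are better served by a retrieval step first.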

Input

  • PDF rasterizer crate (e.g. pdfium-render), producing one image per page. This is the largest infrastructure requirement unique to this modality.
  • File picker via fs.user-selected.
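
Rasterizing one image per page means choosing a pixel size for each page. The sketch below shows only that arithmetic, using the PDF default of 72 points per inch. The rasterSize helper and the maxSidePx clamp of 2048 are assumptions for illustration; the actual rasterizer call is not shown, and the right resolution depends on the chosen model.

```typescript
const POINTS_PER_INCH = 72; // PDF default user-space unit: 1 point = 1/72 inch

interface PageSizePt { widthPt: number; heightPt: number; }
interface RasterSize { widthPx: number; heightPx: number; scale: number; }

// Convert a page size in points to a pixel size at the requested DPI,
// shrinking uniformly if the longest side would exceed maxSidePx
// (an assumed cap to bound memory use and model input size).
function rasterSize(page: PageSizePt, dpi: number, maxSidePx = 2048): RasterSize {
  if (dpi <= 0) throw new RangeError("dpi must be positive");
  let scale = dpi / POINTS_PER_INCH;
  const longestPx = Math.max(page.widthPt, page.heightPt) * scale;
  if (longestPx > maxSidePx) {
    scale *= maxSidePx / longestPx; // keep aspect ratio while clamping
  }
  return {
    widthPx: Math.round(page.widthPt * scale),
    heightPx: Math.round(page.heightPt * scale),
    scale,
  };
}
```

For a US Letter page (612 × 792 pt), 144 DPI gives 1224 × 1584 px; at 300 DPI the unclamped height would be 3300 px, so the clamp reduces it to 2048 px.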

Output

  • Streaming token Channel for free-form answers, structured JSON for fielded extraction (Donut returns JSON).
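
One way to carry both output styles over a single channel is a tagged union of events. This is a sketch under assumptions: the DocAnswerEvent variants and collectAnswer are hypothetical and do not describe the actual Channel payload.

```typescript
// Hypothetical event shapes for a doc-QA response stream.
type DocAnswerEvent =
  | { kind: "token"; text: string }                    // free-form answer fragment
  | { kind: "fields"; fields: Record<string, string> } // structured extraction result
  | { kind: "done" };                                  // end of stream

interface DocAnswer {
  text: string;
  fields: Record<string, string>;
}

// Fold a sequence of events into one final answer.
// Tokens are concatenated in arrival order; later field events override
// earlier values for the same key; events after "done" are ignored.
function collectAnswer(events: readonly DocAnswerEvent[]): DocAnswer {
  let text = "";
  let fields: Record<string, string> = {};
  for (const ev of events) {
    switch (ev.kind) {
      case "token":
        text += ev.text;
        break;
      case "fields":
        fields = { ...fields, ...ev.fields };
        break;
      case "done":
        return { text, fields };
    }
  }
  return { text, fields };
}
```

A UI would normally render token events as they arrive rather than collecting them; the fold is shown because it is what the app would persist once the stream ends.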

Storage

  • ❌ Weights cache.
  • App-side: extracted answers + page snapshots persisted via locara-storage.

Interaction (IPC + SDK)

  • doc.ask({ path, question }) IPC.
  • Page-level chunking + multi-page reasoning typically routed via visual-document-retrieval first.
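
A sketch of retrieval-first routing, assuming the retrieval step returns a relevance score per page. DocAskRequest mirrors the { path, question } arguments above; ScoredPage, selectPages, planAsk, and the default k = 3 are hypothetical.

```typescript
interface DocAskRequest { path: string; question: string; }

// Assumed output of a visual-document-retrieval step: one score per page.
interface ScoredPage { pageNumber: number; score: number; }

interface DocAskPlan extends DocAskRequest { pages: number[]; }

// Pick the k highest-scoring pages (ties broken by lower page number),
// then return them in reading order so the model sees pages in sequence.
function selectPages(scored: readonly ScoredPage[], k: number): number[] {
  return scored
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score || a.pageNumber - b.pageNumber)
    .slice(0, Math.max(0, k))
    .map((p) => p.pageNumber)
    .sort((a, b) => a - b);
}

// Attach the selected pages to the request before it reaches the model.
function planAsk(req: DocAskRequest, scored: readonly ScoredPage[], k = 3): DocAskPlan {
  return { ...req, pages: selectPages(scored, k) };
}
```

Restoring reading order after ranking matters when an answer spans consecutive pages, such as a table that continues across a page break.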

Capabilities (manifest)

  • capabilities.fs.user-selected: true for the PDF.
  • capabilities.models[] for the DocVQA model.
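
A sketch of checking a manifest for the two capabilities above. Only the keys capabilities.fs.user-selected and capabilities.models[] come from these notes; the Manifest type, the assumption that models[] holds plain string identifiers, and missingCapabilities are hypothetical.

```typescript
// Hypothetical manifest shape; the real schema may differ.
interface Manifest {
  capabilities?: {
    fs?: { "user-selected"?: boolean };
    models?: string[]; // assumed: a list of model identifiers
  };
}

// Return the capability declarations a doc-QA app would still need to add.
// An empty array means the manifest declares everything required.
function missingCapabilities(manifest: Manifest, modelId: string): string[] {
  const missing: string[] = [];
  if (manifest.capabilities?.fs?.["user-selected"] !== true) {
    missing.push("capabilities.fs.user-selected");
  }
  const models = manifest.capabilities?.models ?? [];
  if (models.indexOf(modelId) === -1) {
    missing.push(`capabilities.models[] (${modelId})`);
  }
  return missing;
}
```

Running a check like this at build time would surface a missing declaration before the first doc.ask call fails at runtime.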

Gaps

  • PDF rasterizer crate (shared with visual-document-retrieval).
  • Encoder-decoder OR VLM runtime.
  • DocVault reference app would benefit directly.

See also