`document-question-answering`

HF group: Multimodal · Status: ❌ not built

What it is

PDF / scanned page + question → answer. Distinct from generic image-text-to-text because layout, tables, and stamps matter. Highly relevant for the DocVault reference app.

Open-weight models

Model	Params	Released	License	Quality	Notes
Donut	200 M	2022	MIT	OCR-free, fine-tunable	Encoder-decoder; produces structured JSON.
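LayoutLMv3	133-368 M	2022	CC BY-NC-SA 4.0	Highest DocVQA score among the small open models listed here (ANLS 83.4, large variant)	OCR + layout multimodal pipeline; weights are licensed for non-commercial use only.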
Pix2Struct	280 M	2023	Apache-2.0	Strong on infographics	High-res image-to-text.
Idefics2	8 B	2024	Apache-2.0	End-to-end, no OCR pipeline	VLM-class; better than LayoutLM on free-form Qs.
Qwen2.5-VL-7B	7 B	2025	Apache-2.0	Strongest end-to-end open option listed here	Same model as used for image-text-to-text.
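
The ANLS figure in the table is the standard DocVQA metric (Average Normalized Levenshtein Similarity). As a rough sketch of how it is usually computed: each prediction is compared with every accepted answer, a normalized edit distance at or above a threshold (commonly 0.5) scores zero, and the per-question scores are averaged. The function names and the lowercase/trim normalization below are illustrative assumptions, not part of any Locara crate. Published numbers such as 83.4 are this value multiplied by 100.

```typescript
// Levenshtein edit distance via the classic two-row dynamic program.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Score for one question: best similarity over all accepted answers.
// Normalization (trim + lowercase) is an assumption of this sketch.
function anlsScore(prediction: string, accepted: string[], tau: number = 0.5): number {
  const norm = (s: string) => s.trim().toLowerCase();
  const p = norm(prediction);
  let best = 0;
  for (const answer of accepted) {
    const g = norm(answer);
    const maxLen = Math.max(p.length, g.length);
    const nl = maxLen === 0 ? 0 : levenshtein(p, g) / maxLen;
    // Distances at or above the threshold count as a complete miss.
    best = Math.max(best, nl < tau ? 1 - nl : 0);
  }
  return best;
}

// Dataset-level ANLS in the range [0, 1].
function anls(samples: { prediction: string; accepted: string[] }[]): number {
  if (samples.length === 0) return 0;
  let total = 0;
  for (const s of samples) total += anlsScore(s.prediction, s.accepted);
  return total / samples.length;
}
```

The threshold is what separates ANLS from plain edit similarity: a near miss caused by an OCR slip still earns partial credit, while an unrelated answer earns nothing.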

Infrastructure required

Inference

❌ Encoder-decoder runtime (Donut, Pix2Struct) OR VLM runtime (Idefics2, Qwen2.5-VL). Neither built.
🟡 Compose-with-OCR fallback works today: locara-vision-ocr → text-to-text LLM.
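
A minimal sketch of the glue step in that fallback, assuming the OCR stage yields one plain-text string per page. `buildOcrPrompt`, the page markers, and the character budget are hypothetical; the actual locara-vision-ocr output shape and the LLM prompt format are not specified here.

```typescript
// Hypothetical helper: turn per-page OCR text into one prompt for a text-to-text LLM.
function buildOcrPrompt(pageTexts: string[], question: string, maxChars: number = 8000): string {
  const pages = pageTexts
    .map((text, i) => `[page ${i + 1}]\n${text.trim()}`)
    .join("\n\n");
  // Crude character budget; a real implementation would count tokens instead.
  const context = pages.length > maxChars ? pages.slice(0, maxChars) : pages;
  return [
    "Answer the question using only the document text below.",
    "If the answer is not present, reply \"not found\".",
    "",
    context,
    "",
    `Question: ${question}`,
    "Answer:",
  ].join("\n");
}
```

Note that this path flattens the page to text, so layout, tables, and stamps are lost before the LLM sees anything. That loss is the reason it is a fallback and not the target design.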

Input

❌ PDF rasterizer crate (e.g. pdfium-render) — one image per page. Biggest unique infrastructure ask for this modality.
File picker via fs.user-selected.
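
When sizing the rasterizer output, it helps to know how large each page image becomes. The sketch below assumes the default PDF unit of 1/72 inch and an uncompressed RGBA buffer. `rasterSize` is an illustrative helper, not an API of pdfium-render or any Locara crate.

```typescript
const POINTS_PER_INCH = 72; // default PDF user-space unit is 1/72 inch

interface RasterSize {
  width: number;     // pixels
  height: number;    // pixels
  rgbaBytes: number; // uncompressed buffer size at 4 bytes per pixel
}

// Convert a page size in PDF points to pixel dimensions at a given DPI.
function rasterSize(widthPt: number, heightPt: number, dpi: number): RasterSize {
  const width = Math.round((widthPt * dpi) / POINTS_PER_INCH);
  const height = Math.round((heightPt * dpi) / POINTS_PER_INCH);
  return { width, height, rgbaBytes: width * height * 4 };
}

// US Letter (612 x 792 pt) at 150 DPI
const letter = rasterSize(612, 792, 150);
// letter.width === 1275, letter.height === 1650, letter.rgbaBytes === 8415000
```

At roughly 8 MB per Letter page at 150 DPI, a long PDF is better rasterized page by page on demand than all at once.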

Output

Streaming token Channel for free-form answers, structured JSON for fielded extraction (Donut returns JSON).
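
One way to picture the two output modes is as a single tagged event type that a client folds into a final answer. The event names and shapes below are hypothetical; the real Channel payload is not defined in this document.

```typescript
// Hypothetical event shapes for the two output modes described above.
type DocAnswerEvent =
  | { kind: "token"; text: string }                      // one streamed chunk of a free-form answer
  | { kind: "fields"; fields: Record<string, unknown> }; // one structured result (Donut-style JSON)

interface DocAnswer {
  text: string;
  fields: Record<string, unknown> | null;
}

// Fold a sequence of events into a final answer object.
function collectAnswer(events: DocAnswerEvent[]): DocAnswer {
  const answer: DocAnswer = { text: "", fields: null };
  for (const event of events) {
    if (event.kind === "token") {
      answer.text += event.text; // free-form answers arrive incrementally
    } else {
      answer.fields = event.fields; // fielded extraction arrives as one object
    }
  }
  return answer;
}
```

A tagged union keeps one client code path for both encoder-decoder models that emit JSON and VLMs that stream text.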

Storage

❌ Weights cache.
App-side: extracted answers + page snapshots persisted via locara-storage.

Interaction (IPC + SDK)

❌ doc.ask({ path, question }) IPC.
Page-level chunking + multi-page reasoning typically routed via visual-document-retrieval first.
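
The routing step can be sketched as follows, assuming visual-document-retrieval returns one relevance score per page. Apart from the `{ path, question }` request shape, every name here (`DocAskPlan`, `selectPages`, `planAsk`, the `k` parameter) is hypothetical.

```typescript
// Request shape taken from the doc.ask({ path, question }) call above.
interface DocAskRequest {
  path: string;
  question: string;
}

// Hypothetical plan: the request plus the page indices the model should read.
interface DocAskPlan {
  path: string;
  question: string;
  pages: number[];
}

// Keep the k highest-scoring pages, then restore reading order.
function selectPages(scores: number[], k: number): number[] {
  return scores
    .map((score, page) => ({ score, page }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.page)
    .sort((a, b) => a - b);
}

function planAsk(req: DocAskRequest, scores: number[], k: number): DocAskPlan {
  return { path: req.path, question: req.question, pages: selectPages(scores, k) };
}
```

Restoring reading order after the top-k cut keeps multi-page reasoning coherent, since the model then sees the selected pages in document sequence.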

Capabilities (manifest)

capabilities.fs.user-selected: true for the PDF.
capabilities.models[] for the DocVQA model.

Gaps

PDF rasterizer crate (shared with visual-document-retrieval).
Encoder-decoder OR VLM runtime.
DocVault reference app would benefit directly.

See also

image-text-to-text
visual-document-retrieval
Crate: locara-vision-ocr (OCR fallback)
Index: ../modalities-and-models-survey.md