image-to-text
HF alias: see image-text-to-text.
Hugging Face lists image-to-text separately because of historical
caption-only models (BLIP, GIT). In modern practice the same VLMs
handle captioning, VQA, OCR, and document parsing, so Locara
collapses all of these tasks under image-text-to-text.
The “no text input” case (pure captioning) is handled by having the SDK pass an empty prompt; the model class, crate, and IPC are the same as for any other image-text-to-text request.
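The aliasing and empty-prompt behavior described above can be sketched roughly as follows. This is a minimal illustration, not Locara's actual SDK: the names `TASK_ALIASES`, `VlmRequest`, `resolve_task`, and `build_request` are hypothetical, and the request shape is an assumption made only to show that a caption request and a VQA request differ in nothing but the prompt.

```python
from dataclasses import dataclass

# Hypothetical alias table: the Hugging Face task name maps onto the
# canonical task it is collapsed under.
TASK_ALIASES = {"image-to-text": "image-text-to-text"}


@dataclass(frozen=True)
class VlmRequest:
    """Illustrative request shape (an assumption, not the real wire format)."""
    task: str
    image: bytes
    prompt: str


def resolve_task(task: str) -> str:
    """Return the canonical task name, following the alias if one exists."""
    return TASK_ALIASES.get(task, task)


def build_request(task: str, image: bytes, prompt: str = "") -> VlmRequest:
    """Build a request; pure captioning is just the empty-prompt default."""
    return VlmRequest(task=resolve_task(task), image=image, prompt=prompt)


# Pure captioning: alias task name, no prompt supplied.
caption = build_request("image-to-text", b"<image bytes>")

# VQA: canonical task name with an explicit question.
vqa = build_request("image-text-to-text", b"<image bytes>", "What is on the table?")
```

In this sketch both requests resolve to the same task, so nothing downstream needs a separate code path for captioning.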