`image-to-text`

Hugging Face alias; see `image-text-to-text`.

Hugging Face lists `image-to-text` as a separate task because of historical caption-only models (BLIP, GIT). In modern practice the same VLMs handle captioning, VQA, OCR, and document parsing, so Locara collapses them all under `image-text-to-text`.

The “no text input” case (pure captioning) is a special case in which the SDK passes an empty prompt; it uses the same model class, the same crate, and the same IPC.
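
The aliasing and the empty-prompt special case can be sketched roughly as below. This is a minimal illustration, not Locara's actual SDK: `VlmRequest`, `resolve_task`, and `build_request` are hypothetical names invented for the example, and the real request shape may differ.

```rust
// Hypothetical sketch: none of these names come from the Locara SDK.

/// Canonical task identifier that both task names resolve to.
const CANONICAL_TASK: &str = "image-text-to-text";

/// Illustrative request shape (assumed, not the real IPC message).
#[derive(Debug, PartialEq)]
struct VlmRequest {
    task: &'static str,
    image: Vec<u8>,
    prompt: String,
}

/// Resolve a task name, treating `image-to-text` as an alias.
fn resolve_task(name: &str) -> Option<&'static str> {
    match name {
        "image-to-text" | "image-text-to-text" => Some(CANONICAL_TASK),
        _ => None,
    }
}

/// Build a request; a missing prompt becomes an empty string (pure captioning).
fn build_request(task: &str, image: Vec<u8>, prompt: Option<&str>) -> Option<VlmRequest> {
    let task = resolve_task(task)?;
    Some(VlmRequest {
        task,
        image,
        prompt: prompt.unwrap_or("").to_string(),
    })
}

fn main() {
    // Caption-only call: no prompt supplied, so the request carries "".
    let caption = build_request("image-to-text", vec![0u8; 4], None).unwrap();
    // VQA-style call: same request type, same canonical task, non-empty prompt.
    let vqa = build_request("image-text-to-text", vec![0u8; 4], Some("What is shown?")).unwrap();

    println!("{} bytes, task={}, prompt={:?}", caption.image.len(), caption.task, caption.prompt);
    println!("{} bytes, task={}, prompt={:?}", vqa.image.len(), vqa.task, vqa.prompt);
}
```

Under these assumptions, both calls produce the same request type with the same canonical task, so everything downstream sees only one code path; captioning differs solely in the prompt being empty.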