`image-text-to-video`

HF group: Multimodal · Status: ❌ not built

What it is

Image + text prompt → video: animate a still image according to an instruction (e.g. “make this person wave”). Distinct from image-to-video, which takes no text instruction.

Open-weight models

Model	Params	Released	License	Quality	Notes
CogVideoX-5B-I2V	5 B	2024	CogVideoX License	Best Mac-runnable I2V	Follows the text prompt for motion better than Stable Video Diffusion (SVD).
CogVideoX1.5-5B-I2V	5 B	2024	CogVideoX License	10 s clips, any resolution	Higher fidelity than CogVideoX 1.0.
Wan 2.2 I2V	~10 B	2025	Apache-2.0	Strong	Cloud-GPU friendly.
HunyuanVideo I2V	13 B	2025	Tencent Hunyuan Community License	Top quality	Needs 24 GB+ VRAM.

Infrastructure required

Inference

❌ Diffusion runtime + video diffusion specifics.

Input

❌ Image input pipeline.
Text prompt.

Output

❌ Video file save (10-100 MB clips). Buffer locally → atomic move into fs.user-folder.
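
The "buffer locally → atomic move" step could look roughly like the sketch below. It assumes a Node.js-style runtime and that the destination folder is reachable as an ordinary path; `saveClipAtomically` and its parameters are hypothetical names, not part of any existing SDK. Because a rename is only atomic within one filesystem, the sketch first stages the clip inside the destination folder and then renames it into place.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/**
 * Hypothetical helper: publish a locally buffered clip into destDir so that
 * readers never observe a partially written file.
 */
function saveClipAtomically(bufferedPath: string, destDir: string, fileName: string): string {
  const finalPath = path.join(destDir, fileName);
  // Stage inside the destination folder: rename() is only atomic when the
  // source and target live on the same filesystem.
  const stagedPath = path.join(destDir, `.${fileName}.${process.pid}.partial`);
  try {
    fs.copyFileSync(bufferedPath, stagedPath); // slow part: 10-100 MB copy
    fs.renameSync(stagedPath, finalPath);      // fast part: atomic publish
  } catch (err) {
    fs.rmSync(stagedPath, { force: true });    // do not leave partial files behind
    throw err;
  }
  fs.rmSync(bufferedPath, { force: true });    // drop the local buffer copy
  return finalPath;
}
```

A caller would pass the temporary file produced by the encoder, the resolved user folder, and the final file name. If the copy fails midway, only the hidden `.partial` file is affected and it is cleaned up.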

Storage

❌ Weights cache.
Output: fs.user-folder.

Interaction (IPC + SDK)

❌ video.animate({ image, prompt }) IPC with long-running task progress (generation takes 30 s – 5 min).
❌ Cancel + resume semantics.
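
One possible shape for the long-running task behind video.animate({ image, prompt }) is sketched below. Everything here is an assumption for illustration: the type names, the three task states, and the step-based progress model are hypothetical, and the real runtime would drive the steps itself rather than expose `tick()`. Resume is deliberately left out, since its semantics are still open.

```typescript
// Hypothetical shapes; none of these names come from an existing SDK.
type TaskState = "running" | "cancelled" | "done";

interface AnimateRequest {
  image: string;  // path to the user-selected still
  prompt: string; // motion instruction, e.g. "make this person wave"
}

interface AnimateProgress {
  step: number;
  totalSteps: number;
  fraction: number; // completed share of the run, between 0 and 1
}

class AnimateTask {
  state: TaskState = "running";
  request: AnimateRequest;
  private step = 0;
  private totalSteps: number;
  private onProgress: (p: AnimateProgress) => void;

  constructor(request: AnimateRequest, totalSteps: number, onProgress: (p: AnimateProgress) => void) {
    this.request = request;
    this.totalSteps = totalSteps;
    this.onProgress = onProgress;
  }

  /** Advance one generation step and report progress to the caller. */
  tick(): void {
    if (this.state !== "running") return; // cancelled or finished tasks do no further work
    this.step += 1;
    this.onProgress({ step: this.step, totalSteps: this.totalSteps, fraction: this.step / this.totalSteps });
    if (this.step === this.totalSteps) this.state = "done";
  }

  /** Request cancellation; it takes effect before the next step runs. */
  cancel(): void {
    if (this.state === "running") this.state = "cancelled";
  }
}
```

Checking for cancellation between steps keeps the cancel latency bounded by the duration of one step, which matters when a full run takes minutes.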

Capabilities (manifest)

capabilities.fs.user-selected for input.
capabilities.fs.user-folder for output.
capabilities.models[] for the I2V model.
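
As a rough illustration, the three capabilities above might be declared and checked as follows. Only the key names come from the list; the value types (booleans for the fs entries, model ids as plain strings), the `Manifest` interface, and the `missingCapabilities` helper are assumptions made for this sketch.

```typescript
// Hypothetical manifest shape; only the three capability keys come from the list above.
interface Manifest {
  capabilities: {
    fs: { "user-selected": boolean; "user-folder": boolean };
    models: string[];
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: {
      "user-selected": true, // read the still the user picked
      "user-folder": true,   // write the generated clip
    },
    models: ["CogVideoX-5B-I2V"], // the I2V model this task would load
  },
};

/** List the capability keys a manifest is missing for this task. */
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs["user-selected"]) missing.push("capabilities.fs.user-selected");
  if (!m.capabilities.fs["user-folder"]) missing.push("capabilities.fs.user-folder");
  if (m.capabilities.models.length === 0) missing.push("capabilities.models[]");
  return missing;
}
```

A host could run a check like this before accepting a video.animate call and reject the request with the list of missing keys.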

Gaps

Same as text-to-video: diffusion runtime, image input pipeline, video output IPC, long-running task progress IPC.
