image-text-to-video
HF group: Multimodal · Status: ❌ not built
What it is
Image + text prompt → video. Animate a still image by following a text
instruction (e.g. “make this person wave”). Distinct from
image-to-video, which takes no text instruction.
Open-weight models
| Model | Params | Released | License | Quality | Notes |
|---|---|---|---|---|---|
| CogVideoX-5B-I2V | 5 B | 2024 | CogVideoX License | Best Mac-runnable I2V | Follows text prompt for motion better than SVD. |
| CogVideoX1.5-5B-I2V | 5 B | 2024 | CogVideoX License | 10 s clips, any resolution | Higher fidelity than 1.0. |
| Wan 2.2 I2V | 14 B active (A14B, MoE) | 2025 | Apache-2.0 | Strong | Cloud-GPU friendly. |
| HunyuanVideo I2V | 13 B | 2025 | Tencent Hunyuan Community License | Top quality | 24 GB+ VRAM. |
Infrastructure required
Inference
- ❌ Diffusion runtime + video diffusion specifics.
Input
- ❌ Image input pipeline.
- Text prompt.
Output
- ❌ Video file save (10-100 MB clips). Buffer locally → atomic move into `fs.user-folder`.
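The buffer-then-move step could be sketched as below. This is a minimal illustration, not the built implementation: `saveClipAtomically` is a hypothetical helper, and it assumes a Node.js runtime. Because `rename` is only atomic within one filesystem, the sketch stages the partial file inside the destination folder; a buffer kept on a different volume would first need to be copied into such a staging file.

```typescript
import { writeFile, rename, rm } from "node:fs/promises";
import { randomUUID } from "node:crypto";
import * as path from "node:path";

// Hypothetical helper: write the clip to a hidden staging file, then rename it
// into place so readers never observe a half-written video.
async function saveClipAtomically(
  destDir: string,
  fileName: string,
  data: Uint8Array,
): Promise<string> {
  const finalPath = path.join(destDir, fileName);
  // Staging next to the destination keeps the rename on a single filesystem.
  const stagingPath = path.join(destDir, `.${fileName}.${randomUUID()}.partial`);
  try {
    await writeFile(stagingPath, data);
    await rename(stagingPath, finalPath);
  } catch (err) {
    // Remove the staging file if the write or the rename failed.
    await rm(stagingPath, { force: true });
    throw err;
  }
  return finalPath;
}
```

A caller would pass the resolved `fs.user-folder` path as `destDir`, for example `await saveClipAtomically(userFolder, "wave.mp4", bytes)`, where `userFolder` and `bytes` are placeholders.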
Storage
- ❌ Weights cache.
- Output: `fs.user-folder`.
Interaction (IPC + SDK)
- ❌ `video.animate({ image, prompt })` IPC with long-running task progress (generation takes 30 s – 5 min).
- ❌ Cancel + resume semantics.
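One possible shape for the progress and cancel behaviour is an async generator driven by an `AbortSignal`. The sketch below is illustrative only: `AnimateRequest`, `AnimateProgress`, `animate`, and `runDemo` are hypothetical names, the diffusion step is replaced by a short timer, and resume is not covered.

```typescript
// Hypothetical types; the real IPC payloads are not specified yet.
type AnimateRequest = { image: Uint8Array; prompt: string };
type AnimateProgress =
  | { kind: "progress"; step: number; totalSteps: number }
  | { kind: "done"; videoPath: string };

// Stand-in for the IPC call: yields one event per step and checks the
// AbortSignal before starting each step.
async function* animate(
  req: AnimateRequest,
  opts: { signal?: AbortSignal; totalSteps?: number } = {},
): AsyncGenerator<AnimateProgress> {
  if (req.prompt.trim() === "") throw new Error("animate: prompt is required");
  const totalSteps = opts.totalSteps ?? 4;
  for (let step = 1; step <= totalSteps; step++) {
    if (opts.signal?.aborted) throw new Error("animate: cancelled");
    // A real backend would run one denoising step here.
    await new Promise((resolve) => setTimeout(resolve, 1));
    yield { kind: "progress", step, totalSteps };
  }
  yield { kind: "done", videoPath: "clip.mp4" };
}

// Usage: consume progress events and cancel after the second step.
async function runDemo(): Promise<string[]> {
  const controller = new AbortController();
  const log: string[] = [];
  try {
    const request = { image: new Uint8Array(), prompt: "make this person wave" };
    for await (const ev of animate(request, { signal: controller.signal })) {
      if (ev.kind === "progress") {
        log.push(`${ev.step}/${ev.totalSteps}`);
        if (ev.step === 2) controller.abort(); // e.g. the user pressed Cancel
      } else {
        log.push(`done:${ev.videoPath}`);
      }
    }
  } catch (err) {
    log.push((err as Error).message);
  }
  return log;
}
```

Streaming events through one iterator keeps progress, completion, and cancellation on a single channel. A callback-based or event-emitter design would work equally well.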
Capabilities (manifest)
- `capabilities.fs.user-selected` for input.
- `capabilities.fs.user-folder` for output.
- `capabilities.models[]` for the I2V model.
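As a rough illustration, the three capabilities might appear in a manifest like the one below. Only the key names come from this page. The value shapes (booleans for the `fs` entries, model-id strings in `models`) and the `missingCapabilities` helper are assumptions made for the sketch.

```typescript
// Hypothetical manifest shape; only the three capability keys are from the spec.
interface Manifest {
  capabilities: {
    fs: { "user-selected"?: boolean; "user-folder"?: boolean };
    models: string[];
  };
}

const manifest: Manifest = {
  capabilities: {
    fs: { "user-selected": true, "user-folder": true },
    models: ["CogVideoX-5B-I2V"], // assumed id format
  },
};

// Lists the capabilities this task needs that the manifest does not declare.
function missingCapabilities(m: Manifest): string[] {
  const missing: string[] = [];
  if (!m.capabilities.fs["user-selected"]) missing.push("capabilities.fs.user-selected");
  if (!m.capabilities.fs["user-folder"]) missing.push("capabilities.fs.user-folder");
  if (m.capabilities.models.length === 0) missing.push("capabilities.models[]");
  return missing;
}
```

A host could run such a check before exposing `video.animate`, rejecting apps whose manifest leaves any of the three entries out.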
Gaps
Same as text-to-video: diffusion runtime,
image input pipeline, video output IPC, long-running task
progress IPC.