Turn reference images into coherent AI video clips with Veo 3.1 and Wan 2.6
Reference to Video is built for visual consistency: upload multiple references, describe motion clearly, and quickly generate 8-second clips suitable for social, concept, and creative workflows.
Use one to three images as visual anchors so subjects, style, and key composition cues remain more consistent across the generated clip.
Runs on the Veo 3.1 fast generation profile, striking a practical balance between turnaround speed and stable motion quality in everyday production.
Choose Auto, 16:9, or 9:16 to match your distribution target, from landscape explainers to vertical short-form social formats.
Generated videos carry background audio as defined by the upstream model, so clips arrive ready for review without a separate audio pass.
Requests are submitted with translation enabled to improve prompt interpretation reliability for multilingual workflows and global teams.
A fixed 8-second output keeps timing predictable for iteration, storyboard tests, and quick side-by-side model comparisons.
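To make these settings concrete, here is a minimal TypeScript sketch of how a request using them might be shaped. The interface, field names, and endpoint-facing values are illustrative assumptions for this page, not the product's actual API.

```typescript
// Hypothetical request shape for this page's generation settings.
// Every name below is an assumption for illustration, not the real API.
interface ReferenceToVideoRequest {
  prompt: string;                         // motion and scene description
  referenceImages: string[];              // 1 to 3 uploaded image IDs or URLs
  aspectRatio: "auto" | "16:9" | "9:16";  // Auto, landscape, or vertical
  translate: boolean;                     // translation is on by default here
  durationSeconds: 8;                     // output length is fixed at 8 seconds
}

const exampleRequest: ReferenceToVideoRequest = {
  prompt: "Slow dolly-in on the subject as neon signs flicker to life.",
  referenceImages: ["ref-portrait.png", "ref-style.png"],
  aspectRatio: "9:16",
  translate: true,
  durationSeconds: 8,
};
```

Typing durationSeconds as the literal 8 mirrors the fixed duration: the compiler rejects any other value.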
Upload references, describe motion, and create a ready-to-review clip in minutes.
Add 1 to 3 reference images that define the subject, mood, and visual direction for your generated video.
Describe camera movement, subject behavior, and transition intent. Set the aspect ratio, then submit the generation.
Preview the 8-second result, inspect details in history, and download the clip for editing or publishing.
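As a rough illustration of that flow, the sketch below submits a request and polls until the clip is ready. The endpoint paths, job states, and response fields are assumptions, and it reuses the hypothetical ReferenceToVideoRequest shape from the earlier sketch.

```typescript
// Illustrative submit-and-poll flow; endpoints and fields are assumed.
async function generateClip(request: ReferenceToVideoRequest): Promise<string> {
  const submit = await fetch("/api/reference-to-video", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(request),
  });
  const { jobId } = await submit.json();

  // Poll until the fixed 8-second clip is ready for review.
  while (true) {
    const status = await (await fetch(`/api/jobs/${jobId}`)).json();
    if (status.state === "done") return status.videoUrl;   // ready to download
    if (status.state === "failed") throw new Error(status.reason);
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}
```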
You can upload 1 to 3 reference images. At least one image is required, and uploads of more than three images are rejected by validation.
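A client-side check mirroring this rule might look like the following sketch; the function name and message strings are illustrative.

```typescript
// Enforces the stated rule: at least one reference image, at most three.
function validateReferences(images: File[]): string | null {
  if (images.length < 1) return "At least one reference image is required.";
  if (images.length > 3) return "No more than three reference images are allowed.";
  return null; // within the 1-3 range, the upload is accepted
}
```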
This page uses Veo generation type REFERENCE_2_VIDEO with the fast Veo 3.1 model profile, optimized for guided reference-based motion generation.
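In code, the page might pin those two values like this; the field names and the fast-profile identifier are assumptions, while REFERENCE_2_VIDEO comes from the page itself.

```typescript
// Only the REFERENCE_2_VIDEO value is documented above; the field names
// and the "veo-3.1-fast" identifier are illustrative assumptions.
const PAGE_GENERATION_CONFIG = {
  generationType: "REFERENCE_2_VIDEO",
  modelProfile: "veo-3.1-fast",
} as const;
```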
Clip duration is not adjustable in this model setup: it is fixed at 8 seconds for predictable iteration speed and stable workflow behavior across repeated runs.
You can choose Auto, 16:9, or 9:16. Auto is convenient for quick tests, while explicit aspect ratios are better for production delivery targets.
Translation is enabled by default: requests are sent with translation on to keep prompt interpretation consistent when prompts are not originally written in English.
The model supports audio-capable output from the upstream pipeline. In rare sensitive scenarios, audio may still be suppressed by provider policy.
This model does not appear in other model lists. It is scoped to the dedicated Reference to Video page, so existing Text to Video and Image to Video model lists remain unchanged.
Use clear, high-quality references and precise motion prompts. Explicit camera verbs and scene intent usually improve consistency and reduce random drift.
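For instance, a motion prompt with explicit camera verbs and clear scene intent might read like this purely illustrative example:

```typescript
// Illustrative motion prompt: explicit camera verbs plus scene intent.
const motionPrompt =
  "Slow dolly-in on the potter's hands shaping wet clay, " +
  "then tilt up to her face as warm window light brightens; " +
  "hold the final framing for a beat before a gentle fade.";
```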