Prompting Seedance 2.0 with 12-File Multimodal References
How to actually use the 9 image, 3 video, 3 audio reference surface on reference-to-video: file weighting, mixed vs image only, and when to skip audio refs entirely.
Seedance 2.0's reference-to-video endpoint takes up to 12 files per call. That sounds generous until you try to balance them and realize the model doesn't follow all of them equally. This post shows you how to budget the 12, which references the model listens to most, and how to pick between image only and mixed modes.
The Rule of 12, restated clearly
The per call budget on bytedance/seedance-2.0/reference-to-video is fixed at 12 files total, within these per type caps:
- Up to 9 images, each under 30MB. JPG, PNG, or WebP.
- Up to 3 videos, 2 to 15 seconds combined, under 50MB combined. MP4 or MOV.
- Up to 3 audio tracks, under 15 seconds combined, each under 15MB. MP3 or WAV.
You can submit any subset. The endpoint does not require all three kinds. Image only calls are fine and should be your default for most production jobs.
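The caps above are easy to check locally before you upload anything. The following is a minimal sketch, not part of the fal client: `RefFile` and `checkReferenceBudget` are hypothetical names, and it assumes you already know each file's size and duration. It treats "under" as a strict limit, so a file sitting exactly on a cap is flagged.

```typescript
// Hypothetical descriptor for a local file you plan to send as a reference.
type RefKind = "image" | "video" | "audio";

interface RefFile {
  kind: RefKind;
  sizeMB: number;
  seconds?: number; // duration; only used for video and audio
}

// Returns human-readable problems; an empty array means the set fits the caps.
// "Under" is treated as strict, so a value exactly at a limit is flagged.
function checkReferenceBudget(files: RefFile[]): string[] {
  const problems: string[] = [];
  const ofKind = (kind: RefKind) => files.filter((f) => f.kind === kind);
  const sum = (xs: RefFile[], pick: (f: RefFile) => number) =>
    xs.reduce((acc, f) => acc + pick(f), 0);

  const images = ofKind("image");
  const videos = ofKind("video");
  const audios = ofKind("audio");

  if (files.length > 12) problems.push("more than 12 files in total");
  if (images.length > 9) problems.push("more than 9 images");
  if (images.some((f) => f.sizeMB >= 30)) problems.push("an image is 30MB or larger");

  if (videos.length > 3) problems.push("more than 3 videos");
  if (videos.length > 0) {
    const seconds = sum(videos, (f) => f.seconds ?? 0);
    if (seconds < 2 || seconds > 15) {
      problems.push("combined video length is outside 2 to 15 seconds");
    }
    if (sum(videos, (f) => f.sizeMB) >= 50) {
      problems.push("combined video size is 50MB or larger");
    }
  }

  if (audios.length > 3) problems.push("more than 3 audio tracks");
  if (audios.length > 0) {
    if (sum(audios, (f) => f.seconds ?? 0) >= 15) {
      problems.push("combined audio length is 15 seconds or longer");
    }
    if (audios.some((f) => f.sizeMB >= 15)) problems.push("an audio track is 15MB or larger");
  }
  return problems;
}
```

Run it on the file set you intend to send and only submit when the returned array is empty.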
How the model weights your references
From testing against the April 15 fal release, reference priority is roughly: video clips > audio clips > image set. A single 3 second video reference dominates the motion field more than any image stack. Once you introduce audio, it strongly influences timing and scene rhythm. Images set style, palette, and subject identity.

To hold a subject tightly across the clip, use 3 to 5 images of that subject from different angles. Varied angles beat many images of the same pose.
When to use image only
Most product shots. Most character turnarounds. Most style transfer work. If you can describe the motion in words and your job is pinning down who or what is on screen, feed 3 to 5 clean images and let the prompt handle action verbs. Image only calls also render faster because preprocessing is lighter.
When to bring video and audio in
Video reference earns its slot when motion is hard to describe in words: a specific camera arc, a dance move, a physics reaction like cloth wrapping, a sports form. Two to three seconds is enough. Longer clips dilute the signal.
Audio reference is for rhythm matching. Cutting to a beat and want motion on the downbeat? A short audio cue under 5 seconds helps. It also helps with footsteps, impacts, and weapon swings when the visual must match a known sound.
The image only prompt that ships clean
```typescript
import { fal } from "@fal-ai/client";

const result = await fal.subscribe("bytedance/seedance-2.0/reference-to-video", {
  input: {
    prompt: "The subject walks toward camera through a neon lit hallway, holding the reference product at chest height, slow push in, warm key light from the left.",
    reference_images: [
      "https://v3.fal.media/files/subject-front.jpg",
      "https://v3.fal.media/files/subject-three-quarter.jpg",
      "https://v3.fal.media/files/product-hero.jpg"
    ],
    resolution: "720p",
    duration: 6,
    aspect_ratio: "9:16",
    generate_audio: false
  },
  logs: true
});

console.log(result.data.video.url);
```
generate_audio is off because this vertical ad gets scored in post. aspect_ratio: "9:16" picks vertical directly, no cropping.
Mixed mode: when you need it
Take a sports brand piece: 3 images of the athlete, 1 short video clip of the footwork, 1 audio cue (shoe squeak plus crowd pop). That is 5 files out of 12, sending a tight signal across three modalities. The model locks subject from the images, motion from the video, and timing from the audio. You spend 3 calls on prompt variations rather than 15 fighting motion.
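Here is a rough sketch of what that 5 file request body could look like. Only `reference_images` and the four settings below it appear in the image only example; `reference_videos` and `reference_audios` are assumed field names, the URLs are placeholders, and `generate_audio: true` is just an example value. Check the endpoint schema before relying on any of it.

```typescript
// Shape of the request input for this sketch.
// reference_videos and reference_audios are ASSUMED names, not confirmed API fields.
interface MixedModeInput {
  prompt: string;
  reference_images: string[];
  reference_videos: string[]; // assumed field name
  reference_audios: string[]; // assumed field name
  resolution: string;
  duration: number;
  aspect_ratio: string;
  generate_audio: boolean;
}

// Placeholder URLs; swap in your own uploaded files.
const ATHLETE_REFS = {
  images: [
    "https://example.com/athlete-front.jpg",
    "https://example.com/athlete-side.jpg",
    "https://example.com/athlete-three-quarter.jpg",
  ],
  videos: ["https://example.com/footwork-3s.mp4"],
  audios: ["https://example.com/squeak-crowd-pop.mp3"],
};

// Keeps the references locked and lets only the prompt change between calls.
function buildMixedInput(prompt: string): MixedModeInput {
  return {
    prompt,
    reference_images: ATHLETE_REFS.images,
    reference_videos: ATHLETE_REFS.videos,
    reference_audios: ATHLETE_REFS.audios,
    resolution: "720p",
    duration: 6,
    aspect_ratio: "9:16",
    generate_audio: true, // example value only
  };
}

// Total files this input spends out of the 12 file budget.
function countReferences(input: MixedModeInput): number {
  return (
    input.reference_images.length +
    input.reference_videos.length +
    input.reference_audios.length
  );
}

// Three prompt variations over the same five references.
const variations: MixedModeInput[] = [
  "The athlete cuts left on the squeak, low tracking shot.",
  "The athlete cuts left on the squeak, overhead shot.",
  "The athlete cuts left on the squeak, slow push in from behind.",
].map((prompt) => buildMixedInput(prompt));
```

Each entry in `variations` would be passed as the `input` of a `fal.subscribe` call, the same way the image only example does it.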

Common mistakes to avoid
- Throwing 9 images of the same subject in the same pose. Wasted budget. 3 varied angles do more.
- Using long video references. A 15 second reference competes with your 15 second output. Keep references under 4 seconds.
- Adding audio to a silent output. If you pass audio refs but set generate_audio: false, the audio still shapes motion timing. Decide consciously.
- Letting payload creep past per file caps. A 31MB image rejects the whole call. Pre compress before you submit.
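Three of these mistakes can be caught mechanically before a call goes out. The sketch below is one way to do that; `PlannedCall` and `lintPlannedCall` are hypothetical names, and the thresholds are simply the numbers from the list above. Detecting same pose images needs a human eye, so it is left out.

```typescript
// Hypothetical summary of a call you are about to make.
interface PlannedCall {
  imageSizesMB: number[];
  videoRefSeconds: number[]; // one entry per video reference
  audioRefCount: number;
  generateAudio: boolean;
}

// Returns warnings for the mechanical mistakes from the list above.
function lintPlannedCall(call: PlannedCall): string[] {
  const warnings: string[] = [];

  // Long references compete with the output; the post suggests under 4 seconds.
  if (call.videoRefSeconds.some((s) => s >= 4)) {
    warnings.push("a video reference is 4 seconds or longer; trim it");
  }

  // Audio refs shape motion timing even when the output is silent.
  if (call.audioRefCount > 0 && !call.generateAudio) {
    warnings.push("audio refs are set while generate_audio is false; confirm this is intended");
  }

  // One oversized image rejects the whole call.
  if (call.imageSizesMB.some((mb) => mb >= 30)) {
    warnings.push("an image is at or over the 30MB cap; compress it first");
  }

  return warnings;
}
```

An empty result does not mean the reference set is good, only that none of these three checks fired.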
Budgeting for iteration
At 720p on Standard, reference to video costs $0.3024 per second, so a 6 second clip costs about $1.81. Pick your 3 or 4 most important references, lock them, and vary only the prompt across a 10 to 20 clip sweep. Do not vary references and prompt together or you won't know what moved the needle.
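The arithmetic is simple enough to script before you commit to a sweep. This sketch uses the $0.3024 per second rate quoted above; the function names are illustrative, and the rate should be re-checked against current pricing.

```typescript
// Rate quoted in this post for 720p on Standard; verify before budgeting.
const STANDARD_720P_USD_PER_SECOND = 0.3024;

// Cost of a single clip of the given length.
function clipCostUsd(
  durationSeconds: number,
  ratePerSecond: number = STANDARD_720P_USD_PER_SECOND
): number {
  return durationSeconds * ratePerSecond;
}

// Cost of a sweep where every clip has the same length.
function sweepCostUsd(clipCount: number, durationSeconds: number): number {
  return clipCount * clipCostUsd(durationSeconds);
}

// Builds one job per prompt while reusing the exact same locked references,
// so the prompt is the only variable in the sweep.
function buildPromptSweep(
  lockedReferences: string[],
  prompts: string[]
): { prompt: string; reference_images: string[] }[] {
  return prompts.map((prompt) => ({ prompt, reference_images: lockedReferences }));
}

console.log(clipCostUsd(6).toFixed(2)); // "1.81"
console.log(sweepCostUsd(10, 6).toFixed(2)); // "18.14"
console.log(sweepCostUsd(20, 6).toFixed(2)); // "36.29"
```

So a 10 to 20 clip sweep at 6 seconds each lands between roughly $18 and $36, which is worth knowing before you start varying prompts.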