Seedance 2.0: ByteDance Built This for Editors, Not Prompters
Seedance 2.0 brings unified multimodal video generation to CapCut and beyond. Text, image, audio, and video inputs together, synchronized audio out, 5 to 15 seconds, 480p or 720p. Here is what it actually does and how it compares.
The useful way to think about it
Seedance 2 is not just another prompt box that spits out a silent clip. The interesting part is where ByteDance put it and what the model tries to solve. It launched publicly through ByteDance Seed in February 2026, then moved into CapCut, Dreamina, and related creator products through a staged rollout that reached the United States in April.
That placement matters. CapCut is already where a lot of short-form video gets cut, captioned, resized, and shipped. A video model inside that workflow has a different job from a standalone demo site. It needs to make usable shots, preserve timing, give editors something they can drop into a timeline, and handle audio before the last export step.
That is why Seedance 2 feels more like an editing primitive than a novelty generator. You can still type a prompt and hope for the best, but the stronger use case is giving it references and asking for a directed result.
What ByteDance actually announced
ByteDance describes Seedance 2 as a unified multimodal audio-video generation model. Plain English version: it can take text, images, audio, and video together, then generate a video where the sound and picture are produced as one piece rather than stitched together afterward.
The official launch post says the model can use up to nine images, three video clips, three audio clips, and natural-language instructions in one request. Those references can guide composition, motion, camera movement, style, sound, and editing intent. It also supports video extension and targeted editing, so the model is not limited to creating a fresh clip from a blank prompt.
The same post claims better performance than the previous generation on physical accuracy, visual realism, controllability, multi-subject interaction, and complex motion. I would treat those as vendor claims, not neutral proof. Still, the claimed direction matches what makes the model interesting: it is trying to make one coherent shot sequence, with audio, from messy creative inputs.
The audio is the point
Most text-to-video comparisons focus on visual sharpness, camera motion, and whether hands fall apart. Those still matter. But with Seedance 2, audio changes the evaluation.
ByteDance says the model supports dual-channel audio and can produce background music, ambient sound, sound effects, and character voiceover aligned with the visual rhythm. Runware's documentation also lists synchronized audio as on by default. The practical difference is simple: you are not asking one system for video and another system for sound effects. You are asking for a clip where footsteps, fabric movement, rain, cuts, voice, and action are meant to arrive together.
That does not mean every result will be production-ready. Audio generation can still drift, flatten, or make strange choices. But a rough clip with synchronized foley is much easier to judge than a silent clip where the editor has to imagine the sound design later. For social video, ads, quick storyboards, and creator tests, that saves time.
Inputs, duration, and formats
For text-to-video work, the user-facing shape is straightforward. You write a prompt, choose an aspect ratio and resolution, decide how long the clip should run, and generate. The model is more flexible when you add references.
The current supported reference mix includes up to nine still images, up to three video references, and up to three audio references. You can also guide motion from a first frame and a last frame when you want a clip to move between known visual endpoints. Those endpoint controls are useful for product shots, character poses, or scenes where the opening and closing composition matter more than free exploration.
On Z.Tools, available durations run from 5 to 15 seconds in one-second steps. The frame rate is 24 frames per second. The exposed resolution tiers are 480p and 720p, with six aspect ratios: widescreen, square, vertical, portrait, standard 4:3, and ultrawide. Runware's public documentation also mentions a 1080p option, but the current Z.Tools-facing pricing and UI are centered on 480p and 720p, so those are the tiers I would plan around.
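To make those knobs concrete, here is a rough sketch of how a request could be assembled. The endpoint and field names are my own placeholders, not a documented Runware or Z.Tools API; only the limits (nine images, three videos, three audio clips, first and last frame, 5 to 15 seconds, 480p or 720p) come from the sources above.

```python
import requests  # ordinary HTTP client; the real integration may look different

# Hypothetical endpoint and field names, for illustration only.
API_URL = "https://api.example.com/v1/seedance2/generate"

request_body = {
    "prompt": "Rain-soaked street at dusk, handheld camera follows a courier",
    # Reference limits per the launch post: up to 9 images, 3 videos, 3 audio clips.
    "reference_images": ["ref_01.png", "ref_02.png"],   # up to 9
    "reference_videos": ["pacing_example.mp4"],         # up to 3
    "reference_audio":  ["ambience_rain.wav"],          # up to 3
    # Optional endpoint controls: pin the opening and closing composition.
    "first_frame": "opening_frame.png",
    "last_frame": "closing_frame.png",
    # Output settings within the ranges exposed on Z.Tools.
    "duration_seconds": 8,        # 5 to 15, one-second steps
    "resolution": "720p",         # 480p or 720p
    "aspect_ratio": "9:16",       # vertical, for short-form feeds
    "synchronized_audio": True,   # Runware docs list audio as on by default
}

response = requests.post(API_URL, json=request_body, timeout=300)
response.raise_for_status()
print(response.json())  # a real API would return a clip URL or a job id here
```

The shape matters more than the names: one request carries the prompt, every reference, the endpoint frames, and the output settings, which is what "unified multimodal" means in practice.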
What it costs right now
Current Runware pricing for the full version is $0.07 per output second at 480p and $0.16 per output second at 720p for text-to-video or image-to-video. Video-to-video costs more because the model has to process an existing clip: $0.13 per second at 480p and $0.28 per second at 720p.
There is also a faster variant. It costs $0.06 per output second at 480p and $0.13 per output second at 720p for text-to-video or image-to-video. For video-to-video, it costs $0.10 per second at 480p and $0.21 per second at 720p.
So a five-second 480p text-to-video test costs about $0.35 on the full version or about $0.30 on the faster one. A 15-second 720p text-to-video clip costs about $2.40 on the full version or about $1.95 on the faster one. Those numbers make the faster variant the obvious place to iterate. Move to the full version when the prompt, references, and shot direction are already close.
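Since the rates are per output second, budgeting is plain arithmetic. A minimal sketch using only the Runware rates quoted above (the table and function are mine, not part of any SDK):

```python
# Runware per-output-second rates quoted above, in USD,
# keyed by (variant, mode, resolution).
RATES = {
    ("full", "t2v/i2v", "480p"): 0.07,
    ("full", "t2v/i2v", "720p"): 0.16,
    ("full", "v2v",     "480p"): 0.13,
    ("full", "v2v",     "720p"): 0.28,
    ("fast", "t2v/i2v", "480p"): 0.06,
    ("fast", "t2v/i2v", "720p"): 0.13,
    ("fast", "v2v",     "480p"): 0.10,
    ("fast", "v2v",     "720p"): 0.21,
}

def clip_cost(seconds: int, variant: str, mode: str, resolution: str) -> float:
    """Cost of one clip: duration times the quoted per-second rate."""
    return seconds * RATES[(variant, mode, resolution)]

# The examples from the text (floating-point output, so expect tiny rounding):
print(clip_cost(5,  "full", "t2v/i2v", "480p"))   # 0.35
print(clip_cost(5,  "fast", "t2v/i2v", "480p"))   # 0.30
print(clip_cost(15, "full", "t2v/i2v", "720p"))   # 2.40
print(clip_cost(15, "fast", "t2v/i2v", "720p"))   # 1.95
```

Run a few durations through it and the iterate-fast, finish-full pattern falls out of the numbers on its own.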
Where it sits against other models
Artificial Analysis currently ranks Dreamina's public Seedance entry near the top of its text-to-video leaderboard based on blind user preferences. In the most recent leaderboard snapshot I found, HappyHorse 1.0 was first, Seedance 2 was second at 720p, and Kling 3.0 Pro followed close behind at 1080p. The exact Elo numbers move as voting continues, so treat the ordering as directional rather than permanent.
The important caveat: Artificial Analysis separates no-audio and with-audio views. A silent leaderboard does not fully capture what Seedance 2 is trying to do. If you only care about the cleanest silent image sequence, HappyHorse may be the more exciting benchmark story right now. If you care about synchronized audio-video output, Seedance 2 becomes more compelling.
Kling 3.0 is still a serious comparison because it is strong, familiar, and available in high-resolution workflows. But the practical question is not just which model wins a leaderboard round. It is which model gives you a usable draft fastest for the kind of clip you are making.
The CapCut angle
CapCut's newsroom post frames the rollout as a paid-user feature with gradual market expansion. The first named markets included Indonesia, the Philippines, Thailand, Vietnam, Malaysia, Brazil, and Mexico, followed by more regions across Africa, South America, Europe, the Middle East, Japan, and the United States.
The distribution strategy tells you how ByteDance wants people to use the model. It is not asking creators to leave the editing surface, generate a pile of clips elsewhere, download them, and reconstruct the project by hand. It wants generation to sit next to trimming, captions, resizing, overlays, and export.
That is smart. The first generation is rarely the final asset. It is usually a candidate. You scrub it, reject a few seconds, change the prompt, swap a reference, try another aspect ratio, or keep only the first half. A model that lives close to the edit timeline fits that reality better than a model that treats generation as the whole job.
