Wan 2.7: Alibaba's Open Video Model Gets Sharper Controls and a Longer Prompt Window

Wan 2.7 adds first/last-frame control, 9-grid multi-image input, instruction-based video editing, and a 5,000-character prompt limit. It runs on the same Wan architecture but with tighter motion consistency and more capable reference workflows. Here is what changed and when to use it over a proprietary model.


why Wan 2.7 (Alibaba) is worth a closer look

Wan 2.7 (Alibaba) is a practical update, not a vague "better quality" release. The useful change is control. Alibaba's April 2026 launch post frames the model as a fuller video creation system, with text, image, video, and audio inputs feeding generation, reference work, continuation, and editing. The official Model Studio pages then make the shape clearer: the suite covers text-to-video, image-to-video, reference-guided video, and instruction-based editing.

That matters because short AI video is no longer just about asking for five seconds of motion and hoping the camera behaves. The hard part is repeatability. Can you hold a character across shots? Can you define where a clip begins and where it should land? Can you edit an existing clip without throwing away the motion? Wan 2.7 (Alibaba) is aimed at those problems.

It still lives in the short-form zone. This is not a replacement for a timeline editor, a compositor, or a human director. But compared with older Wan versions, it gives you more levers before you reach for those tools.

the launch: four video jobs, one family

Alibaba introduced the Wan 2.7 video models shortly after the Wan 2.7 image release. The video side is split into four jobs: prompt-only generation, image-led animation, reference-driven generation, and video editing. In plain terms, you can start from words, start from a still frame, guide the model with people or objects you want to keep consistent, or hand it a clip and ask for a style or content change.

The official model overview lists 720p and 1080p output, 30 fps, and MP4 delivery. Text and image generation can run from two to fifteen seconds. Reference and editing workflows are shorter, from two to ten seconds. That shorter cap makes sense. Once the model has to preserve a person, voice, pose, action, or source clip, the problem gets tighter.

Wan 2.7 (Alibaba) also supports synchronized audio in the main generation paths. For text prompts, Alibaba recommends the new version when you need narration, sound effects, or background music. For image-led generation, the API can use a supplied audio file as a driving source, or generate matching audio when you do not provide one. In editing, the audio behavior is more conservative: it can preserve the original track, or decide whether to regenerate sound based on the requested change.

That last detail is easy to overlook. Audio support is not one single feature across every mode. Sometimes the model creates sound with the video. Sometimes audio drives timing. Sometimes the safest result is to keep the original audio.
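To make that split concrete, here is a minimal sketch of how a wrapper might pick audio settings per job type. The field names (generate_audio, audio_url, keep_original_audio) are illustrative assumptions, not documented parameters of Alibaba's or Runware's APIs.

```python
# Hypothetical payload builder: the field names (generate_audio, audio_url,
# keep_original_audio) are illustrative, not Alibaba's real parameter names.
def audio_options(mode: str, audio_file_url: str | None = None) -> dict:
    """Pick audio settings per job type, mirroring the behavior described above."""
    if mode == "text_to_video":
        # The model generates narration, sound effects, or music with the video.
        return {"generate_audio": True}
    if mode == "image_to_video":
        # A supplied audio file drives timing; otherwise ask for generated audio.
        if audio_file_url:
            return {"audio_url": audio_file_url, "generate_audio": False}
        return {"generate_audio": True}
    if mode == "edit":
        # Conservative default: keep the source track and let the service decide
        # whether the requested change forces sound to be regenerated.
        return {"keep_original_audio": True}
    raise ValueError(f"unknown mode: {mode}")
```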

what changed from the older Wan workflow

The most visible upgrade is first- and last-frame control. Older image-to-video flows were usually anchored at the start: give the model a first frame, describe motion, and accept where it ends. Wan 2.7 (Alibaba) can take both endpoints. You provide the opening image and the closing image, then the model fills in the motion between them.

That is useful for product shots, social ads, storyboard tests, and any sequence where the final pose matters. A shoe rotates from a side view to a hero angle. A character turns from camera left to camera right. A room changes lighting state while the composition stays planned. You still need to test, but you are no longer asking the model to invent the destination.
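As a rough sketch, a first/last-frame request could look like the call below. The endpoint, model id, and JSON keys are placeholders, not the documented Model Studio or Runware schema; check your provider's reference before reusing them.

```python
import requests  # any HTTP client works; everything below is illustrative

# Hypothetical request: the endpoint, model id, and JSON keys are placeholders,
# not the documented Model Studio or Runware schema.
payload = {
    "model": "wan2.7-i2v",  # placeholder model id
    "prompt": (
        "The sneaker rotates from a side profile to a three-quarter hero angle, "
        "studio lighting, slow dolly-in."
    ),
    "first_frame_url": "https://example.com/sneaker_side.png",
    "last_frame_url": "https://example.com/sneaker_hero.png",
    "duration": 5,          # seconds
    "resolution": "720p",
}
resp = requests.post(
    "https://api.example.com/v1/video/generations",  # placeholder endpoint
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # most providers return a task id to poll for the finished MP4
```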

Video continuation is the second big piece. The image-to-video API lets an existing clip act as the beginning of a longer result. Alibaba's own documentation describes a case where a short input clip is continued until the total output reaches the requested duration. This is one of the cleaner ways to build sequences from short AI clips: generate a section, inspect it, then extend from a usable endpoint.
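Under the same hypothetical schema, a continuation request mostly comes down to supplying the approved clip as the opening footage and asking for a larger total duration:

```python
# Hypothetical continuation request: the approved clip becomes the opening footage
# and the model generates until the total requested duration is reached.
# Key names are placeholders, not a documented schema.
continuation_payload = {
    "model": "wan2.7-i2v",                                     # placeholder model id
    "video_url": "https://example.com/approved_opening.mp4",   # e.g. a 4-second source clip
    "prompt": (
        "Continue the shot: the camera keeps dollying forward as the house lights "
        "come up and the crowd turns toward the stage."
    ),
    "duration": 10,     # total output length; the source seconds count toward it
    "resolution": "720p",
}
```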

Instruction-based editing is the third. Wan 2.7 (Alibaba) can take an existing clip and a text instruction, then change the scene, style, objects, or presentation without starting over. The examples in Alibaba's launch material lean into broad creative edits: character action, dialogue, appearance, scenery, visual style, and camera treatment. I would still treat this as draft-grade editing. It is best for fast exploration, not final client delivery.
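An instruction edit, again with placeholder key names, is just a source clip plus the change you want:

```python
# Hypothetical edit request; the keys are placeholders for whichever
# instruction-editing endpoint you integrate with.
edit_payload = {
    "model": "wan2.7-edit",                                    # placeholder model id
    "video_url": "https://example.com/source_clip.mp4",
    "instruction": (
        "Move the scene to a rainy night street, keep the character's walk cycle "
        "and framing, and shift the color grade toward cool neon."
    ),
    "resolution": "720p",
}
```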

the prompt window is longer, but that does not make prompts easier

Wan 2.7 (Alibaba) supports prompts up to 5,000 characters for the newer image-led API path, with negative prompts up to 500 characters. Runware's model page exposes the same 5,000-character positive prompt limit for its Wan 2.7 (Alibaba) integration.

Longer prompts help when the scene has real structure: shot order, lighting, subject behavior, wardrobe, camera movement, audio intent, and constraints that should not drift. They also make it easier to describe multi-shot sequences in one request.

The trap is writing a bloated prompt because the window allows it. More words can help, but only when they reduce ambiguity. A clean brief usually beats a giant paragraph stuffed with adjectives. For Wan 2.7 (Alibaba), I would spend the extra space on sequence and continuity: what stays fixed, what changes, how the camera moves, and what the sound should imply.
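One way to spend the window well is to assemble the prompt from a small structured brief instead of one giant paragraph. The section labels below are only a convention I find useful, not a format Wan requires, and the length checks mirror the limits quoted above.

```python
# A structured brief keeps the long prompt focused on sequence and continuity.
# The section labels are just a convention, not a format Wan requires.
BRIEF = {
    "fixed": "Same actor, same red jacket, same kitchen set in every shot.",
    "shots": [
        "Shot 1 (0-4s): medium shot, the actor pours coffee, soft morning light.",
        "Shot 2 (4-8s): slow push-in to a close-up as steam rises from the mug.",
        "Shot 3 (8-12s): overhead of the mug set down on the counter.",
    ],
    "camera": "Handheld but stable, no whip pans, 35mm look.",
    "audio": "Quiet room tone, a soft pour and a clink, no music.",
    "negative": "no text overlays, no extra people, no lens flares",
}

prompt = "\n".join([BRIEF["fixed"], *BRIEF["shots"], BRIEF["camera"], BRIEF["audio"]])
negative_prompt = BRIEF["negative"]

assert len(prompt) <= 5000          # positive prompt limit on the newer API path
assert len(negative_prompt) <= 500  # negative prompt limit
```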

reference work and consistency

Reference-driven generation is where Wan 2.7 (Alibaba) feels most production-oriented. Alibaba describes the reference model as a way to maintain character consistency across scenes. The public launch post also talks about keeping visual identity and voice tone for several distinct characters. Runware's implementation exposes reference image and reference video workflows alongside ordinary text and image generation.

The practical use case is simple: when a subject matters, do not rely on the prompt alone. Give the model visual material. That can mean a product from several angles, a character sheet, a brand object, or a source clip whose motion and style should guide the output.

There are limits. Reference-guided and video-guided runs are capped at ten seconds in the implementation used here. The model can help you keep a character recognizable, but it will not magically solve continuity. Clothing, hands, small accessories, and exact facial structure still deserve close review.
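A reference-guided request, once more with placeholder key names, mostly amounts to attaching that visual material alongside the prompt:

```python
# Hypothetical reference-guided request; key names are illustrative only.
reference_payload = {
    "model": "wan2.7-reference",                   # placeholder model id
    "prompt": "The same character walks through a street market, then sits at a café table.",
    "reference_images": [
        "https://example.com/character_front.png",
        "https://example.com/character_profile.png",
        "https://example.com/character_outfit_detail.png",
    ],
    "duration": 8,       # reference runs are capped at ten seconds in this implementation
    "resolution": "720p",
}
```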

pricing in normal terms

Through the Runware integration used for this article, current pricing for Wan 2.7 (Alibaba) is ten cents per second at 720p and fifteen cents per second at 1080p. The minimum two-second examples are twenty cents at 720p and thirty cents at 1080p.

Both input video seconds and output video seconds can be charged when a video input is involved. That is the pricing detail to watch. A five-second video-guided run that produces five seconds of output at 720p can cost about one dollar because the system bills the source duration and the generated duration. A plain five-second 720p text or image generation is about fifty cents.

At 1080p, a ten-second generation is about one dollar and fifty cents if only output seconds are billed. With a ten-second source video and a ten-second output, the same resolution can land around three dollars.
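The arithmetic is simple enough to wrap in a quick estimator. The rates below are the Runware figures quoted above, and the input-seconds term is an assumption that only applies when a source video is part of the request:

```python
# Back-of-envelope cost estimator using the per-second rates quoted above.
# Input seconds are assumed to be billed only when a source video is supplied.
RATES = {"720p": 0.10, "1080p": 0.15}   # USD per billable second

def estimate_cost(output_seconds: float, resolution: str = "720p",
                  input_seconds: float = 0.0) -> float:
    return RATES[resolution] * (output_seconds + input_seconds)

print(estimate_cost(5))                    # plain 5 s at 720p          -> $0.50
print(estimate_cost(5, input_seconds=5))   # video-guided 5 s at 720p   -> $1.00
print(estimate_cost(10, "1080p"))          # 10 s at 1080p, output only -> $1.50
print(estimate_cost(10, "1080p", 10))      # 10 s source + 10 s output  -> $3.00
```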

The main takeaway: Wan 2.7 (Alibaba) is not expensive for quick tests, but reference and video-guided work can double the billable seconds. Keep early explorations short. Once a direction works, raise the duration or resolution.

where it fits against closed video tools

Wan 2.7 (Alibaba) should not be judged only by whether a single clip looks better than the best output from a closed model. That comparison changes every month, and it misses the point. Wan is becoming a workflow family. It is strong when you care about control, repeatable subjects, endpoint frames, and editing passes.

Closed video tools can still win on polish, defaults, and user experience. Some are better for casual creators who just want a beautiful clip without thinking about the mechanics. Others may handle fast physics or cinematic camera motion more confidently on a given prompt.

I would reach for Wan 2.7 (Alibaba) when the shot has constraints: a fixed first frame, a required ending frame, a subject that must survive across clips, or an existing video that needs a controlled transformation. I would choose a more consumer-facing model when the only goal is to get the prettiest surprise from a short prompt.
