SkyReels V4: What to Know About Skywork's Multimodal Video Model

A practical guide to SkyReels V4, including its multimodal inputs, audio-video generation, benchmark claims, pricing shape, and when to test it against other AI video models.


SkyReels V4 is more than another prompt-to-video model with a fresh version number.

The more interesting claim is that Skywork is trying to make video, audio, references, masks, and edits part of one model family. That matters because AI video work is rarely a clean "type prompt, get final clip" process. A real workflow usually involves a prompt, a first frame, a motion reference, an audio reference, a masked edit, or a request to extend an existing shot without breaking the original rhythm.

SkyReels V4 is built around that messier reality.

The technical report describes it as a unified multimodal video foundation model for joint video-audio generation, inpainting, and editing. In the local AI Video Generator configuration, the model ID is skywork:skyreels@v4, exposed through the Skywork provider.

The practical question is not whether the model sounds advanced. It does. The question is where it is worth testing.

My short answer: test SkyReels V4 when the clip needs audio-aware generation, reference-guided control, or video-to-video editing. If you only need a silent five-second visual concept, you may still want to compare it against faster or cheaper models. But if your prompt involves sound, a reference image, a reference clip, or an edit instruction, SkyReels V4 belongs on the shortlist.

What SkyReels V4 does

SkyReels V4 supports the usual video generation modes, but its strength is the combination:

  • Text-to-video
  • Image-to-video
  • Reference-to-video
  • Video-to-video
  • Video editing
  • Optional synchronized audio
  • Prompt extension

The local configuration exposes three resolution families: 480p, 720p, and 1080p. It supports clips from 3 to 15 seconds. The technical report says the model can generate up to 1080p, 32 FPS, and 15 seconds, with synchronized audio.

SkyReels V4 specs

The important part is the input design. SkyReels V4 can work from text, images, video clips, masks, and audio references. That makes it better suited to directed generation than a model that only accepts a prompt and maybe one start frame.

Here is the local tool-facing spec summary:

Model ID: skywork:skyreels@v4
Provider: Skywork
Modes: text-to-video, image-to-video, reference-to-video, video-to-video, video-edit
Duration: 3-15 seconds
Resolution families: 480p, 720p, 1080p
Aspect ratios: 16:9, 9:16, 4:3, 3:4, 1:1
Prompt limit: 5,120 characters
Frame images: first and last frames
Reference images: up to 3
Reference videos: up to 1
Audio input: up to 1
Provider settings: audio, promptExtend

This is the kind of table I want before choosing a model. It says what the model can accept, what the output can look like, and what settings are actually exposed in the tool.
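As a concrete sketch, the spec summary above maps to a request shape like the following. This is illustrative only: the field names and structure are my assumptions, not the tool's actual schema, and only the limits (duration, prompt length, reference counts) come from the table.

```python
# Hypothetical request payload for SkyReels V4, mirroring the spec summary.
# Field names are illustrative; the real tool schema may differ.
request = {
    "model": "skywork:skyreels@v4",
    "mode": "image-to-video",          # one of the five exposed modes
    "prompt": "A lighthouse at dusk, waves rolling in, ambient wind",
    "duration": 5,                     # seconds; 3-15 allowed
    "resolution": "720p",              # 480p | 720p | 1080p
    "aspect_ratio": "16:9",            # 16:9, 9:16, 4:3, 3:4, 1:1
    "first_frame": "frame_start.png",  # optional first/last frame images
    "reference_images": [],            # up to 3; not combinable with a reference video
    "provider_settings": {
        "audio": True,                 # synchronized audio on/off
        "promptExtend": True,          # let the provider expand the prompt
    },
}

# Sanity checks against the documented limits.
assert len(request["prompt"]) <= 5120
assert 3 <= request["duration"] <= 15
assert len(request["reference_images"]) <= 3
```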

What the technical report claims

Skywork's arXiv report is unusually useful because it does more than list features. It explains the model as a dual-stream architecture for joint video-audio generation, with cross-attention between audio and video so the two modalities can stay synchronized.

The report also describes several training phases:

Video pretraining: learn visual generation and editing across images and videos
Audio pretraining: learn audio generation from large audio datasets
Video-audio joint training: align video and audio generation
Supervised fine-tuning: improve instruction following and production behavior

For non-research users, the takeaway is simpler: SkyReels V4 is trying to avoid the "generate silent video, then add sound later" workflow. That does not mean every interface exposes every audio feature equally, but the model's design is clearly audio-aware.

The report also says the model supports multi-shot sequences suitable for film-like output. I would treat that as an invitation to test scene structure, not as permission to overload one prompt with ten actions. AI video still behaves better when the shot has a clear subject, one main motion idea, and a duration that matches the requested action.

The benchmark data worth citing

Skywork introduced SkyReels-VABench for audio-video evaluation. According to the technical report, the benchmark uses 2000+ curated prompts and a panel of 50 professional evaluators. It covers text-to-video and image-to-video tasks, with and without audio.

The paper reports these SkyReels V4 ranks:

Text-to-video with audio: #1
Text-to-video without audio: #2
Image-to-video with audio: #4
Image-to-video without audio: #7

SkyReels-VABench reported ranks

The benchmark dimensions are also worth noting. The report describes human evaluation across prompt following, audio-visual synchronization, visual quality, motion quality, and audio quality. That is a better fit for this model than a visual-only leaderboard, because SkyReels V4 is partly selling the idea that audio and video should be generated together.

There is still a caveat. SkyReels-VABench is proposed by the same group that released the model. That does not make it useless, but it does mean the rankings should be read as evidence, not as final proof. For production decisions, I would use the benchmark to decide what to test, then run the same prompt across the models you actually have access to.

Pricing changes by workflow

SkyReels V4 is not priced as one flat "video model" in the local configuration. The workflow matters.

480p: $0.11/s text/image-to-video, $0.18/s video-to-video/edit
720p: $0.14/s text/image-to-video, $0.25/s video-to-video/edit
1080p: $0.35/s text/image-to-video, $0.625/s video-to-video/edit

SkyReels V4 pricing matrix

That difference is useful because it matches compute intuition. Transforming or editing an existing video is usually more expensive than generating from text or a still image. A 15-second 1080p video-to-video/edit run can get expensive quickly.
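To make "expensive quickly" concrete, here is a small cost estimator built from the pricing matrix above. The prices are the documented per-second rates; the function and its names are just a sketch.

```python
# Per-second prices from the local pricing matrix (USD/second).
# "generate" = text/image-to-video; "edit" = video-to-video and video-edit.
PRICES = {
    ("480p", "generate"): 0.11,  ("480p", "edit"): 0.18,
    ("720p", "generate"): 0.14,  ("720p", "edit"): 0.25,
    ("1080p", "generate"): 0.35, ("1080p", "edit"): 0.625,
}

def estimate_cost(resolution: str, workflow: str, seconds: float) -> float:
    """Estimate the cost of a single run in USD."""
    return round(PRICES[(resolution, workflow)] * seconds, 3)

print(estimate_cost("480p", "generate", 5))   # cheap 5-second draft: $0.55
print(estimate_cost("1080p", "edit", 15))     # full 15-second edit run: $9.375
```

A 15-second 1080p edit costs roughly 17x a 5-second 480p draft, which is the math behind the conservative workflow below.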

So the testing workflow should be conservative:

  1. Start with 480p or 720p.
  2. Keep the clip near 5 seconds unless the motion genuinely needs more time.
  3. Test prompt extension both on and off if the model is changing your intent too much.
  4. Only move to 1080p after the motion and audio direction work.
  5. Use video-to-video or edit mode when you need that control, not as a default.

This is not about being cheap for its own sake. It is about learning before paying for a full-quality render.

Where SkyReels V4 should be strong

SkyReels V4 is most interesting when a clip has more than one type of input.

Good candidates:

  • A reference image plus generated motion and sound
  • A video clip that needs subject replacement or a style change
  • A prompt that depends on ambient audio, speech, music, or sound effects
  • A short cinematic sequence with multiple visual cues
  • A first-and-last-frame animation where the transition matters
  • A masked edit to change part of an existing video
  • A reference-video workflow where motion should guide the output

The paper's examples include image plus audio reference, image plus motion reference, subject plus motion reference, masked edits, scene attribute changes, style transfer, background replacement, and watermark/text removal. That is a broad editing surface. In a hosted tool, the exposed controls may be narrower, so check what the interface actually offers rather than relying on the paper.

Where I would be careful

SkyReels V4 has a lot of knobs. That is useful, but it also means it is easier to ask for incompatible inputs.

The local configuration has several constraints:

  • Provide a frame image, reference video, or width/height/resolution.
  • Reference images cannot be combined with reference videos.
  • Reference videos can be used for reference or extension workflows.
  • Reference videos must be no longer than 15 seconds.
  • Video-to-video pricing is higher than text/image-to-video.
  • Audio cannot always be enabled with every reference-video configuration.

That last part matters. "The model supports audio" does not mean "every mode with every reference can use audio." For an article or a product workflow, that distinction should be explicit.
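The constraints above can be expressed as a pre-flight check before spending on a render. This is a sketch of the documented rules, not the provider's actual validator, and the parameter names are mine.

```python
def validate_inputs(frame_images=(), reference_images=(), reference_videos=(),
                    width=None, height=None, resolution=None,
                    reference_video_seconds=0.0, audio=False):
    """Pre-flight check mirroring the documented constraints (sketch only)."""
    errors = []
    # Something must anchor the output size.
    if not (frame_images or reference_videos or resolution or (width and height)):
        errors.append("provide a frame image, reference video, or width/height/resolution")
    if reference_images and reference_videos:
        errors.append("reference images cannot be combined with reference videos")
    if len(reference_images) > 3:
        errors.append("at most 3 reference images")
    if len(reference_videos) > 1:
        errors.append("at most 1 reference video")
    if reference_videos and reference_video_seconds > 15:
        errors.append("reference video must be 15 seconds or shorter")
    # Audio availability depends on the reference-video configuration,
    # so flag the combination for a manual check instead of assuming it works.
    if audio and reference_videos:
        errors.append("verify audio is supported with this reference-video setup")
    return errors

print(validate_inputs(reference_images=["a.png"],
                      reference_videos=["b.mp4"], resolution="720p"))
```

Catching the incompatible-input case locally is cheaper than discovering it after a failed or mispriced run.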

I would also be careful with multi-shot prompting. The technical report talks about multi-shot capability, but a crowded prompt can still produce a clip that feels confused. If the scene has a setup, a cut, a reaction, and an audio beat, it may be better to generate separate clips and edit them together.

How I would test it

I would test SkyReels V4 with three prompts, not one.

First, run a clean text-to-video prompt with audio disabled. This tells you the base visual style, motion stability, and camera behavior.

Second, run a similar prompt with audio enabled. Listen for whether the sound helps or becomes noise. Watch whether the visual timing feels more intentional.

Third, run a reference-guided test. Use a source image or short video and ask for a constrained transformation. This is where SkyReels V4 should justify its extra complexity.
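The three-pass plan above can be written down as data so the runs stay comparable. The prompts, file names, and any `generate` call you wire this to are placeholders; only the structure of the test matrix is the point.

```python
# Three-pass test plan as data. Prompts and file names are placeholders.
test_runs = [
    {"name": "baseline",  "mode": "text-to-video",
     "prompt": "A street musician at dusk, slow dolly-in", "audio": False},
    {"name": "audio",     "mode": "text-to-video",
     "prompt": "A street musician at dusk, slow dolly-in", "audio": True},
    {"name": "reference", "mode": "reference-to-video",
     "prompt": "Same subject, shift to rainy night lighting",
     "reference_images": ["musician.png"], "audio": True},
]

for run in test_runs:
    print(run["name"], "->", run["mode"], "| audio:", run["audio"])
```

Keeping the first two prompts identical except for the audio flag is what makes the second comparison meaningful.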

Judge the outputs on:

  • Prompt following
  • Motion stability
  • Audio-video sync
  • Subject consistency
  • Edit faithfulness
  • Cost at the resolution you actually need

If the model only wins on one dimension, use it for that dimension. AI video workflows do not need one permanent champion. They need the right model for the next clip.

Bottom line

SkyReels V4 is worth writing about because it makes a serious attempt at multimodal video production rather than isolated prompt-to-video generation. The paper's benchmark claims are strong, especially for text-to-video with audio, and the local tool configuration exposes the model in a way that matches its pitch: text, images, reference videos, edits, optional audio, and up to 1080p output.

I would not use it blindly for every AI video task. I would test it when audio, references, or video edits matter. For simple silent drafts, compare it against cheaper or faster options first.
