OmniHuman-1.5: ByteDance's Avatar Model That Generates Performance, Not Just Lip Sync

OmniHuman-1.5 from ByteDance animates a portrait image with an audio clip, but the goal is performance -- gestures, emotion, and intent -- not just mouth movement. Here is how the architecture works and what it means for practical use.

why this version feels different

Most talking-avatar demos ask you to forgive the body. The lips move, the face keeps roughly the same identity, and the rest of the person behaves like a cardboard cutout. That can be enough for a quick dubbing test, but it falls apart when the speaker needs to look amused, defensive, surprised, nervous, or simply awake.

OmniHuman 1.5, developed by ByteDance, is aimed at that missing middle between lip sync and performance. You give it a single image and an audio track. You can also add a short text prompt if you want a specific camera move, gesture, mood, or bit of action. The model then tries to generate a video where the character's body language follows the speech, not just the syllables.

That distinction matters. A product presenter should lean into a point. A singer should pause with the music instead of chattering through it. A cartoon penguin doing a voiceover can get away with bigger gestures than a realistic executive in a boardroom. OmniHuman 1.5 is interesting because it treats those as different animation problems.

what you can feed it

The public-facing workflow is simple: one reference image, one voice track, and optional text guidance. BytePlus describes the model as creating video from a single image plus multimodal prompts, with native 1080p output and support for rhythmic, emotional, and multi-person performances. In practical hosted tools, the usual sweet spot is still shorter audio. Z.Tools supports audio up to 30 seconds, with 15 seconds being the safer target when you care about consistency.
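Hosted providers do not publish an identical interface, but the submit-and-fetch shape is similar everywhere. The sketch below shows roughly what a request could look like in Python; the endpoint URL, field names, auth scheme, and response format are assumptions for illustration, not Z.Tools or BytePlus documentation.

```python
# Minimal sketch of calling a hosted OmniHuman 1.5 style endpoint.
# The URL, field names, and response shape are assumptions for illustration;
# check your provider's actual API documentation before using this.
import requests

API_URL = "https://example.com/v1/omnihuman/generate"  # hypothetical endpoint
API_KEY = "your-api-key"                               # hypothetical auth scheme

def generate_avatar_video(image_path: str, audio_path: str, prompt: str = "") -> bytes:
    """Submit one reference image, one audio clip, and an optional text prompt."""
    with open(image_path, "rb") as img, open(audio_path, "rb") as aud:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"image": img, "audio": aud},
            data={"prompt": prompt},  # e.g. "slow push-in, calm explanatory tone"
            timeout=600,
        )
    resp.raise_for_status()
    return resp.content  # assumed to be the rendered MP4 bytes

video = generate_avatar_video("presenter.jpg", "pitch_15s.wav",
                              "lean forward on the key point")
with open("output.mp4", "wb") as f:
    f.write(video)
```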

The image matters more than people expect. A face-only crop can work, but it gives the model less room to perform. A waist-up portrait with clear hands, shoulders, and natural lighting usually gives better body movement. Heavy shadows, sunglasses, strange occlusions, and tiny faces make the job harder. This isn't magic portrait repair. It is animation from evidence, and the evidence in the image sets the range of plausible motion.
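If you batch a lot of portraits, a small pre-flight check saves failed runs. This is a rough sketch based on the guidance above; the thresholds are assumptions, not published model requirements.

```python
# Rough pre-flight check for the reference image. The size and aspect
# thresholds are assumptions for illustration, not model requirements.
from PIL import Image

def check_reference_image(path: str, min_side: int = 512) -> list[str]:
    """Flag common problems before sending the image off for animation."""
    warnings = []
    img = Image.open(path)
    w, h = img.size
    if min(w, h) < min_side:
        warnings.append(f"small image ({w}x{h}); the face may be too tiny to animate well")
    if h <= w:
        warnings.append("square or landscape crop; a waist-up portrait usually moves better")
    elif h / w > 2.2:
        warnings.append("very tall crop; hands and shoulders may sit outside the frame")
    return warnings

for msg in check_reference_image("presenter.jpg"):
    print("warning:", msg)
```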

Audio quality matters too. Clean speech with natural pacing tends to produce tighter lip sync and more believable expression. Fast, clipped delivery can still work, but it gives the model less time to settle into gestures. Songs are supported, and ByteDance's demos put real emphasis on music, pauses, emotional shifts, and stage-like movement.
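Audio can get the same pre-flight treatment. The sketch below trims a clip to the roughly 15-second sweet spot and exports mono WAV using pydub (which needs ffmpeg installed); the cut-off reflects the hosted-tool guidance mentioned above, not a hard model limit.

```python
# Simple audio prep: trim to ~15 seconds and export mono WAV.
# pydub requires ffmpeg for most input formats. The 15-second target is
# hosted-tool guidance, not a documented model constraint.
from pydub import AudioSegment

def prepare_audio(src: str, dst: str, max_seconds: int = 15) -> str:
    clip = AudioSegment.from_file(src)
    clip = clip.set_channels(1)          # mono is plenty for lip sync
    clip = clip[: max_seconds * 1000]    # pydub slices in milliseconds
    clip.export(dst, format="wav")
    return dst

prepare_audio("raw_voiceover.mp3", "voiceover_15s.wav")
```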

the 1.5 upgrade in plain English

OmniHuman 1.0 was already a strong one-image human animation model. Its research page framed the problem as scaling up a one-stage conditioned system: train on mixed motion signals, then let a single model handle portraits, half-body shots, full bodies, cartoons, animals, and singing. It was impressive, especially because many older tools were stuck in talking-head territory.

OmniHuman 1.5 adds a different idea: the avatar should plan before it moves. ByteDance's paper calls this cognitive simulation, borrowing the "System 1" and "System 2" language from psychology. System 1 is the fast reactive layer: mouth shapes, rhythm, small idle motions, the things a model can infer directly from audio. System 2 is the slower planning layer: what is the speaker trying to say, what emotion fits the line, and what gesture would make sense in the scene?

That sounds abstract, but the effect is easy to picture. If the audio says, "Look over there," a basic model may keep the person staring into the camera while the mouth moves. OmniHuman 1.5 is built to infer that the character might glance aside or gesture toward something. If a singer hits a dramatic pause, it can hold a pose instead of filling every second with random motion.

Under the hood, ByteDance uses a multimodal language model to read the image, audio, and optional prompt and draft a plan for the animation. A diffusion transformer then handles the actual video synthesis. The important part for everyday users is that the model does not treat the audio as just a metronome; it also tries to read meaning.
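The paper's actual interfaces are not public, so the sketch below is only a conceptual outline of that two-stage flow: a planner reads the inputs and writes down intent, and a renderer turns audio plus plan into frames. Every class and field name here is invented for illustration; this is not ByteDance's code.

```python
# Conceptual sketch of the two-stage flow described above: a multimodal LLM
# drafts a performance plan (the slower "System 2" pass), and a diffusion
# transformer renders frames conditioned on the audio and that plan (the
# fast "System 1" pass). All names are invented stand-ins.
from dataclasses import dataclass

@dataclass
class PerformancePlan:
    emotion: str          # e.g. "amused", "defensive"
    gestures: list[str]   # e.g. ["glance aside", "point left"]
    camera: str           # e.g. "slow push-in"

class MultimodalPlanner:  # stand-in for the paper's multimodal language model
    def plan(self, image_path: str, audio_path: str, prompt: str) -> PerformancePlan:
        # The real system reasons about what the speaker means and what
        # gestures fit the scene; this placeholder just returns a fixed plan.
        return PerformancePlan(emotion="neutral", gestures=["small nod"], camera="static")

class DiffusionRenderer:  # stand-in for the diffusion transformer
    def render(self, image_path: str, audio_path: str, plan: PerformancePlan) -> str:
        # Follows the audio for timing (lip shapes, rhythm) and the plan for
        # higher-level motion; returns a path to the rendered video.
        return "output.mp4"

def animate(image_path: str, audio_path: str, prompt: str = "") -> str:
    plan = MultimodalPlanner().plan(image_path, audio_path, prompt)
    return DiffusionRenderer().render(image_path, audio_path, plan)
```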

the pseudo last frame trick

One of the smarter details in the paper is called the pseudo last frame design. The name is awkward, but the problem is familiar: if a model clings too tightly to the reference image, the character cannot move much. The avatar keeps trying to return to the original pose. That protects identity, but it also makes the result stiff.

ByteDance's solution is to use the reference image more like an identity anchor than a first frame that must be copied. During training, the model learns from real first and last frames. At generation time, the user's image is placed in the last-frame role, so the model has freedom to move while still being pulled back toward the person's appearance.
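A toy sketch makes the asymmetry easier to see. The function below builds the frame conditioning differently for training and inference, with the user's image dropped into the last-frame slot at generation time; the field names are invented for illustration, not the paper's notation.

```python
# Toy illustration of the pseudo last frame idea: training uses a clip's
# real first and last frames, while inference places the single reference
# image in the last-frame slot so it acts as an identity anchor rather than
# a pose the video must start from. Field names are invented.
def build_frame_conditioning(reference_image,
                             clip_first_frame=None,
                             clip_last_frame=None,
                             training: bool = False) -> dict:
    if training:
        # Training: real boundary frames taken from the video clip.
        return {"first_frame": clip_first_frame, "last_frame": clip_last_frame}
    # Inference: no real frames exist yet. The reference image goes into the
    # last-frame role, leaving the first frame free so motion is not forced
    # to begin from the reference pose.
    return {"first_frame": None, "last_frame": reference_image}
```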

That is part of why OmniHuman 1.5 can handle larger gestures and camera movement better than a strict portrait animator. It also explains why the input image should not be too cramped. If you want a person to use their arms, give the model an image where arms make visual sense.

benchmarks and reception

The official paper compares OmniHuman 1.5 with academic baselines on portrait animation and full-body animation. On portrait metrics, it sits close to OmniHuman 1.0. For example, on the CelebV-HQ portrait test, OmniHuman 1.5 reports a Fréchet Video Distance of 45.771 (lower is better), slightly ahead of OmniHuman 1.0's 46.393, while OmniHuman 1.0 is a little higher on one lip-sync score. That is a good reminder that version upgrades are not always a clean sweep on every metric.

The stronger story is full-body motion. On the CyberHost full-body test, OmniHuman 1.5 reports a Hand Keypoint Variance score of 72.113 compared with 47.561 for OmniHuman 1.0. In plainer terms, it moves hands and arms through a wider range. That lines up with the demos: the model is less about a face politely talking and more about a character performing.

The user study is also worth noting. In a best-choice comparison with several academic baselines, OmniHuman 1.5 was selected 33 percent of the time, ahead of OmniHuman 1.0 at 22 percent. Human preference studies are never the whole truth, but they are useful here because the difference is partly about whether motion feels context-aware. Objective lip-sync scores do not fully capture that.

Reception has followed the same pattern: people notice the gesture quality, the multi-person demos, the non-human examples, and the fact that text prompts can steer camera behavior and action. The caveat is that most public access still happens through hosted APIs or tool platforms rather than a downloadable research release.
