Ovi

Twin backbone cross-modal fusion for audio-video generation

Chetwin Low^* Weimin Wang^{*, †} Calder Katyal

Character AI Yale University

^* Equal contribution, ^† Project Lead

Ovi 1.1 is now 10 seconds

Ovi 1.1 extends our original 5-second video+audio generation to 10 seconds, enabling richer storytelling, longer dialogue, and more expressive avatars.

The Last Stand Against Ovi

All clips were created by Ovi, using only text or text+image as inputs. Videos were resized down to 480p to save space. Please turn on the sound when watching.

Key Features

High-Quality Synchronized Audio

We pretrained our high-quality 5B audio branch from scratch, using an architecture that mirrors WAN 2.2 5B, along with our 1B fusion branch.

Data-Driven Lip-sync Learning

Achieving precise lip synchronization without explicit face bounding boxes, through pure data-driven learning

Multi-Person Dialogue Support

Naturally extending to realistic multiple speakers and multi-turn conversations, making complex dialogue scenarios possible

Contextual Sound Generation

Creating synchronized background music and sound effects that match visual actions

OSS Release to Expedite Research

We are excited to release our full pre-trained model weights and inference code to expedite video+audio generation research in the OSS community.

Videos below were upsampled for visual appeal using standard tools. Audio remains original.

Human-centric AV Generation from Text & Image (TI2AV)

Given a starting first frame and a text prompt, Ovi generates a high-quality video with audio.
All videos below have their first frames generated from an off-the-shelf imagen model.

A young woman with long, wavy blonde hair and light-colored eyes is shown in a medium shot against a blurred backdrop of lush green foliage. She wears a denim jacket over a striped top. Initially, her eyes are closed and her mouth is slightly open as she speaks, <S>Enjoy this moment<E>. Her eyes then slowly open, looking slightly upwards and to the right, as her expression shifts to one of thoughtful contemplation. She continues to speak, <S>No matter where it's taking<E>, her gaze then settling with a serious and focused look towards someone off-screen to her right.. <AUDCAP>Clear female voice, faint ambient outdoor sounds<ENDAUDCAP>

The video opens with a close-up of a woman with vibrant reddish-orange, shoulder-length hair and heavy dark eye makeup. She is wearing a dark brown leather jacket over a grey hooded top. She looks intently to her right, her mouth slightly agape, and her expression is serious and focused. The background shows a room with light green walls and dark wooden cabinets on the left, and a green plant on the right. She speaks, her voice clear and direct, saying, <S>doing<E>. She then pauses briefly, her gaze unwavering, and continues, <S>And I need you to trust them.<E>. Her mouth remains slightly open, indicating she is either about to speak more or has just finished a sentence, with a look of intense sincerity.. <AUDCAP>Tense, dramatic background music, clear female voice.<ENDAUDCAP>

A bearded man wearing large dark sunglasses and a blue patterned cardigan sits in a studio, actively speaking into a large, suspended microphone. He has headphones on and gestures with his hands, displaying rings on his fingers. Behind him, a wall is covered with red, textured sound-dampening foam on the left, and a white banner on the right features the "CHOICE FM" logo and various social media handles like "@ilovechoicefm" with "RALEIGH" below it. The man intently addresses the microphone, articulating, <S>is talent. It's all about authenticity. You gotta be who you really are, especially if you're working<E>. He leans forward slightly as he speaks, maintaining a serious expression behind his sunglasses.. <AUDCAP>Clear male voice speaking into a microphone, a low background hum.<ENDAUDCAP>

A young Black woman with dark, curly hair sits with her head bowed and bare shoulders, positioned to the right of the frame in what appears to be a dimly lit, hazy room. She looks downcast, appearing distressed or submissive. In the blurred background, a stern-faced woman with light hair in a dark uniform stands, looking directly towards the young woman. Another uniformed figure is faintly visible further back in the room. The uniformed woman in the foreground takes a deliberate step forward, her gaze remaining fixed on the young woman. She then speaks, her voice firm, <S>Come here.<E> The young woman remains still, her head still bowed, not immediately responding to the command.. <AUDCAP>Muffled clattering/rattling sounds, soft footsteps, indistinct background murmuring, a stern female voice.<ENDAUDCAP>

A man with a beard, wearing a patterned shirt, stands on the left, partially visible, looking towards a woman positioned slightly to the right of the frame. The woman, with dark hair fading to lighter ends and wearing a green and brown patterned top, initially looks down with a somber expression. She begins to speak, <S>Hope beats circuits every time.<E>. Her eyes appear to well up with tears as she slowly lifts her gaze slightly, maintaining a distressed look. She continues her statement, her voice tinged with sadness, <S>Humanity endures beyond your code.<E>. The man remains attentive, his focus entirely on the woman, as the scene holds on their interaction against a textured, light-colored wall background.. <AUDCAP>Female voice speaking with a distressed tone.<ENDAUDCAP>

A close-up shot shows a woman with her eyes closed, wearing makeup including red lipstick, as she begins to speak, <S>The truth<E>. Her right hand is slightly raised. She then opens her eyes, looking down and to the left, a slight smile playing on her lips, and continues, <S>is that<E>. A microphone becomes visible in the lower part of the frame. She continues to speak, her expression shifting slightly as she adds, <S>it's actually a really sad tale<E>. She looks down intently, her mouth slightly open, as if pausing or reflecting on her words. The background features a vibrant pink wall and a dark, reflective surface to her left.. <AUDCAP>Clear female speech, slight room reverberation.<ENDAUDCAP>

A zoomed in close-up shot of a man in a dark apron standing behind a cafe counter, leaning slightly on the polished surface. Across from him in the same frame, a woman in a beige coat holds a paper cup with both hands, her expression playful. The woman says <S>You always give me extra foam.<E> The man smirks, tilting his head toward the cup. The man says <S>That’s how I bribe loyal customers.<E> Warm cafe lights reflect softly on the counter between them as the background remains blurred.. <AUDCAP>Female and male voices speaking English casually, faint hiss of a milk steamer, cups clinking, low background chatter.<ENDAUDCAP>

A Black man with a short beard and dark hair stands center stage, illuminated by stage lights against a rippled royal blue curtain backdrop. He wears a white denim jacket over a mustard yellow t-shirt and holds a silver microphone in his right hand. He begins speaking, looking towards his left with a slight frown, <S>to<E> <S>cold like a month ago<E>. He then turns his gaze slightly right, his expression becoming more animated, his mouth open as if exclaiming. He shifts his weight, gesturing subtly with his microphone hand while continuing to speak, his head nodding slightly. He turns back to the left, his eyes wide and mouth open again, before stating with a direct look, <S>It's cold now, okay?<E> He then looks to his right, and his mouth briefly opens as if to say more, <S>I could<E>.. <AUDCAP>Male speaking voice, sound of audience laughter.<ENDAUDCAP>

The video opens with a medium shot of an older man with light brown, slightly disheveled hair, wearing a dark blazer over a grey t-shirt. He sits in front of a theatrical backdrop depicting a large, classic black and white passenger ship named "GLORIA" docked in a harbor, framed by red stage curtains on either side. The lighting is soft and even. As he speaks, he gestures expressively with both hands, often raising them and then bringing them down, or making a fist. His facial expression is animated and engaged, with a slight furrow in his brow as he explains. He begins by saying, <S>to help them through the grimness of daily life.<E> He then raises his hands again, gesturing outward, and continues speaking in a different language, <S>Da brauchst du natürlich Fantasiebilder.<E> His gaze is directed slightly off-camera as he conveys his thoughts.. <AUDCAP>Male voice speaking clearly and conversationally.<ENDAUDCAP>

A close-up shot focuses on a woman's face, positioned to the right of the frame, bathed in dim, almost dark lighting, with a faint blue/purple glow emanating from the left side. Her dark hair is visible, and her face expresses deep sadness, her eyes appearing moist and her brow furrowed as if struggling with emotion. She begins to speak, her voice trembling slightly, <S>No puedo olvidar aquella noche.<E> Her expression intensifies, tears welling in her eyes as she continues, her voice breaking slightly, <S>La recuerdo todo<E>, before the scene cuts off. The somber atmosphere is conveyed through her distressed demeanor and the low lighting.. <AUDCAP>Soft, melancholic background music, a female voice speaking emotionally, close to tears.<ENDAUDCAP>

A man with dark hair, wearing a dark top, is shown in a close-up, illuminated by an intense, deep red and pink light that casts a strong hue over the entire scene. His head is slightly tilted down, and his eyes are initially looking downwards. In the blurred background, another figure, possibly a child, is faintly visible to the left. The man's eyes briefly close, then open, and he looks slightly to his right, his expression appearing contemplative. He then articulates, <S>Who the hell is Mark?<E> His gaze remains fixed as he asks the question, his brow slightly furrowed.. <AUDCAP>A continuous low-frequency hum, a very faint, almost inaudible high-pitched tone just before the dialogue, a man's voice.<ENDAUDCAP>

A close-up shot features an East Asian man with dark, dishevelled hair and a short beard or stubble, his brow furrowed in intense concentration. He wears a light grey or blue bomber jacket over a white collared shirt. His eyes are wide open, fixed on something below or in front of him, and his mouth is slightly agape. <S>제가<E> he states, his voice low and strained. He blinks slowly, his eyes closing for a moment before reopening with an even more intense, pained expression. The arm of another person, clad in a dark sleeve, is visible behind his left shoulder, seeming to apply pressure. He continues, <S>과거에 과장님께 뭔가<E> as a loud, high-pitched ringing sound begins and persists, coinciding with his strained utterance.. <AUDCAP>Faint ambient hum, high-pitched continuous ringing sound.<ENDAUDCAP>
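
The example prompts above wrap spoken lines in `<S>…<E>` and put a description of the soundtrack in `<AUDCAP>…<ENDAUDCAP>`. The following is a minimal sketch, not part of the Ovi codebase, of how such a prompt could be split into its parts for inspection or validation. The tag names come from the prompts on this page; the `ParsedPrompt` class, the `parse_prompt` function, and the parsing rules are assumptions made for illustration, and the sample string is abridged from the cafe prompt above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Tag names are copied from the example prompts on this page. The parsing
# rules are an assumption for illustration, not Ovi's own preprocessing.
SPEECH_RE = re.compile(r"<S>(.*?)<E>", re.DOTALL)
AUDCAP_RE = re.compile(r"<AUDCAP>(.*?)<ENDAUDCAP>", re.DOTALL)


@dataclass
class ParsedPrompt:
    visual_text: str               # prompt with speech and audio caption removed
    speech: List[str]              # spoken lines, in order of appearance
    audio_caption: Optional[str]   # None when no <AUDCAP> block is present


def parse_prompt(prompt: str) -> ParsedPrompt:
    """Split a tagged prompt into visual text, speech lines, and audio caption."""
    speech = [s.strip() for s in SPEECH_RE.findall(prompt)]
    match = AUDCAP_RE.search(prompt)
    audio_caption = match.group(1).strip() if match else None
    # Remove the tagged spans, then collapse the leftover whitespace.
    visual = SPEECH_RE.sub("", AUDCAP_RE.sub("", prompt))
    visual = re.sub(r"\s+", " ", visual).strip()
    return ParsedPrompt(visual, speech, audio_caption)


# Abridged from the cafe example above.
example = (
    "The woman says <S>You always give me extra foam.<E> The man smirks. "
    "The man says <S>That's how I bribe loyal customers.<E> "
    "<AUDCAP>Female and male voices, cups clinking.<ENDAUDCAP>"
)
parsed = parse_prompt(example)
print(parsed.speech)
print(parsed.audio_caption)
```

A check like this can catch unbalanced tags before a generation run: if a `<S>` has no closing `<E>`, the spoken line will be missing from `parsed.speech`.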

Human-centric AV Generation from Text (T2AV)

Given only a text prompt, Ovi generates a high-quality video with audio.
Videos generated include large motion ranges, multi-person conversations, and diverse emotions.

Multi-Person AV Generation from Text or Image (TI2AV)

Given a text prompt with an optional starting image, Ovi generates a video with multi-person dialogue.
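
Judging from the example prompts on this page, a multi-speaker prompt names each speaker in plain text, wraps each spoken line in `<S>…<E>`, and ends with an `<AUDCAP>…<ENDAUDCAP>` audio description. The sketch below assembles a prompt in that shape. The `build_dialogue_prompt` helper, its parameters, and the "X says" phrasing are hypothetical conveniences, not part of the Ovi release, and the exact format the model prefers should be checked against the official inference code.

```python
from typing import List, Tuple


def build_dialogue_prompt(
    scene: str,
    turns: List[Tuple[str, str]],
    audio_caption: str,
) -> str:
    """Assemble a prompt from a scene description, (speaker, line) turns,
    and an audio caption, using the tags seen in this page's examples.

    Hypothetical helper: the layout is inferred from the examples only.
    """
    if not turns:
        raise ValueError("at least one dialogue turn is required")
    parts = [scene.strip()]
    for speaker, line in turns:
        if "<" in line or ">" in line:
            # Angle brackets inside a spoken line would clash with the tags.
            raise ValueError(f"spoken line must not contain tags: {line!r}")
        parts.append(f"{speaker.strip()} says <S>{line.strip()}<E>")
    parts.append(f"<AUDCAP>{audio_caption.strip()}<ENDAUDCAP>")
    return " ".join(parts)


prompt = build_dialogue_prompt(
    scene="Two people talk across a cafe counter.",
    turns=[
        ("The woman", "You always give me extra foam."),
        ("The man", "That's how I bribe loyal customers."),
    ],
    audio_caption="Female and male voices, cups clinking.",
)
print(prompt)
```

Keeping the speaker label outside the speech tags, as the examples do, leaves the tagged span as the words to be spoken and nothing else.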

Sound Effect (SFX) AV Generation from Text with or without Image (TI2AV or T2AV)

Given a text prompt with an optional starting image, Ovi generates a video with high-quality sound effects.

Musical Instrument AV Generation from Text with or without Image (TI2AV or T2AV)

Given a text prompt with an optional starting image, Ovi generates a video with music.

Ovi Team

Project and Team Leader Weimin Wang

Modeling and designs Chetwin Low, Weimin Wang

Codebase for training and data pipeline Chetwin Low, Weimin Wang, Calder Katyal

Data Calder Katyal, Yi Cui, Diego De La Torre

Limitations

All models have limits, including Ovi

Video branch constraints. Visual quality inherits from the pretrained WAN 2.2 5B ti2v backbone.
Speed/memory vs. fine detail. The 11B parameter model (5B visual + 5B audio + 1B fusion) and high spatial compression rate balance inference speed and memory, limiting extremely fine-grained details, tiny objects, or intricate textures in complex scenes.
Human-centric bias. Data skews toward human-centric content, so Ovi performs best on human-focused scenarios. The audio branch enables highly emotional, dramatic short clips within this focus.
Pretraining-only stage. Without extensive post-training or RL stages, outputs vary more between runs. Tip: try multiple random seeds for better results.
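
To make the speed/memory point concrete, the sketch below adds up the branch sizes quoted above (5B visual + 5B audio + 1B fusion) and estimates the memory needed for the weights alone. The numeric precisions are assumptions, since this page does not state which one the released checkpoints use. The estimate also ignores activations and any auxiliary components such as text encoders or VAEs, so real usage will be higher.

```python
# Branch sizes as quoted in the limitations above, in billions of parameters.
PARAMS_BILLIONS = {"visual": 5, "audio": 5, "fusion": 1}

# Bytes per parameter for common precisions. Which precision Ovi ships in
# is NOT stated on this page; these are illustrative assumptions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}


def total_params_billions() -> int:
    """Total parameter count across the three branches, in billions."""
    return sum(PARAMS_BILLIONS.values())


def weight_memory_gb(dtype: str) -> float:
    """Approximate weight memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    n_params = total_params_billions() * 1_000_000_000
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


print(f"total parameters: {total_params_billions()}B")
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: about {weight_memory_gb(dtype):.0f} GB for weights only")
```

Under these assumptions the weights alone would take about 22 GB at 2 bytes per parameter.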

Ethical Considerations

The reference images are sourced from the public domain or generated by AI models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (weiminwang@character.ai) and we will delete them.

BibTeX

@misc{low2025ovitwinbackbonecrossmodal,
  title={Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation},
  author={Chetwin Low and Weimin Wang and Calder Katyal},
  year={2025},
  eprint={2510.01284},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2510.01284},
}