Ovi

Twin backbone cross-modal fusion for audio-video generation

Character AI Yale University

* Equal contribution, Project Lead

Research Paper GitHub Huggingface Weights

The Last Stand Against Ovi

All clips were generated by Ovi from text-only or text-plus-image inputs. The video was downscaled to 480p to save space. Please turn on the sound when watching.

Key Features

High-Quality Synchronized Audio

Generating videos with high-quality audio that perfectly matches character identity, gender, emotions, pauses, and context

Data-Driven Lip-sync Learning

Achieving precise lip synchronization without explicit face bounding boxes, through pure data-driven learning

Multi-Person Dialogue Support

Extending naturally to realistic multi-speaker, multi-turn conversations, making complex dialogue scenarios possible

Contextual Sound Generation

Creating synchronized background music and sound effects that match visual actions

OSS Release to Expedite Research

We are excited to release our full pre-trained model weights and inference code to expedite video+audio generation research in the open-source community.

The videos below were upsampled for visual appeal using standard tools. The audio is unchanged.
Human-centric AV Generation from Text (T2AV)
Given a text prompt, Ovi generates a high-quality video with audio.
Human-centric AV Generation from Text & Image (TI2AV)
Given a first frame and a text prompt, Ovi generates a high-quality video with audio.
The first frames of all videos below were generated by an image generation model.
Multi-Person AV Generation from Text or Image (TI2AV)
Given a text prompt with an optional starting image, Ovi generates a video with multi-person dialogue.
Sound Effect (SFX) AV Generation from Text with or without Image (TI2AV or T2AV)
Given a text prompt with an optional starting image, Ovi generates a video with high-quality sound effects.
Music AV Generation from Text with or without Image (TI2AV or T2AV)
Given a text prompt with an optional starting image, Ovi generates a video with music.
Comparison with Veo3
We compare Ovi with Veo3, a state-of-the-art text-to-video model that can generate videos with sound. Please enable audio. Hover over a video to reveal its text prompt.

Veo3

Ovi (Ours)

Ovi Team

Project and Team Leader: Weimin Wang

Modeling and design: Chetwin Low, Weimin Wang

Codebase for training and data pipeline: Chetwin Low, Weimin Wang, Calder Katyal

Data: Calder Katyal, Yi Cui, Diego De La Torre, Manav Shah

Limitations
All models have limits, including Ovi
Ethical Considerations

The reference images are sourced from the public domain or generated by AI models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (weiminwang@character.ai) and we will remove them.
