We introduce the first framework to transform a large-scale video foundation model (18B parameters) into a real-time streaming system for audio-driven avatar animation. The avatar toggles between speaking and listening modes across unlimited conversational turns, enabling immersive, interactive FaceTime-style experiences from input images of diverse styles.
All videos are recorded live. All frames (in both `talking` and `listening` modes) are generated in real time.
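For intuition, the turn-taking behavior can be sketched as below. This is a minimal illustration, not the released implementation: `Mode`, `generate_frame`, and `render` are hypothetical names standing in for the real-time video generator and the display/encoder sink, and the chunk protocol is assumed.

```python
from enum import Enum
from typing import Iterable, Optional

class Mode(Enum):
    TALKING = "talking"      # avatar speaks, driven by incoming audio
    LISTENING = "listening"  # avatar idles/reacts during silence

def stream_avatar(audio_chunks: Iterable[Optional[bytes]], generate_frame, render):
    """Toggle between talking and listening per turn while streaming frames.

    `generate_frame(mode, chunk)` stands in for the real-time generator;
    `render(frame)` stands in for the display sink. A `None` chunk marks
    silence, i.e. the listening side of a turn.
    """
    for chunk in audio_chunks:
        # One toggle per conversational turn: talk while audio is present,
        # listen during silence, for as many turns as the stream provides.
        mode = Mode.TALKING if chunk is not None else Mode.LISTENING
        render(generate_frame(mode, chunk))

# Usage sketch: two turns of audio followed by silence.
if __name__ == "__main__":
    chunks = [b"hello", b"world", None, None, b"again", None]
    stream_avatar(chunks, lambda m, c: (m.value, c), print)
```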