TalkingMachines

Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models
Chetwin Low* Weimin Wang*
Character.AI
*Equal contribution

We introduce the first framework to transform a large-scale video foundation model (18B parameters) into a real-time streaming system for audio-driven avatar animation. The avatar toggles between speaking and listening modes over an unbounded number of conversational turns, enabling immersive, interactive FaceTime-style experiences from input images of diverse styles.

All videos are recorded live.

1. We leverage a powerful pretrained image-to-video generative model, adapting it into an audio-driven avatar animator that generalizes seamlessly across diverse image styles.
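
A minimal sketch of one way such an adaptation can look, assuming a DiT-style backbone: the pretrained block is kept intact and a zero-initialized audio cross-attention branch is added, so training starts from the original image-to-video behavior. All module names, dimensions, and the block's call signature are illustrative assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class AudioConditionedBlock(nn.Module):
    """Wraps a pretrained video DiT block and injects audio via cross-attention."""

    def __init__(self, pretrained_block: nn.Module, dim: int = 3072, audio_dim: int = 768):
        super().__init__()
        self.block = pretrained_block                 # pretrained image-to-video block
        self.norm = nn.LayerNorm(dim)
        self.audio_proj = nn.Linear(audio_dim, dim)   # map audio features to model width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=16, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)          # zero-init: audio branch starts as a no-op
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x, t, audio_emb):
        x = self.block(x, t)                          # original video pathway (assumed signature)
        a = self.audio_proj(audio_emb)                # (B, T_audio, dim)
        h, _ = self.cross_attn(self.norm(x), a, a)    # video tokens attend to audio tokens
        return x + self.out_proj(h)                   # residual audio injection
```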

2. We employ Asymmetric Distribution Matching Distillation with a bidirectional teacher to compress the model into a causal, sparse-attention architecture, enabling real-time streaming for FaceTime-style applications. A simplified training sketch follows the diagram below.

DMD2 Training Diagram
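
As a rough illustration of the idea, the sketch below shows a generic DMD-style generator update: a frozen bidirectional teacher pulls re-noised student samples toward the data distribution, a trainable critic tracks the student's own distribution, and their difference drives the causal student. Function names and helpers (`add_noise`, `sample_t`) are assumptions; this is not the exact asymmetric recipe described above.

```python
import torch
import torch.nn.functional as F

def dmd_generator_loss(student, teacher, critic, cond, noise, add_noise, sample_t):
    x_gen = student(noise, cond)                  # few-step generation by the causal student
    t = sample_t(x_gen.shape[0])                  # random diffusion timestep
    x_t = add_noise(x_gen, t)                     # re-noise the student's output
    with torch.no_grad():
        x0_real = teacher(x_t, t, cond)           # bidirectional teacher: toward the data distribution
        x0_fake = critic(x_t, t, cond)            # critic: toward the current student distribution
        grad = x0_fake - x0_real                  # distribution-matching direction
        grad = grad / (grad.abs().mean() + 1e-8)  # simple per-batch normalization
    # MSE against a detached target whose gradient w.r.t. x_gen equals `grad`
    return 0.5 * F.mse_loss(x_gen, (x_gen - grad).detach())
```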

3. We introduce system-level optimizations that run the score network and the VAE on separate CUDA streams, allowing our 18B-parameter video generation model to operate in real time. A simplified sketch follows the diagram below.

Runtime Analysis Diagram
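
A minimal PyTorch sketch of the idea, assuming a chunked streaming pipeline: while the score network denoises chunk N+1 on one stream, the VAE decodes chunk N's latents on another, so decoding no longer blocks generation. `denoise_chunk`, `vae`, `send_to_client`, and `audio_chunks` are placeholder names, not the production interfaces.

```python
import torch

dit_stream = torch.cuda.Stream()   # score network (DiT) work
vae_stream = torch.cuda.Stream()   # VAE decoding work

prev_latents = None
for chunk in audio_chunks:                         # audio-conditioned latent chunks
    with torch.cuda.stream(dit_stream):
        latents = denoise_chunk(chunk)             # denoising steps for the next chunk

    if prev_latents is not None:
        with torch.cuda.stream(vae_stream):
            frames = vae.decode(prev_latents)      # overlaps with the DiT work above
        vae_stream.synchronize()
        send_to_client(frames)                     # e.g. hand frames to the streaming layer

    dit_stream.synchronize()                       # latents are ready for the next iteration
    prev_latents = latents
```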

4. Our model integrates seamlessly with mainstream audio large language models and WebRTC streaming services (LiveKit), enabling real-time, interactive FaceTime-style experiences across desktop and mobile platforms through a unified framework.
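
As a hypothetical illustration of the glue layer (not the actual integration code), the loop below consumes streamed audio from a speech LLM or TTS service, generates avatar frames, and hands them to a WebRTC publisher such as a LiveKit video track. `audio_source`, `generator`, and `publisher` are all assumed interfaces.

```python
import asyncio

async def facetime_loop(audio_source, generator, publisher, fps: int = 25):
    frame_interval = 1.0 / fps
    # audio_source yields (audio_chunk, is_speaking) as the conversation alternates turns
    async for audio_chunk, is_speaking in audio_source:
        frames = generator.generate(audio_chunk, speaking=is_speaking)
        for frame in frames:
            await publisher.publish_frame(frame)    # e.g. wrapping a LiveKit video source under the hood
            await asyncio.sleep(frame_interval)     # pace output to the target frame rate
```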