HDC

Humanoid Diffusion Controller

1 Beijing Institute for General Artificial Intelligence (BIGAI)
2 Joint Laboratory of Embodied AI and Humanoid Robots, BIGAI & UniTree Robotics
3 Beijing University of Posts and Telecommunications
4 Xi’an Jiaotong University
5 Tsinghua University
6 Beijing Institute of Technology
* Indicates Equal Contribution

Video

Abstract

We introduce the Humanoid Diffusion Controller (HDC), the first diffusion-based generative controller for real-time whole-body control of humanoid robots. Unlike conventional online reinforcement learning (RL) approaches, HDC learns from large-scale offline data and leverages a Diffusion Transformer to generate temporally coherent action sequences. This design provides high expressiveness, scalability, and temporal smoothness. To support training at scale, we propose an effective data collection pipeline and training recipe that avoids costly online rollouts while enabling robust deployment in both simulated and real-world environments. Extensive experiments demonstrate that HDC outperforms state-of-the-art online RL methods in motion tracking accuracy, behavioral quality, and generalization to unseen motions. These findings underscore the potential of large-scale generative modeling as a scalable and effective paradigm for generalizable, high-quality humanoid robot control.

Diverse Motions by One HDC Model

Walk fast
Chop downward (unseen)
Freestyle swim (unseen)
Arm-swing dance
Punch
Walk
Boat rowing
Dance
Full-body rotation

Long-horizon Motions with Smoothness

Play the violin (unseen)
Wipe the window (unseen)
Mixture of motions
Bird-like flying (unseen)
Mixture of motions (unseen)

High-dynamic Motions

Double baseball swing (unseen)
Swing a hoop

Precise Motion Tracking

Punch
Breaststroke swim (unseen)

Method

HDC training and inference. (a) During training, a future action sequence $\mathbf{A}_t$ and a past observation sequence $\mathbf{O}_t$ are sampled from offline data. Each action in $\mathbf{A}_t$ is perturbed with a different noise level, and the DiT model, conditioned on $\mathbf{O}_t$, learns to denoise the entire action sequence. (b) During inference, each element in the action buffer is initialized with linearly increasing noise. The denoiser iteratively denoises the buffer until the first action becomes clean; that action is then executed and removed from the buffer, a new purely noisy action is appended to the end, and the process repeats recursively.
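The per-action noising in (a) can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration assuming a DDPM-style schedule and a `denoiser(noisy, t, cond=...)` callable; the names `training_step`, `obs_seq`, and `act_seq` are ours for exposition, not taken from any released code.

```python
# Minimal sketch of training step (a): each action in A_t gets its own
# noise level, and the DiT denoises the whole sequence conditioned on O_t.
import torch
import torch.nn.functional as F

def training_step(denoiser, obs_seq, act_seq, alphas_cumprod):
    """One denoising-loss step with an independent noise level per action.

    obs_seq:        (B, P, obs_dim)  past observation sequence O_t
    act_seq:        (B, H, act_dim)  future action sequence A_t
    alphas_cumprod: (T,)             cumulative DDPM noise schedule (assumed)
    """
    B, H, _ = act_seq.shape
    T = alphas_cumprod.shape[0]
    # Draw a separate diffusion timestep for every action in the sequence,
    # so each element of A_t is perturbed with a different noise level.
    t = torch.randint(0, T, (B, H), device=act_seq.device)
    noise = torch.randn_like(act_seq)
    a_bar = alphas_cumprod[t].unsqueeze(-1)                    # (B, H, 1)
    noisy = a_bar.sqrt() * act_seq + (1.0 - a_bar).sqrt() * noise
    # The DiT, conditioned on past observations, predicts the noise for the
    # entire action sequence at once (hypothetical model signature).
    pred = denoiser(noisy, t, cond=obs_seq)
    return F.mse_loss(pred, noise)
```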
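The rolling-buffer inference in (b) can likewise be sketched as a control loop. This is a schematic under the assumptions that one denoising pass per control step advances every buffer slot one rung down the noise schedule and that the model exposes a `denoiser.denoise(buffer, levels, cond=...)` call; both the method name and the default sizes are placeholders.

```python
# Minimal sketch of inference (b): the buffer holds H actions at linearly
# increasing noise levels; each control step cleans the head, executes it,
# pops it, and appends a fresh pure-noise action at the tail.
import torch

@torch.no_grad()
def rolling_inference(denoiser, env, H=16, T=1000, act_dim=32, steps=500):
    # Fixed staircase of noise levels: nearly clean at the head of the
    # buffer, pure noise at the tail (sizes here are illustrative).
    levels = torch.linspace(0, T - 1, H).long()
    buffer = torch.randn(H, act_dim)
    obs_hist = env.reset()                      # past observations O_t
    for _ in range(steps):
        # One denoising pass per control step; afterwards the head action
        # has reached noise level 0 and is safe to execute.
        buffer = denoiser.denoise(buffer, levels, cond=obs_hist)
        action = buffer[0]
        obs_hist = env.step(action)             # execute on the robot / sim
        # Remove the executed action and append fresh noise, preserving the
        # staircase of noise levels; the process then repeats recursively.
        buffer = torch.cat([buffer[1:], torch.randn(1, act_dim)], dim=0)
```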

Quantitative Results

Comparison of HDC and baseline methods on motion tracking. HDC consistently outperforms the online RL-based baselines, demonstrating its capability to acquire high-dynamic control skills from offline data while remaining robust to dynamics uncertainty, observation noise, and compounding errors. Notably, HDC surpasses the oracle policy in success rate, highlighting its strong multi-skill learning ability. We attribute this advantage to the model's high capacity and its ability to represent multi-modal action distributions rather than collapsing to a single behavior mode.

Citation

@article{
  coming soon
}