MotionLCM-V2: Improved Compression Rate for Multi-Latent-Token Diffusion
Contributors: Wenxun Dai, Ling-Hao Chen, Yufei Huo, Jingbo Wang, Jinpeng Liu, Bo Dai, Yansong Tang
Expected reading time: 5-10 minutes.
It has been nearly seven months since MotionLCM-V1 was released (May 1, 2024). Today, we release MotionLCM-V2, a text-to-motion model that achieves state-of-the-art motion generation quality, motion-text alignment, and inference speed. We are also releasing MLD++, which brings a substantial performance improvement over the original MLD. Our code is available at https://github.com/Dai-Wenxun/MotionLCM.
Figure 1: Comparison of inference time costs on HumanML3D.
The essence of MotionLCM is to accelerate inference through latent consistency distillation from its teacher model, MLD. The effectiveness of MotionLCM is therefore ultimately bounded by the generation capability of the teacher. As a result, the key challenge in upgrading MotionLCM is improving the generation performance of MLD. To achieve this goal, we conducted TWO key explorations of the motion latent diffusion framework.
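For context, latent consistency distillation trains the student to map adjacent points on the teacher's probability-flow ODE trajectory to the same solution. Written in the generic LCM form (the exact weighting and solver details used in MotionLCM may differ), the objective is

$$
\mathcal{L}_{\mathrm{LCD}} = \mathbb{E}\left[\, d\!\left( f_\theta(z_{t_{n+1}}, t_{n+1}, c),\; f_{\theta^{-}}(\hat{z}^{\,\phi}_{t_n}, t_n, c) \right) \right],
$$

where $f_\theta$ is the student consistency function, $f_{\theta^{-}}$ its EMA target, $\hat{z}^{\,\phi}_{t_n}$ is estimated from $z_{t_{n+1}}$ by one step of the teacher's ODE solver $\phi$ (the frozen MLD), $c$ is the text condition, and $d(\cdot,\cdot)$ is a distance metric.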
1. Eliminating structural defects in the original network architecture
In the denoising transformer, the original MLD architecture leverages stacked transformer encoder layers with a skip-connection structure to improve modeling capability. Its self-attention module incorporates three distinct token types: (1) the latent tokens derived from the VAE encoder, (2) the sentence-level text feature, and (3) the diffusion timestep embedding. Within the network, we identified two structural defects:
(i) Unlike the other tokens, the VAE latent tokens are DIRECTLY fed into the self-attention module without passing through a learnable linear layer. The reason is that the VAE latent tokens have a feature dimension of 256, which matches the hidden dimension of the self-attention module, so the dimensional adjustment is omitted. However, this bypass means the VAE latent tokens are never modulated to better handle the multimodal signals in the self-attention, potentially making their integration into the model less effective.
(ii) The text feature passes through a ReLU activation function FIRST, before the learnable linear layer. The ReLU suppresses negative values, discarding valuable textual information encoded in these negative components.
To rectify these structural flaws, we introduce two operations, sketched in code after the list below.
- Op1: introduces a trainable linear layer for the VAE latent tokens to enhance multimodal signal modulation.
- Op2: removes the unnecessary ReLU activation function to preserve negative components in the text feature.
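A minimal PyTorch sketch of the denoiser's input preparation with both fixes applied (module names and dimensions here are ours for illustration, not the identifiers in the released code):

```python
import torch
import torch.nn as nn

class DenoiserTokenEmbedding(nn.Module):
    """Prepares the three token types fed to the denoiser's self-attention."""

    def __init__(self, latent_dim=256, text_dim=768, hidden_dim=256):
        super().__init__()
        # Op1: a trainable linear layer for the VAE latent tokens. The vanilla MLD
        # skips this because latent_dim == hidden_dim and feeds the tokens in directly.
        self.latent_proj = nn.Linear(latent_dim, hidden_dim)
        # Op2: project the text feature WITHOUT the preceding ReLU, so the
        # negative components of the text embedding are preserved.
        self.text_proj = nn.Linear(text_dim, hidden_dim)

    def forward(self, latents, text_feat, time_emb):
        # latents: (B, n, latent_dim); text_feat: (B, text_dim); time_emb: (B, hidden_dim)
        latent_tokens = self.latent_proj(latents)            # Op1
        text_token = self.text_proj(text_feat).unsqueeze(1)  # Op2 (no ReLU)
        time_token = time_emb.unsqueeze(1)
        # All tokens are concatenated and fed to the stacked transformer encoder
        # layers with skip connections, which remain unchanged.
        return torch.cat([time_token, text_token, latent_tokens], dim=1)
```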
Figure 2: Impact of Op1 and Op2 on MLD performance.
As shown in Figure 2, Op1 significantly improves both motion generation quality (FID) and motion-text alignment capability (R-Precision Top1), while Op2 demonstrates that preserving the negative information filtered out by the ReLU activation is essential for enhancing text alignment (we observe that activation functions such as SiLU, which preserve negative values, achieve the same effect). We obtained these results using the VAE checkpoint provided by the authors of MLD and trained the MLD models with our custom training settings. These two simple yet effective operations have a significant impact on the generation performance of MLD and are adopted in our subsequent exploratory experiments.
2. Enabling multi-latent-token learning for high-performance diffusion
The success of the motion latent diffusion paradigm fundamentally relies on the perceptual compression achieved by the first-stage VAE, i.e., removing high-frequency motion details while preserving essential semantic information. This enables the second-stage MLD to focus on learning the semantic and conceptual composition of the motion data, i.e., semantic compression. This principle has also been validated in Stable Diffusion.
Therefore, the key to improving the motion generation performance of MLD lies in obtaining an optimal latent space. Specifically, the latent space must carefully balance a suitable compression rate (i.e., the size of the latent space) with the preservation of critical semantic information, allowing the MLD to exploit semantically rich latent representations for high-quality motion generation.
Figure 3: Overview of VAE encoder and diffusion latent sampling.
As shown in Figure 3, in the vanilla MLD, the VAE encoder performs motion compression using the learnable Gaussian distribution parameters (i.e., μ and σ) to fuse the projected pose features. It then samples latent tokens from the Gaussian distribution for the next stage of latent diffusion. The hidden dimension of the VAE encoder is denoted as d and the number of latent tokens as n, which leads to a final compression rate (latent size) of n×d.
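A simplified sketch of this vanilla encoding-and-sampling step (our own paraphrase with illustrative shapes; pose_dim=263 corresponds to the HumanML3D feature representation, and the released code may differ in details):

```python
import torch
import torch.nn as nn

class VanillaMotionVAEEncoder(nn.Module):
    """Vanilla MLD-style encoder sketch: learnable distribution tokens are fused
    with the projected pose features, then n latent tokens of width d are sampled."""

    def __init__(self, pose_dim=263, d=256, n_tokens=1, n_layers=4, n_heads=4):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d)
        # One learnable (mu, sigma) query pair per latent token.
        self.dist_tokens = nn.Parameter(torch.randn(2 * n_tokens, d))
        layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.n_tokens = n_tokens

    def forward(self, poses):
        # poses: (B, T, pose_dim), e.g. T = 196 frames of HumanML3D features.
        x = self.pose_proj(poses)
        dist = self.dist_tokens.unsqueeze(0).expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([dist, x], dim=1))[:, : 2 * self.n_tokens]
        mu, logvar = out.chunk(2, dim=1)       # each (B, n_tokens, d)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)   # reparameterization trick
        # The latent size (compression rate) is fixed at n_tokens x d, e.g. 1x256.
        return z, mu, logvar
```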
Figure 4: Comparison of FID scores between MLD and VAE models across different VAE latent sizes.
However, as shown in Figure 4, in the original MLD, increasing the number of latent tokens improves motion reconstruction precision but leads to an unstable decline in motion generation capability, indicated by the green dashed line. We attribute this to the uncontrolled compression rate: increasing the number of latent tokens directly enlarges the latent space (i.e., 1x256→10x256) and thus continuously lowers the compression rate, because the original MLD samples latent tokens directly from the Gaussian distribution encoded by the VAE encoder. This uncontrolled compression leaves most of the perceptual compression to the diffusion model, hindering its ability to generate high-quality motions. As a result, the original MLD is limited to single-latent-token learning (i.e., 1x256) for diffusion training, leading to a lower upper bound on generation performance compared to using VAEs with multi-latent tokens (e.g., 2x256). Therefore, our research focuses on how to enable multi-latent-token learning for high-performance diffusion.
Figure 5: Method overview of B2A-HDM.
As a compromise, B2A-HDM adopts a hierarchical diffusion model to bypass this challenge, avoiding reliance on a single well-structured latent space. Specifically, B2A-HDM first utilizes an MLD with a small VAE latent size (i.e., LD-LS) to generate an intermediate denoised latent, which is decoded into a motion sequence aligned with the textual description. It then leverages another VAE with a large latent size (i.e., HD-LS) to encode the motion into a high-dimensional latent space and employs multiple denoisers to perform staged denoising, resulting in high-quality, detail-rich motion generation. However, the decoding and encoding between the low-dimensional and high-dimensional latent spaces inevitably accumulate errors, while the multi-denoiser framework adds complexity to both the training and inference stages.
Our solution is more straightforward. As shown in Figure 3(b), in MLD++, we add a linear layer as a latent adapter that maps the embedded distribution parameters from dimension d to a target dimension d′, directly controlling the size of the latent space (i.e., n×d′). This elegant design enables us to harness the strong compression capability of multi-latent tokens while maintaining control over the compression rate, thereby providing a more compact latent space for the subsequent diffusion stage. We employ the latent adapter in MLD++ except when the target latent token dimension already matches the encoder's hidden dimension (e.g., the 1x256 and 2x256 settings).
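A minimal sketch of the latent adapter under our assumptions (a single linear layer shared by both distribution parameters; the released MLD++ code may differ in details):

```python
import torch
import torch.nn as nn

class LatentAdapter(nn.Module):
    """MLD++ sketch: map the encoder's distribution parameters from width d to a
    target width d_prime, so the latent size n x d_prime is directly controllable."""

    def __init__(self, d=256, d_prime=32):
        super().__init__()
        self.proj = nn.Linear(d, d_prime)

    def forward(self, mu, logvar):
        # mu, logvar: (B, n_tokens, d) -> (B, n_tokens, d_prime)
        mu, logvar = self.proj(mu), self.proj(logvar)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)
        return z, mu, logvar

# Example: 16 latent tokens of width 32 give the 16x32 latent space used to
# distill MotionLCM-V2 (the same total size as 2x256).
adapter = LatentAdapter(d=256, d_prime=32)
mu, logvar = torch.randn(4, 16, 256), torch.randn(4, 16, 256)
z, _, _ = adapter(mu, logvar)
print(z.shape)  # torch.Size([4, 16, 32])
```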
Figure 6: VAE and MLD++ FID score curves under different compression rates.
As shown in Figure 6, under the same compression rate, increasing the number of latent tokens improves the motion reconstruction precision of the VAE and, as expected, also enhances the generation performance of MLD++. By controlling the compression rate with the latent adapter, we enable multi-latent-token learning for high-performance diffusion.
Table 1: VAE and MLD++ performance for different latent sizes (i.e., compression rates).
As shown in Table 1, increasing the number of latent tokens slightly increases the inference time (AITS), but it remains within an acceptable range (0.2s~0.3s). Moreover, increasing the number of latent tokens introduces fluctuations in motion-text matching performance. Considering both the motion generation quality (FID) and the motion-text alignment capability (R-Precision Top1) of MLD++, we select the best-FID checkpoint of MLD++ (16x32) to distill our MotionLCM-V2.
Table 2: Comparison of text-conditional motion synthesis on HumanML3D dataset.
As shown in Table 2, unlike B2A-HDM, which relies on an overcomplicated multi-denoiser framework, MLD++ surpasses B2A-HDM by a large margin while using only a single denoiser. Thanks to the powerful MLD++, the distillation performance of MotionLCM-V2 is also significantly improved over MotionLCM-V1, further advancing the state of the art in text-to-motion generation by excelling in inference speed, motion generation quality, and text alignment capability.
Citation
@inproceedings{motionlcm,
title={Motionlcm: Real-time controllable motion generation via latent consistency model},
author={Dai, Wenxun and Chen, Ling-Hao and Wang, Jingbo and Liu, Jinpeng and Dai, Bo and Tang, Yansong},
booktitle={ECCV},
pages={390--408},
year={2025}
}
@misc{motionlcm-v2,
title={Motionlcm-v2: Improved compression rate for multi-latent-token diffusion},
url={https://huggingface.co/blog/wxDai/motionlcm-v2},
author={Dai, Wenxun and Chen, Ling-Hao and Huo, Yufei and Wang, Jingbo and Liu, Jinpeng and Dai, Bo and Tang, Yansong},
month={December},
year={2024}
}