RigMo: Unifying Rig and Motion Learning for Generative Animation

Hao Zhang1,2 Jiahao Luo1,3 Bohui Wan2 Yizhou Zhao1,4 Zongrui Li5 Michael Vasilkovsky1 Chaoyang Wang1 Jian Wang1 Narendra Ahuja2 Bing Zhou1
1Snap Inc.    2UIUC    3UCSC    4CMU    5NTU
arXiv 2026
arXiv · Code (Coming Soon) · Data (Coming Soon)
TL;DR: Rigging and motion should not be learned in isolation!

RigMo is the first generative framework that discovers both rig structure and motion dynamics directly from raw mesh sequences without ground-truth rigs, skeletons, or per-sequence optimization.

By factorizing deformation into explicit Gaussian bones and structure-aware motion, RigMo turns arbitrary deforming meshes into fully animatable assets: feed-forward, interpretable, and scalable across categories.
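To make the factorization concrete, here is a minimal sketch (ours, not the paper's released code) of how explicit Gaussian bones can drive per-vertex deformation: each bone is a 3D Gaussian whose density at a vertex yields a skinning weight, and per-bone rigid transforms are blended linearly. All names here (gaussian_skinning_weights, lbs_deform, inv_cov) are illustrative assumptions, and the paper's exact parameterization may differ.

import torch

def gaussian_skinning_weights(verts, mu, inv_cov):
    # verts: (V, 3) rest-pose vertices; mu: (B, 3) Gaussian bone centers;
    # inv_cov: (B, 3, 3) inverse covariances encoding bone extent/orientation.
    d = verts[:, None, :] - mu[None, :, :]                  # (V, B, 3)
    maha = torch.einsum('vbi,bij,vbj->vb', d, inv_cov, d)   # squared Mahalanobis distance
    return torch.softmax(-0.5 * maha, dim=-1)               # (V, B), rows sum to 1

def lbs_deform(verts, weights, R, t):
    # Linear blend skinning: R (B, 3, 3) rotations, t (B, 3) translations per bone.
    per_bone = torch.einsum('bij,vj->vbi', R, verts) + t[None]  # each bone's rigid motion
    return (weights[..., None] * per_bone).sum(dim=1)           # (V, 3) deformed vertices

Because the bones are explicit Gaussians, both the influence regions and the per-frame transforms remain interpretable and editable after inference.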

RigMo Framework
Figure 1. RigMo infers Gaussian bones, skinning weights, and motion parameters directly from mesh sequences. Colors visualize bone influence on vertices.

Abstract

Despite significant progress in 4D generation, rig and motion—the core structural and dynamic components of animation—are typically modeled as separate problems. Existing pipelines rely on ground-truth skeletons and skinning weights for motion generation and treat auto-rigging as an independent process.

RigMo is a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without any human-provided rig annotations. By encoding per-vertex deformations into compact latent spaces, RigMo decodes explicit Gaussian bones and time-varying transformations to create fully animatable meshes.

Experiments on DeformingThings4D, Objaverse-XL, and TrueBones demonstrate that RigMo learns smooth, interpretable, and physically plausible rigs while achieving superior reconstruction and generalization.

Method

Method Overview
Figure 2. Overview of the RigMo-VAE framework. Given temporal vertex trajectories from deforming mesh sequences, RigMo employs a dual-path encoder to disentangle static geometry (rigging branch) and dynamic motion (motion branch), learning a compact latent representation that captures both spatial structure and temporal dynamics. The decoder maps these latent features to physically interpretable rig components: Gaussian bone descriptors defining geodesic-aware skinning weights and variational motion parameters for local and root transformations. Different colors indicate the influence regions of learned Gaussian bones, demonstrating semantically meaningful decomposition of mesh deformation without manual rigging supervision.
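The "geodesic-aware skinning weights" in the caption admit a simple reading, sketched below under our own assumptions: gate each Gaussian bone's Euclidean influence by geodesic distance along the surface, so that weights do not leak between nearby but topologically distant parts (e.g., two legs of the same character). Here geo_dist and sigma are hypothetical inputs, not the paper's actual construction.

import torch

def geodesic_aware_weights(verts, mu, inv_cov, geo_dist, sigma=0.2):
    # geo_dist: (V, B) precomputed geodesic distance from each vertex to each
    # bone's anchor on the surface (e.g., via the heat method); sigma is a
    # falloff scale. Both are illustrative assumptions, not a confirmed API.
    d = verts[:, None, :] - mu[None, :, :]
    maha = torch.einsum('vbi,bij,vbj->vb', d, inv_cov, d)
    logit = -0.5 * maha - (geo_dist / sigma) ** 2   # penalize geodesically far bones
    return torch.softmax(logit, dim=-1)             # (V, B) skinning weights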
Motion DiT
Figure 3. Overview of the Motion DiT. Given static rigging features, a condition encoder produces anchor and global tokens that guide a diffusion transformer operating in RigMo's motion-latent space. The model uses spatial, temporal, and frame-conditioned cross-attention to predict denoised motion latents, which are decoded into bone transformations and vertex sequences via Gaussian skinning.
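As a structural sketch (our guess at one plausible layout, not the released architecture), a Motion DiT block could interleave self-attention over bone tokens within a frame (spatial), self-attention over frames per bone (temporal), and cross-attention to the rigging-derived condition tokens:

import torch
import torch.nn as nn

class MotionDiTBlock(nn.Module):
    # x: (N, T, B, C) motion latents; cond: (N, L, C) anchor/global tokens
    # from the condition encoder. Shapes and names are illustrative.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        N, T, B, C = x.shape
        s = x.reshape(N * T, B, C)                      # attend across bones per frame
        h = self.norm_s(s)
        s = s + self.spatial(h, h, h)[0]
        t = s.reshape(N, T, B, C).transpose(1, 2).reshape(N * B, T, C)
        h = self.norm_t(t)
        t = t + self.temporal(h, h, h)[0]               # attend across frames per bone
        q = t.reshape(N, B, T, C).transpose(1, 2).reshape(N, T * B, C)
        q = q + self.cross(self.norm_c(q), cond, cond)[0]  # inject rig conditioning
        return (q + self.mlp(q)).reshape(N, T, B, C)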

Results

Results
Figure 4. Results produced by the full RigMo pipeline. Given a sparse input sequence in which only a subset of frames is observed, as specified by a frame mask, RigMo reconstructs a complete animatable model by jointly predicting the rigging structure (Gaussian bones and skinning weights) and synthesizing the missing motion frames through diffusion in the RigMo latent space. The resulting rigged model produces coherent, articulated motion across humans, animals, and diverse non-human shapes, demonstrating that sparse observations are sufficient to recover a full animation without category-specific priors.
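One plausible mechanism for this sparse-frame completion, in the spirit of RePaint-style inpainting rather than a confirmed description of RigMo's conditioning: at every denoising step, re-noise the latents of the observed frames to the current noise level and splice them in, letting the diffusion model synthesize the masked frames. The denoiser signature below is a placeholder.

import torch

@torch.no_grad()
def sample_with_frame_mask(denoiser, z_obs, mask, alphas_bar, steps):
    # z_obs: (T, D) encoded latents of observed frames (zeros elsewhere);
    # mask: (T,) bool, True at observed frames; alphas_bar: (steps,) DDPM
    # cumulative noise schedule. `denoiser(z, i)` is a hypothetical call
    # mapping a level-i latent to level i-1.
    z = torch.randn_like(z_obs)
    for i in reversed(range(steps)):
        ab = alphas_bar[i]
        z_known = ab.sqrt() * z_obs + (1 - ab).sqrt() * torch.randn_like(z_obs)
        z = torch.where(mask[:, None], z_known, z)  # keep observed frames pinned
        z = denoiser(z, i)
    return z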
Comparison
Figure 5. Comparison between UniRig+Optimization and our RigMo Rigging Module. Although UniRig may produce visually plausible skinning weights in some cases (e.g., the fox), its rigging does not generalize and collapses under actual animation, leading to severe deformation artifacts. In contrast, RigMo learns robust and transferable rig structures directly from motion, without any ground-truth rig supervision, and achieves stable, high-fidelity deformations across diverse poses and animal species.
Ablation
Figure 6. Side-by-side comparison of skinning weights for the 48-token and 128-token configurations.

BibTeX

@article{zhang2026rigmo,
  title={RigMo: Unifying Rig and Motion Learning for Generative Animation},
  author={Zhang, Hao and Luo, Jiahao and Wan, Bohui and Zhao, Yizhou and 
          Li, Zongrui and Vasilkovsky, Michael and Wang, Chaoyang and 
          Wang, Jian and Ahuja, Narendra and Zhou, Bing},
  journal={arXiv preprint arXiv:2601.06378},
  year={2026}
}