Yue Ma*, Yulong Liu*, Qiyuan Zhu*, Ayden Yang, Kunyu Feng, Xinhua Zhang, Zhifeng Li,
Sirui Han, Chenyang Qi and Qifeng Chen
We showcase some results of Follow-Your-Motion below.
More results can be found on our project page.
This project is an advanced enhancement based on the Wan2.1 foundation model, specifically optimized for motion-following tasks.
By training custom Temporal LoRAs and Spatial LoRAs on input video sequences, the model learns to capture precise motion dynamics and spatial structures. Once the weights are obtained, our inference pipeline generates new videos that faithfully "follow" the movements of the reference source.
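Both adapters follow the standard LoRA formulation: a frozen weight `W` is augmented with a trainable low-rank product, `W' = W + (alpha / r) * B @ A`. A minimal NumPy sketch of that update rule (the dimensions, rank, and initialization below are illustrative, not the model's actual shapes):

```python
import numpy as np

# Illustrative LoRA update: W' = W + (alpha / r) * B @ A.
# Only A (r x d_in) and B (d_out x r) would be trained; W stays frozen.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16  # hypothetical sizes; rank/alpha mirror the flags below

W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # B starts at zero, so the adapter is a no-op at init

W_adapted = W + (alpha / r) * B @ A
assert np.allclose(W_adapted, W)  # before any training, W' == W
```

Because `B` is initialized to zero, inserting untrained LoRAs leaves the base model's behavior unchanged; training then moves `W'` away from `W` in a rank-`r` subspace.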
- Dual-LoRA Training: Support for training both Temporal and Spatial LoRAs to achieve superior motion fidelity.
- Motion Transfer: Seamlessly transfer complex actions from reference videos to your generated content.
- Optimized for Consumer GPUs: We highly recommend using the Wan2.1-T2V-1.3B model, which is optimized for efficiency without sacrificing quality.
- Memory Efficiency: Integrated with Tiled VAE technology and memory-offloading strategies.
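The idea behind tiled VAE decoding is to process the frame in spatial tiles so peak memory scales with one tile rather than the full resolution. A simplified sketch (non-overlapping tiles and an identity "decoder"; the real implementation blends overlapping tiles to hide seams):

```python
import numpy as np

def process_tiled(x, tile=64, fn=lambda t: t):
    """Apply fn to each spatial tile of x (H, W) independently,
    so peak memory is bounded by a single tile instead of the whole frame."""
    h, w = x.shape
    out = np.empty_like(x)
    for i in range(0, h, tile):
        for j in range(0, w, tile):
            out[i:i + tile, j:j + tile] = fn(x[i:i + tile, j:j + tile])
    return out

frame = np.random.rand(544, 544)  # matches the training resolution used below
assert np.array_equal(process_tiled(frame), frame)  # identity fn reconstructs exactly
```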
```shell
# Create Conda Environment
conda create -n Fym python=3.10
conda activate Fym
```

Requirements:

- Python 3.10+
- PyTorch 2.0+
- CUDA 12.x
```shell
python examples/download_model.py --1.3b
```

- Recommended Model: Wan2.1-T2V-1.3B (best balance between performance and VRAM usage).
- GPU: NVIDIA RTX 3090/4090 (recommended for training).
- VRAM: ~8.2GB for 1.3B inference, 24GB for full LoRA training.
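To check whether your GPU meets the VRAM guidance above, a small helper (hypothetical, not part of this repo) can query `nvidia-smi` and fall back gracefully on machines without an NVIDIA GPU:

```python
import shutil
import subprocess

def total_vram_mib():
    """Return the largest GPU's total VRAM in MiB, or None if nvidia-smi is absent."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return max(int(line) for line in out.split() if line)

vram = total_vram_mib()
print("VRAM (MiB):", vram if vram is not None else "no NVIDIA GPU detected")
```

Roughly 8400 MiB or more suggests 1.3B inference should fit; full LoRA training wants ~24 GB.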
Clone the repository and install the dependencies:
```shell
# Clone this repo
git clone https://github.com/maohanw6-afk/Follow-Your-Motion.git
cd Follow-Your-Motion

# Install required packages
pip install -r requirements.txt
pip install -e .
```
```shell
python examples/download_model.py --1.3b
```

Before processing the data, you need to organize your training dataset in the following structure:
```
data/example_dataset/
├── metadata.csv
└── train
    └── video_00001.mp4
```
The metadata.csv file should contain the file names and their corresponding text descriptions:
```
file_name,text
video_00001.mp4,"video description"
```
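If you have many clips, a short script (a hypothetical helper, not included in the repo) can generate `metadata.csv` with Python's `csv` module:

```python
import csv
from pathlib import Path

dataset = Path("data/example_dataset")
(dataset / "train").mkdir(parents=True, exist_ok=True)

# (file_name, text) pairs; in practice you would list files under train/
rows = [("video_00001.mp4", "video description")]

with open(dataset / "metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file_name", "text"])
    writer.writerows(rows)
```

Captions containing commas are quoted automatically by `csv.writer`, which keeps the file parseable by standard CSV readers.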
First, preprocess the dataset (this encodes the videos and captions):
```shell
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/EffiVMT_train_wan_t2v.py \
  --task data_process \
  --dataset_path examples/wanvideo/data/example_dataset \
  --output_path ./models \
  --text_encoder_path "./models/Wan2.1-T2V-1.3B/models_t5_umt5-xxl-enc-bf16.pth" \
  --vae_path "examples/models/Wan2.1-T2V-1.3B/Wan2.1_VAE.pth" \
  --tiled \
  --num_frames 45 \
  --height 544 \
  --width 544
```

Then collect attention maps from the dataset:

```shell
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/Collect_attn_map.py \
  --dataset_path examples/wanvideo/data/example_dataset \
  --dit_path "examples/models/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors"
```

Train the spatial attention components of the model:
```shell
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/EffiVMT_train_wan_t2v_Head.py \
  --task train \
  --train_architecture lora \
  --dataset_path examples/wanvideo/data/example_dataset \
  --output_path models/wan_video \
  --out_file_name='lora_spatial' \
  --dit_path "examples/wanvideo/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 3 \
  --learning_rate 1e-4 \
  --lora_rank 16 \
  --lora_alpha 16 \
  --lora_target_modules "self_attn.q_spatial,self_attn.k_spatial,self_attn.v_spatial" \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --spatial_lora_wd 0.1
```

After training the spatial components, train the temporal attention components:
```shell
CUDA_VISIBLE_DEVICES="0" python examples/wanvideo/EffiVMT_train_wan_t2v_Head.py \
  --task train \
  --train_architecture lora \
  --dataset_path examples/wanvideo/data/example_dataset \
  --out_file_name='lora_temporal' \
  --dit_path "examples/wanvideo/Wan2.1-T2V-1.3B/diffusion_pytorch_model.safetensors" \
  --steps_per_epoch 500 \
  --max_epochs 3 \
  --learning_rate 1e-4 \
  --lora_rank 16 \
  --lora_alpha 16 \
  --lora_target_modules "self_attn.q_temporal,self_attn.k_temporal,self_attn.v_temporal" \
  --accumulate_grad_batches 1 \
  --use_gradient_checkpointing \
  --train_temporal_lora \
  --pretrained_spatial_lora_path=models/wan_video/lora_spatial.ckpt \
  --temporal_lora_wd 0.5
```

Once both LoRAs are trained, run inference:

```shell
python examples/wanvideo/inference.py
```

Project directory structure:
```
Follow-Your-Motion/
├── diffsynth/                       # Core library
│   ├── models/                      # Model implementations
│   │   ├── wan_video_dit.py         # Modified DiT with Q/K extraction
│   │   ├── wan_video_vae.py         # Video VAE encoder/decoder
│   │   └── wan_video_text_encoder.py
│   ├── pipelines/                   # Inference pipelines
│   │   └── wan_video.py             # Pipeline with AMF implementation
│   ├── schedulers/                  # Noise schedulers (Flow Matching)
│   ├── prompters/                   # Prompt processing
│   └── vram_management/             # Memory optimization utilities
├── examples/                        # Example scripts
│   └── data/example_dataset/
│       ├── metadata.csv
│       └── train/
│           └── video_00001.mp4
├── models/                          # Model checkpoints
├── requirements.txt                 # Dependencies
└── setup.py                         # Package setup
```
- Single-GPU Support: This project is single-GPU compatible, allowing you to run the entire training and inference pipeline on a single graphics card.
- Two-Phase Training: The training process is systematically split into two phases: Spatial Attention training and Temporal Attention training.
- Hardware Configuration: Adjust the GPU device index in `CUDA_VISIBLE_DEVICES` as needed to match your local setup.
- Parameter Tuning:
  - You can modify general parameters such as learning rate, number of epochs, and LoRA settings based on your specific needs.
  - Specifically, feel free to adjust `max_epochs`, `spatial_lora_wd`, and `temporal_lora_wd` to optimize performance according to your hardware constraints and video content.
- Checkpoint Saving: All trained model checkpoints will be automatically saved to the output directories specified in your configuration.
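When tuning `spatial_lora_wd` and `temporal_lora_wd`, it can help to generate one command per candidate value for each training phase. A small sketch (the weight-decay grids are hypothetical, and other required flags from the full commands above are omitted for brevity):

```python
# Hypothetical sweep over the weight-decay knobs; the two phases train separately,
# so each candidate value gets its own command and output file name.
base = ("CUDA_VISIBLE_DEVICES=0 python examples/wanvideo/EffiVMT_train_wan_t2v_Head.py "
        "--task train --train_architecture lora")

spatial_cmds = [
    f"{base} --out_file_name='lora_spatial_wd{wd}' --spatial_lora_wd {wd}"
    for wd in (0.05, 0.1, 0.2)
]
temporal_cmds = [
    f"{base} --train_temporal_lora --out_file_name='lora_temporal_wd{wd}' "
    f"--temporal_lora_wd {wd}"
    for wd in (0.25, 0.5, 1.0)
]

for cmd in spatial_cmds + temporal_cmds:
    print(cmd)
```

Distinct `--out_file_name` values keep each run's checkpoint from overwriting the others.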
Follow-Your-Pose: Pose-Guided Text-to-Video Generation.
Follow-Your-Click: Open-Domain Regional Image Animation via Short Prompts.
If you find this project helpful, please feel free to leave a star ⭐ and cite our paper:
```bibtex
@article{ma2025follow,
  title={Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning},
  author={Ma, Yue and Liu, Yulong and Zhu, Qiyuan and Yang, Ayden and Feng, Kunyu and Zhang, Xinhua and Li, Zhifeng and Han, Sirui and Qi, Chenyang and Chen, Qifeng},
  journal={arXiv preprint arXiv:2506.05207},
  year={2025}
}
```

This project is open source and licensed under the MIT License. See LICENSE.md for details.
This repository borrows heavily from DiffSynth-Studio and Wan Video. Thanks to the authors for sharing their code and models.
This is the codebase for our research work. If you have any questions or ideas to discuss, feel free to open an issue.