Netflix/void-model
VOID: Video Object and Interaction Deletion

Saman Motamed1,2, William Harvey1, Benjamin Klein1, Luc Van Gool2, Zhuoning Yuan1, Ta-Ying Cheng1

1Netflix    2INSAIT, Sofia University "St. Kliment Ohridski"


VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.

Example: If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.

teaser-with-name.mp4


🤖 Models

VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.

| Model | Description | HuggingFace |
| --- | --- | --- |
| VOID Pass 1 | Base inpainting model | Download |
| VOID Pass 2 | Warped-noise refinement model | Download |

Place checkpoints anywhere and pass the path via --config.video_model.transformer_path (Pass 1) or --model_checkpoint (Pass 2).


▶️ Quick Start

The fastest way to try VOID is the included notebook — it handles setup, downloads the models, runs inference on a sample video, and displays the result:

Open in Colab

Note: Requires a GPU with 40GB+ VRAM (e.g., A100).

For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.


⚙️ Setup

pip install -r requirements.txt

Stage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:

export GEMINI_API_KEY=your_key_here

Also install SAM2 separately (required for mask generation):

git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e .

Download the pretrained base inpainting model from HuggingFace:

hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
    --local-dir ./CogVideoX-Fun-V1.5-5b-InP

The inference and training scripts expect it at ./CogVideoX-Fun-V1.5-5b-InP relative to the repo root by default.

If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

ln -sf $(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())") ~/.local/bin/ffmpeg

📁 Expected directory structure

After cloning the repo and downloading all assets, your directory should look like this:

VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                         # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/     # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors          # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors          # download from huggingface.co/void-model (see Models above)
├── training_data/                  # generated via data_generation/ pipeline (see Training section)
└── data_generation/                # data generation code (HUMOTO + Kubric pipelines)

📂 Input Format

Each video sequence lives in its own folder under a root data directory:

data_rootdir/
└── my-video/
    ├── input_video.mp4      # source video
    ├── quadmask_0.mp4       # quadmask (4-value mask video, see below)
    └── prompt.json          # {"bg": "background description"}

The prompt.json contains a single "bg" key describing the scene after the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.

{ "bg": "A table with a cup on it." }         // ✅ describes the clean background
{ "bg": "A person being removed from scene." } // ❌ don't describe the removal
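To set up a sequence folder in this layout programmatically, a small helper like the following can be used. The function name is illustrative (not part of the repo); it only writes prompt.json, and input_video.mp4 plus quadmask_0.mp4 must still be copied in separately.

```python
import json
from pathlib import Path

def init_sequence(data_rootdir, name, bg_prompt):
    """Create a sequence folder in the layout VOID inference expects.

    Writes prompt.json with the single "bg" key. The video and quadmask
    files are provided by the user / mask pipeline, not by this helper.
    """
    seq = Path(data_rootdir) / name
    seq.mkdir(parents=True, exist_ok=True)
    (seq / "prompt.json").write_text(json.dumps({"bg": bg_prompt}))
    return seq
```

Example: `init_sequence("data_rootdir", "my-video", "A table with a cup on it.")` creates `data_rootdir/my-video/prompt.json` with the clean-background description.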

A few examples from the included samples:

| Sequence | Removed object | `bg` prompt |
| --- | --- | --- |
| lime | the glass | "A lime falls on the table." |
| moving_ball | the rubber duckie | "A ball rolls off the table." |
| pillow | the kettlebell being placed on the pillow | "Two pillows are on the table." |

The quadmask encodes four semantic regions per pixel:

| Value | Meaning |
| --- | --- |
| 0 | Primary object to remove |
| 63 | Overlap of primary + affected regions |
| 127 | Affected region (interactions: falling objects, displaced items, etc.) |
| 255 | Background (keep) |
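Given separate boolean masks for the primary object and the affected region, a single quadmask frame with these four values can be composed as follows. This is a minimal sketch of the encoding described above, not the repo's own mask-combination code:

```python
import numpy as np

def make_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Compose one quadmask frame from two boolean (H, W) masks.

    primary  -- True where the object to remove is
    affected -- True where interaction effects occur
    """
    quad = np.full(primary.shape, 255, dtype=np.uint8)  # background (keep)
    quad[affected] = 127                                # affected region
    quad[primary] = 0                                   # primary object
    quad[primary & affected] = 63                       # overlap of both
    return quad
```

Encoding each frame this way and writing the stack out as a grayscale video yields a quadmask_0.mp4 in the expected format.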

🚀 Pipeline

🎭 Stage 1 — Generate Masks

The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.

🖱️ Step 0 — Select points (GUI)

python VLM-MASK-REASONER/point_selector_gui.py

Load a JSON config listing your videos and instructions, then click on the objects to remove. Saves a *_points.json with the selected points.

Config format:

{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}
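A config in this format can also be generated in Python rather than written by hand; the helper below is a hypothetical convenience, not part of the repo:

```python
import json

def write_points_config(entries, path):
    """Write a VLM-MASK-REASONER config from (video_path, output_dir, instruction) tuples."""
    cfg = {
        "videos": [
            {"video_path": v, "output_dir": o, "instruction": instr}
            for v, o, instr in entries
        ]
    }
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)
```

Example: `write_points_config([("clips/a.mp4", "out/a", "remove the person")], "my_config.json")`.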

⚡ Steps 1–4 — Run the full pipeline

After saving the points config, run all remaining stages automatically:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json

Optional flags:

bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
    --sam2-checkpoint path/to/sam2_hiera_large.pt \
    --device cuda

This runs the following stages in order:

| Stage | Script | Output |
| --- | --- | --- |
| 1 — SAM2 segmentation | stage1_sam2_segmentation.py | black_mask.mp4 |
| 2 — VLM analysis | stage2_vlm_analysis.py | vlm_analysis.json |
| 3 — Grey mask generation | stage3a_generate_grey_masks_v2.py | grey_mask.mp4 |
| 4 — Combine into quadmask | stage4_combine_masks.py | quadmask_0.mp4 |

The final quadmask_0.mp4 in each video's output_dir is ready to use for inference.


🎬 Stage 2 — Inference

VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.

✨ Pass 1 — Base inference

python inference/cogvideox_fun/predict_v2v.py \
    --config config/quadmask_cogvideox.py \
    --config.data.data_rootdir="path/to/data_rootdir" \
    --config.experiment.run_seqs="my-video" \
    --config.experiment.save_path="path/to/output" \
    --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
    --config.video_model.transformer_path="path/to/void_pass1.safetensors"

To run multiple sequences at once, pass a comma-separated list:

--config.experiment.run_seqs="video1,video2,video3"

Key config options:

| Flag | Default | Description |
| --- | --- | --- |
| --config.data.sample_size | 384x672 | Output resolution (HxW) |
| --config.data.max_video_length | 197 | Max frames to process |
| --config.video_model.temporal_window_size | 85 | Temporal window for multidiffusion |
| --config.video_model.num_inference_steps | 50 | Denoising steps |
| --config.video_model.guidance_scale | 1.0 | Classifier-free guidance scale |
| --config.system.gpu_memory_mode | model_cpu_offload_and_qfloat8 | Memory mode (also: model_full_load, model_cpu_offload, sequential_cpu_offload) |

The output is saved as <save_path>/<sequence_name>.mp4, along with a *_tuple.mp4 side-by-side comparison.

🔁 Pass 2 — Warped noise refinement

Uses optical flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.
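The core idea of flow-warped noise initialization can be illustrated with a toy nearest-neighbor backward warp of a 2D noise field. The actual pipeline warps latents with its own optical-flow utilities (from Go-with-the-Flow); this sketch only shows the principle that the same noise sample follows scene motion from frame to frame:

```python
import numpy as np

def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp an (H, W) noise field by a per-pixel flow (H, W, 2).

    flow[..., 0] is the horizontal and flow[..., 1] the vertical displacement;
    each output pixel samples the noise at its flow-displaced source location
    (nearest neighbor, clamped at the borders).
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 0]).astype(int), 0, w - 1)
    return noise[src_y, src_x]
```

With zero flow the noise is unchanged; with a uniform rightward flow, each noise value is carried one pixel to the right, so the denoiser sees temporally coherent initial noise.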

Single video:

python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
    --video_name my-video \
    --data_rootdir path/to/data_rootdir \
    --pass1_dir path/to/pass1_outputs \
    --output_dir path/to/pass2_outputs \
    --model_checkpoint path/to/void_pass2.safetensors \
    --model_name path/to/CogVideoX-Fun-V1.5-5b-InP

Batch: Edit the video list and paths in inference/pass_2_refine.sh, then run:

bash inference/pass_2_refine.sh

Key arguments:

| Argument | Default | Description |
| --- | --- | --- |
| --pass1_dir | (no default) | Directory containing Pass 1 output videos |
| --output_dir | ./inference_with_warped_noise | Where to save Pass 2 results |
| --warped_noise_cache_dir | ./pass1_warped_noise_cache | Cache for precomputed warped latents |
| --temporal_window_size | 85 | Temporal window size |
| --height / --width | 384 / 672 | Output resolution |
| --guidance_scale | 6.0 | CFG scale |
| --num_inference_steps | 50 | Denoising steps |
| --use_quadmask | True | Use quadmask conditioning |

✏️ Stage 3 — Manual Mask Refinement (Optional)

If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.

python VLM-MASK-REASONER/edit_quadmask.py

Open a sequence folder containing input_video.mp4 (or rgb_full.mp4) and quadmask_0.mp4. The editor shows the original video and the editable mask side by side.

Tools:

  • Grid Toggle — click a grid cell to toggle the interaction region (127 ↔ 255)
  • Grid Black Toggle — click a grid cell to toggle the primary object region (0 ↔ 255)
  • Brush (Add / Erase) — freehand paint or erase mask regions at pixel level
  • Copy from Previous Frame — propagate the black or grey mask from the previous frame

Keyboard shortcuts: ←/→ to navigate frames, Ctrl+Z / Ctrl+Y to undo/redo.

Save overwrites quadmask_0.mp4 in place. Rerun inference from Pass 1 after saving.


🏋️ Training

Training Data Generation

Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with/without object, plus quad-masks) from two sources:

Source 1: HUMOTO (Human-Object Interaction)

Generates counterfactual videos from the HUMOTO motion capture dataset using Blender. A human (Remy/Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.

Prerequisites:

  1. HUMOTO dataset — Request access from the authors at adobe-research/humoto. Once approved, download and place under data_generation/humoto_release/
  2. Blender — Install Blender (tested with 3.x and 4.x). Also install opencv-python-headless in Blender's Python (see data_generation/README.md)
  3. Remy & Sophie characters — Download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", download each as FBX, and place at:
    data_generation/human_model/Remy_mixamo_bone.fbx
    data_generation/human_model/Sophie_mixamo_bone.fbx
    
  4. PBR textures (optional) — Download texture packs from ambientCG or Poly Haven. Without textures, objects fall back to plausible solid colors

Expected directory structure after setup:

data_generation/
├── humoto_release/
│   ├── humoto_0805/                    # HUMOTO sequences (.pkl, .fbx, .yaml per sequence)
│   └── humoto_objects_0805/            # Object meshes (.obj, .fbx per object)
├── human_model/
│   ├── Remy_mixamo_bone.fbx            # ← download from Mixamo
│   ├── Sophie_mixamo_bone.fbx          # ← download from Mixamo
│   ├── bone_names.py                   # included
│   └── *.json                          # included (bone structure definitions)
├── textures/                           # ← optional, user-provided PBR textures
├── physics_config.json                 # included (manual per-sequence physics settings)
├── render_paired_videos_blender_quadmask.py   # main renderer
├── convert_split_remy_sophie.sh               # character conversion script
└── ...

Pipeline:

cd data_generation

# 1. Convert HUMOTO sequences to Remy/Sophie characters
bash convert_split_remy_sophie.sh

# 2. Render paired videos (with human, without human, quad-mask)
blender --background --python render_paired_videos_blender_quadmask.py -- \
    -d ./humoto_release/humoto_0805 \
    -o ./output \
    -s <sequence_name> \
    -m ./humoto_release/humoto_objects_0805 \
    --use_characters --enable_physics --add_walls \
    --target_frames 60 --fps 12

A pre-configured physics_config.json is included specifying which objects are static vs. dynamic per sequence. See data_generation/README.md for full details.

Source 2: Kubric (Object-Only Interaction)

Generates counterfactual videos using Kubric with Google Scanned Objects. Objects are launched at a target; removing them alters the target's physics trajectory. No external dataset download required — assets are fetched from Google Cloud Storage.

cd data_generation
pip install kubric pybullet imageio imageio-ffmpeg

python kubric_variable_objects.py --num_pairs 200 --resolution 384

Training Data Format

Both pipelines output the same format expected by the training scripts:

training_data/
└── sequence_name/
    ├── rgb_full.mp4       # input video (with object)
    ├── rgb_removed.mp4    # target video (object removed, physics applied)
    ├── mask.mp4           # quad-mask (0/63/127/255)
    └── metadata.json

Point the training scripts at your generated data by updating datasets/void_train_data.json.
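A quick way to sanity-check generated data before training is to verify that each sequence folder contains all four expected files. This validator is an illustrative helper, not part of the repo:

```python
from pathlib import Path

# Files both data generation pipelines are expected to produce per sequence.
REQUIRED = ["rgb_full.mp4", "rgb_removed.mp4", "mask.mp4", "metadata.json"]

def validate_sequence(seq_dir) -> list[str]:
    """Return the list of required files missing from a sequence folder."""
    seq = Path(seq_dir)
    return [name for name in REQUIRED if not (seq / name).exists()]
```

Running it over every subfolder of training_data/ and reporting any non-empty result catches incomplete renders before they reach the training scripts.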


Running Training

Training proceeds in two stages. Pass 1 is trained first, then Pass 2 fine-tunes from that checkpoint.

Pass 1 — Base inpainting model

Does not require warped noise. Trains the model to remove objects and their interactions from scratch.

bash scripts/cogvideox_fun/train_void.sh

Key arguments:

| Argument | Description |
| --- | --- |
| --pretrained_model_name_or_path | Path to base CogVideoX inpainting model |
| --transformer_path | Optional starting checkpoint |
| --train_data_meta | Path to dataset metadata JSON |
| --train_mode="void" | Enables VOID inpainting training mode |
| --use_quadmask | Trains with 4-value quadmask conditioning |
| --use_vae_mask | Encodes mask through VAE |
| --output_dir | Where to save checkpoints |
| --num_train_epochs | Number of epochs |
| --checkpointing_steps | Save a checkpoint every N steps |
| --learning_rate | Default 1e-5 |

Pass 2 — Warped noise refinement model

Continues training from a Pass 1 checkpoint with optical flow-warped latent initialization, improving temporal consistency on longer videos. Requires precomputed warped noise for the training data.

bash scripts/cogvideox_fun/train_void_warped_noise.sh

Set TRANSFORMER_PATH to your Pass 1 checkpoint before running:

TRANSFORMER_PATH=path/to/pass1_checkpoint.safetensors bash scripts/cogvideox_fun/train_void_warped_noise.sh

Additional arguments specific to this stage:

| Argument | Description |
| --- | --- |
| --use_warped_noise | Enables warped latent initialization during training |
| --warped_noise_degradation | Noise blending factor (default 0.3) |
| --warped_noise_probability | Fraction of steps using warped noise (default 1.0) |
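One plausible reading of the degradation factor is a variance-preserving blend of warped and fresh Gaussian noise. The exact formula behind --warped_noise_degradation is an assumption here; the sketch below only shows the standard square-root mix that keeps the result at unit variance:

```python
import numpy as np

def blend_noise(warped: np.ndarray, fresh: np.ndarray, degradation: float = 0.3) -> np.ndarray:
    """Blend warped and fresh unit-variance Gaussian noise.

    ASSUMPTION: this sqrt-weighted mix is illustrative, not the repo's exact
    formula. For independent unit-variance inputs, the output variance is
    (1 - d^2) + d^2 = 1, so the diffusion noise statistics are preserved.
    """
    d = degradation
    return np.sqrt(1.0 - d**2) * warped + d * fresh
```

At degradation 0 the warped noise is used as-is; at 1 it is fully replaced by fresh noise.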

Training was run on 8× A100 80GB GPUs using DeepSpeed ZeRO stage 2.


🤩 Community Adoption

We are excited to see the community build on VOID!
Below we showcase selected demos, tools, and extensions.

If you’ve built something using VOID, feel free to submit a PR to add it here.

🌐 Demos & Projects

🙏 Acknowledgements

This implementation builds on code and models from aigc-apps/VideoX-Fun, Gen-Omnimatte, Go-with-the-Flow, Kubric, and HUMOTO. We thank the authors for sharing their code, the pretrained CogVideoX inpainting models, and the optical flow warping utilities.


📄 Citation

If you find our work useful, please consider citing:

🔗 https://arxiv.org/abs/2604.02296

@misc{motamed2026void,
  title={VOID: Video Object and Interaction Deletion},
  author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
  year={2026},
  eprint={2604.02296},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02296}
}
