Saman Motamed1,2, William Harvey1, Benjamin Klein1, Luc Van Gool2, Zhuoning Yuan1, Ta-Ying Cheng1
1Netflix 2INSAIT, Sofia University "St. Kliment Ohridski"
VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but physical interactions like objects falling when a person is removed. It is built on top of CogVideoX and fine-tuned for video inpainting with interaction-aware mask conditioning.
Example: If a person holding a guitar is removed, VOID also removes the person's effect on the guitar — causing it to fall naturally.
VOID uses two transformer checkpoints, trained sequentially. You can run inference with Pass 1 alone or chain both passes for higher temporal consistency.
| Model | Description | HuggingFace |
|---|---|---|
| VOID Pass 1 | Base inpainting model | Download |
| VOID Pass 2 | Warped-noise refinement model | Download |
Place checkpoints anywhere and pass the path via `--config.video_model.transformer_path` (Pass 1) or `--model_checkpoint` (Pass 2).
The fastest way to try VOID is the included notebook — it handles setup, downloads the models, runs inference on a sample video, and displays the result:
Note: Requires a GPU with 40GB+ VRAM (e.g., A100).
For more control over the pipeline (custom videos, Pass 2 refinement, mask generation), see the full setup and instructions below.
```bash
pip install -r requirements.txt
```

Stage 1 of the mask pipeline uses Gemini via the Google AI API. Set your API key:

```bash
export GEMINI_API_KEY=your_key_here
```

Also install SAM2 separately (required for mask generation):
```bash
git clone https://github.com/facebookresearch/sam2.git
cd sam2 && pip install -e .
```

Download the pretrained base inpainting model from HuggingFace:
```bash
hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP
```

The inference and training scripts expect it at `./CogVideoX-Fun-V1.5-5b-InP` relative to the repo root by default.
If ffmpeg is not available on your system, you can use the binary bundled with imageio-ffmpeg:

```bash
ln -sf $(python -c "import imageio_ffmpeg; print(imageio_ffmpeg.get_ffmpeg_exe())") ~/.local/bin/ffmpeg
```

📁 Expected directory structure
After cloning the repo and downloading all assets, your directory should look like this:
```
VOID/
├── config/
├── datasets/
│   └── void_train_data.json
├── inference/
├── sample/                      # included sample sequences for inference
├── scripts/
├── videox_fun/
├── VLM-MASK-REASONER/
├── README.md
├── requirements.txt
│
├── CogVideoX-Fun-V1.5-5b-InP/   # hf download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP
├── void_pass1.safetensors       # download from huggingface.co/void-model (see Models above)
├── void_pass2.safetensors       # download from huggingface.co/void-model (see Models above)
├── training_data/               # generated via data_generation/ pipeline (see Training section)
└── data_generation/             # data generation code (HUMOTO + Kubric pipelines)
```
Each video sequence lives in its own folder under a root data directory:
```
data_rootdir/
└── my-video/
    ├── input_video.mp4   # source video
    ├── quadmask_0.mp4    # quadmask (4-value mask video, see below)
    └── prompt.json       # {"bg": "background description"}
```
The `prompt.json` contains a single `"bg"` key describing the scene after the object has been removed — i.e. what you want the background to look like. Do not describe the object being removed; describe what remains.

```
{ "bg": "A table with a cup on it." }           // ✅ describes the clean background
{ "bg": "A person being removed from scene." }  // ❌ don't describe the removal
```

A few examples from the included samples:
| Sequence | Removed object | bg prompt |
|---|---|---|
| `lime` | the glass | "A lime falls on the table." |
| `moving_ball` | the rubber duckie | "A ball rolls off the table." |
| `pillow` | the kettlebell being placed on the pillow | "Two pillows are on the table." |
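As a convenience, the per-sequence layout above can be scaffolded in a few lines of Python. This is an illustrative helper, not part of the repo; you still need to supply `input_video.mp4` and `quadmask_0.mp4` yourself:

```python
import json
from pathlib import Path

def init_sequence(data_rootdir: str, name: str, bg_prompt: str) -> Path:
    """Create a sequence folder and write its prompt.json.

    Illustrative helper, not part of the VOID codebase. input_video.mp4
    and quadmask_0.mp4 must still be added before running inference.
    """
    seq_dir = Path(data_rootdir) / name
    seq_dir.mkdir(parents=True, exist_ok=True)
    # "bg" describes the clean background after removal, never the removal itself.
    (seq_dir / "prompt.json").write_text(json.dumps({"bg": bg_prompt}, indent=2))
    return seq_dir

seq = init_sequence("data_rootdir", "my-video", "A table with a cup on it.")
print(json.loads((seq / "prompt.json").read_text())["bg"])  # A table with a cup on it.
```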
The quadmask encodes four semantic regions per pixel:
| Value | Meaning |
|---|---|
| `0` | Primary object to remove |
| `63` | Overlap of primary + affected regions |
| `127` | Affected region (interactions: falling objects, displaced items, etc.) |
| `255` | Background (keep) |
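Given separate binary masks for the primary object and the affected region, a quadmask frame can be composed as follows. This is an illustrative NumPy sketch of the four-value encoding above; `compose_quadmask` is not part of the codebase:

```python
import numpy as np

def compose_quadmask(primary: np.ndarray, affected: np.ndarray) -> np.ndarray:
    """Combine two boolean (H, W) masks into a 4-value quadmask frame.

    Illustrative, not the VOID implementation: 0 = primary object,
    63 = primary+affected overlap, 127 = affected only, 255 = background.
    """
    quad = np.full(primary.shape, 255, dtype=np.uint8)  # background (keep)
    quad[affected] = 127                                # interaction-affected region
    quad[primary] = 0                                   # primary object to remove
    quad[primary & affected] = 63                       # overlap of the two
    return quad

primary = np.zeros((4, 4), bool); primary[:2, :2] = True
affected = np.zeros((4, 4), bool); affected[1:3, 1:3] = True
print(np.unique(compose_quadmask(primary, affected)).tolist())  # [0, 63, 127, 255]
```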
🎭 Stage 1 — Generate Masks
The VLM-MASK-REASONER/ pipeline generates quadmasks from raw videos using SAM2 segmentation and a VLM (Gemini) for reasoning about interaction-affected regions.
```bash
python VLM-MASK-REASONER/point_selector_gui.py
```

Load a JSON config listing your videos and instructions, then click on the objects to remove. The GUI saves a `*_points.json` with the selected points.
Config format:
```json
{
  "videos": [
    {
      "video_path": "path/to/video.mp4",
      "output_dir": "path/to/output/folder",
      "instruction": "remove the person"
    }
  ]
}
```

After saving the points config, run all remaining stages automatically:
```bash
bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json
```

Optional flags:
```bash
bash VLM-MASK-REASONER/run_pipeline.sh my_config_points.json \
  --sam2-checkpoint path/to/sam2_hiera_large.pt \
  --device cuda
```

This runs the following stages in order:
| Stage | Script | Output |
|---|---|---|
| 1 — SAM2 segmentation | `stage1_sam2_segmentation.py` | `black_mask.mp4` |
| 2 — VLM analysis | `stage2_vlm_analysis.py` | `vlm_analysis.json` |
| 3 — Grey mask generation | `stage3a_generate_grey_masks_v2.py` | `grey_mask.mp4` |
| 4 — Combine into quadmask | `stage4_combine_masks.py` | `quadmask_0.mp4` |
The final quadmask_0.mp4 in each video's output_dir is ready to use for inference.
🎬 Stage 2 — Inference
VOID inference runs in two passes. Pass 1 is sufficient for most videos; Pass 2 adds a warped-noise refinement step for better temporal consistency on longer clips.
```bash
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="path/to/data_rootdir" \
  --config.experiment.run_seqs="my-video" \
  --config.experiment.save_path="path/to/output" \
  --config.video_model.model_name="path/to/CogVideoX-Fun-V1.5-5b-InP" \
  --config.video_model.transformer_path="path/to/void_pass1.safetensors"
```

To run multiple sequences at once, pass a comma-separated list:
```bash
--config.experiment.run_seqs="video1,video2,video3"
```

Key config options:
| Flag | Default | Description |
|---|---|---|
| `--config.data.sample_size` | `384x672` | Output resolution (HxW) |
| `--config.data.max_video_length` | `197` | Max frames to process |
| `--config.video_model.temporal_window_size` | `85` | Temporal window for multidiffusion |
| `--config.video_model.num_inference_steps` | `50` | Denoising steps |
| `--config.video_model.guidance_scale` | `1.0` | Classifier-free guidance scale |
| `--config.system.gpu_memory_mode` | `model_cpu_offload_and_qfloat8` | Memory mode (`model_full_load`, `model_cpu_offload`, `sequential_cpu_offload`) |
The output is saved as `<save_path>/<sequence_name>.mp4`, along with a `*_tuple.mp4` side-by-side comparison.
Pass 2 uses optical-flow-warped latents from the Pass 1 output to initialize a second inference pass, improving temporal consistency.
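To build intuition, flow warping can be sketched as a nearest-neighbor gather of noise values along the flow field, so the initial noise stays correlated across frames. This toy NumPy version is illustrative only; the actual pipeline warps VAE latents using estimated optical flow:

```python
import numpy as np

def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a (H, W) noise field by a (H, W, 2) flow of (dy, dx) offsets.

    Toy nearest-neighbor illustration of flow-warped noise; not the
    implementation used by VOID, which operates on latents.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Backward warp: each output pixel pulls from its flow source, clipped to bounds.
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise = rng.standard_normal((8, 8))
flow = np.zeros((8, 8, 2)); flow[..., 1] = 1.0  # uniform 1-px shift right
warped = warp_noise(noise, flow)
print(np.allclose(warped[:, 1:], noise[:, :-1]))  # True
```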
Single video:
```bash
python inference/cogvideox_fun/inference_with_pass1_warped_noise.py \
  --video_name my-video \
  --data_rootdir path/to/data_rootdir \
  --pass1_dir path/to/pass1_outputs \
  --output_dir path/to/pass2_outputs \
  --model_checkpoint path/to/void_pass2.safetensors \
  --model_name path/to/CogVideoX-Fun-V1.5-5b-InP
```

Batch: Edit the video list and paths in `inference/pass_2_refine.sh`, then run:
```bash
bash inference/pass_2_refine.sh
```

Key arguments:
| Argument | Default | Description |
|---|---|---|
| `--pass1_dir` | — | Directory containing Pass 1 output videos |
| `--output_dir` | `./inference_with_warped_noise` | Where to save Pass 2 results |
| `--warped_noise_cache_dir` | `./pass1_warped_noise_cache` | Cache for precomputed warped latents |
| `--temporal_window_size` | `85` | Temporal window size |
| `--height` / `--width` | `384` / `672` | Output resolution |
| `--guidance_scale` | `6.0` | CFG scale |
| `--num_inference_steps` | `50` | Denoising steps |
| `--use_quadmask` | `True` | Use quadmask conditioning |
✏️ Stage 3 — Manual Mask Refinement (Optional)
If the auto-generated quadmask does not accurately capture the object or its interaction region, use the included GUI editor to refine it before running inference.
```bash
python VLM-MASK-REASONER/edit_quadmask.py
```

Open a sequence folder containing `input_video.mp4` (or `rgb_full.mp4`) and `quadmask_0.mp4`. The editor shows the original video and the editable mask side by side.
Tools:
- Grid Toggle — click a grid cell to toggle the interaction region (`127` ↔ `255`)
- Grid Black Toggle — click a grid cell to toggle the primary object region (`0` ↔ `255`)
- Brush (Add / Erase) — freehand paint or erase mask regions at pixel level
- Copy from Previous Frame — propagate the black or grey mask from the previous frame
Keyboard shortcuts: ← / → navigate frames, Ctrl+Z / Ctrl+Y undo/redo.
Save overwrites quadmask_0.mp4 in place. Rerun inference from Pass 1 after saving.
🏋️ Training
Due to licensing constraints on the underlying datasets, we release the data generation code instead of the pre-built training data. The code produces paired counterfactual videos (with/without object, plus quad-masks) from two sources:
The HUMOTO pipeline generates counterfactual videos from the HUMOTO motion-capture dataset using Blender. A human (the Remy or Sophie character) interacts with objects; removing the human causes objects to fall via physics simulation.
Prerequisites:
- HUMOTO dataset — Request access from the authors at adobe-research/humoto. Once approved, download and place it under `data_generation/humoto_release/`
- Blender — Install Blender (tested with 3.x and 4.x). Also install `opencv-python-headless` in Blender's bundled Python (see `data_generation/README.md`)
- Remy & Sophie characters — Download from Mixamo (free Adobe account). Search for "Remy" and "Sophie", download each as FBX, and place them at:
  - `data_generation/human_model/Remy_mixamo_bone.fbx`
  - `data_generation/human_model/Sophie_mixamo_bone.fbx`
- PBR textures (optional) — Download texture packs from ambientCG or Poly Haven. Without textures, objects render with solid colors as a fallback
Expected directory structure after setup:
```
data_generation/
├── humoto_release/
│   ├── humoto_0805/            # HUMOTO sequences (.pkl, .fbx, .yaml per sequence)
│   └── humoto_objects_0805/    # object meshes (.obj, .fbx per object)
├── human_model/
│   ├── Remy_mixamo_bone.fbx    # ← download from Mixamo
│   ├── Sophie_mixamo_bone.fbx  # ← download from Mixamo
│   ├── bone_names.py           # included
│   └── *.json                  # included (bone structure definitions)
├── textures/                   # ← optional, user-provided PBR textures
├── physics_config.json         # included (manual per-sequence physics settings)
├── render_paired_videos_blender_quadmask.py   # main renderer
├── convert_split_remy_sophie.sh               # character conversion script
└── ...
```
Pipeline:
```bash
cd data_generation

# 1. Convert HUMOTO sequences to Remy/Sophie characters
bash convert_split_remy_sophie.sh

# 2. Render paired videos (with human, without human, quad-mask)
blender --background --python render_paired_videos_blender_quadmask.py -- \
  -d ./humoto_release/humoto_0805 \
  -o ./output \
  -s <sequence_name> \
  -m ./humoto_release/humoto_objects_0805 \
  --use_characters --enable_physics --add_walls \
  --target_frames 60 --fps 12
```

A pre-configured `physics_config.json` is included, specifying which objects are static vs. dynamic per sequence. See `data_generation/README.md` for full details.
The Kubric pipeline generates counterfactual videos using Kubric with Google Scanned Objects. Objects are launched at a target; removing them alters the target's physics trajectory. No external dataset download is required — assets are fetched from Google Cloud Storage.
```bash
cd data_generation
pip install kubric pybullet imageio imageio-ffmpeg
python kubric_variable_objects.py --num_pairs 200 --resolution 384
```

Both pipelines output the same format expected by the training scripts:
```
training_data/
└── sequence_name/
    ├── rgb_full.mp4      # input video (with object)
    ├── rgb_removed.mp4   # target video (object removed, physics applied)
    ├── mask.mp4          # quad-mask (0/63/127/255)
    └── metadata.json
```
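A quick way to sanity-check generated sequences against this layout (an illustrative helper, not part of the training scripts):

```python
import tempfile
from pathlib import Path

# Files every generated training sequence must contain (from the layout above).
REQUIRED = ["rgb_full.mp4", "rgb_removed.mp4", "mask.mp4", "metadata.json"]

def missing_files(seq_dir: str) -> list:
    """Return the required training files absent from a sequence folder."""
    d = Path(seq_dir)
    return [name for name in REQUIRED if not (d / name).exists()]

# Example: an empty folder is missing all four files.
with tempfile.TemporaryDirectory() as tmp:
    print(missing_files(tmp))  # all four required names
```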
Point the training scripts at your generated data by updating datasets/void_train_data.json.
Training proceeds in two stages. Pass 1 is trained first, then Pass 2 fine-tunes from that checkpoint.
Pass 1 does not require warped noise; it trains the model to remove objects and their interactions from scratch.
```bash
bash scripts/cogvideox_fun/train_void.sh
```

Key arguments:
| Argument | Description |
|---|---|
| `--pretrained_model_name_or_path` | Path to the base CogVideoX inpainting model |
| `--transformer_path` | Optional starting checkpoint |
| `--train_data_meta` | Path to the dataset metadata JSON |
| `--train_mode="void"` | Enables the void inpainting training mode |
| `--use_quadmask` | Trains with 4-value quadmask conditioning |
| `--use_vae_mask` | Encodes the mask through the VAE |
| `--output_dir` | Where to save checkpoints |
| `--num_train_epochs` | Number of epochs |
| `--checkpointing_steps` | Save a checkpoint every N steps |
| `--learning_rate` | Learning rate (default 1e-5) |
Pass 2 continues training from a Pass 1 checkpoint with optical-flow-warped latent initialization, improving temporal consistency on longer videos. It requires the warped noise for the training data to have been precomputed.
```bash
bash scripts/cogvideox_fun/train_void_warped_noise.sh
```

Set `TRANSFORMER_PATH` to your Pass 1 checkpoint before running:
```bash
TRANSFORMER_PATH=path/to/pass1_checkpoint.safetensors bash scripts/cogvideox_fun/train_void_warped_noise.sh
```

Additional arguments specific to this stage:
| Argument | Description |
|---|---|
| `--use_warped_noise` | Enables warped latent initialization during training |
| `--warped_noise_degradation` | Noise blending factor (default 0.3) |
| `--warped_noise_probability` | Fraction of steps using warped noise (default 1.0) |
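One plausible interpretation of the degradation factor is a variance-preserving blend of warped and fresh Gaussian noise. The sketch below is our own assumption for illustration; the actual blending rule in the training script may differ:

```python
import numpy as np

def degrade_warped_noise(warped: np.ndarray, degradation: float, rng) -> np.ndarray:
    """Blend warped noise with fresh Gaussian noise, preserving unit variance.

    Hypothetical reading of --warped_noise_degradation (default 0.3);
    not taken from the VOID training code.
    """
    fresh = rng.standard_normal(warped.shape)
    # d=0 keeps the warped noise unchanged; d=1 replaces it with fresh noise.
    return np.sqrt(1 - degradation) * warped + np.sqrt(degradation) * fresh

rng = np.random.default_rng(0)
warped = rng.standard_normal((16, 16))
out = degrade_warped_noise(warped, 0.3, rng)
print(out.shape)  # (16, 16)
```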
Training was run on 8× A100 80GB GPUs using DeepSpeed ZeRO stage 2.
We are excited to see the community build on VOID!
Below we showcase selected demos, tools, and extensions.
If you’ve built something using VOID, feel free to submit a PR to add it here.
- ⭐ Gradio Demo — @sam-motamed
Interactive demo for trying VOID in the browser:
👉 https://huggingface.co/spaces/sam-motamed/VOID
This implementation builds on code and models from aigc-apps/VideoX-Fun, Gen-Omnimatte, Go-with-the-Flow, Kubric, and HUMOTO. We thank the authors for sharing their code, the pretrained inpainting models for CogVideoX and Gen-Omnimatte, and the optical-flow warping utilities.
If you find our work useful, please consider citing:
🔗 https://arxiv.org/abs/2604.02296
@misc{motamed2026void,
title={VOID: Video Object and Interaction Deletion},
author={Saman Motamed and William Harvey and Benjamin Klein and Luc Van Gool and Zhuoning Yuan and Ta-Ying Cheng},
year={2026},
eprint={2604.02296},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02296}
}