InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression

📰 News

[2026.03.17] 🎉 We release our checkpoint trained on longer videos!
[2026.02.06] 🎉 Our paper has been selected for an oral presentation! We will release our tokenizer checkpoints soon.
[2026.02.03] 📝 Check out our website for details about the intuition and results!
[2026.01.26] 🎉 Our paper has been accepted at ICLR 2026!

🔬 Overview

InfoTok is an adaptive discrete video tokenizer that allocates tokens according to informational content. Unlike traditional tokenizers that use a fixed compression rate, InfoTok tokenizes videos into 1D sequences such that the information carried by each token is balanced in a principled way, improving both efficiency and semantic structure.

InfoTok adaptively tokenizes videos based on their complexity, achieving a highly compact representation.

InfoTok achieves superior reconstruction under identical compression rates.

🛠️ Installation

Prerequisites

Linux (tested on Ubuntu 20.04, 22.04, 24.04)
Python 3.10.x
NVIDIA GPU (H100-80GB or A100-80GB recommended)

Setup

# Create conda environment
conda env create --file infotok.yaml
conda activate infotok

# Set PYTHONPATH
export PYTHONPATH=$(pwd)

# Install dependencies
pip install -r requirements.txt

Checkpoint

We release the InfoTok-Flex checkpoint, post-trained with 81-frame temporal windows at different resolutions on non-square videos. Download it with:

git lfs install

git clone https://huggingface.co/qyoo/infotok-flex

The checkpoint is saved to infotok-flex/infotok_mse.pt.

🚀 Inference

Quick Start

# Reconstruction
bash exp_scripts/infotok_inference.sh

# Visualize token usage details
bash exp_scripts/infotok_inference.sh visualize_mask

View the results in outputs/.

Detailed Usage

python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern "/path/to/videos/*.mp4" \
    --checkpoint "/path/to/infotok_mse.pt" \
    --output_dir "/path/to/output" \
    --tokenizer_type OURS4x8x8-mse-256p-88 \
    --temporal_window 81 \
    --overlap_window 3 \
    --strategy global_elbo \
    --avg_rate 0.5 \
    --mode torch

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--video_pattern` | Glob pattern for input videos | Required |
| `--checkpoint` | Path to model checkpoint | Required |
| `--output_dir` | Output directory | Required |
| `--tokenizer_type` | Model architecture | `OURS4x8x8-mse-256p-88` |
| `--temporal_window` | Frames per window | `81` |
| `--overlap_window` | Overlap frames for blending | `3` |
| `--strategy` | Rate allocation (`global_elbo` or `elbo`) | `elbo` |
| `--avg_rate` | Target average token usage ratio (0.0625~1.0) | `0.5` |

With `global_elbo`, the token budget is distributed across all temporal frames according to their individual ELBO values, so allocation is global. In contrast, `elbo` applies the same `avg_rate` to every clip. Tokens are still adaptively masked within each clip by the ELBO router, so individual blocks end up with different rates.
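To make the difference between the two strategies concrete, the toy sketch below assigns per-clip rates both ways. It is an illustration only, not InfoTok's router: the function names, the proportional-to-score rule, and the example scores are assumptions, and how InfoTok turns ELBO values into an allocation is not shown here. The only values taken from this README are the `avg_rate` bounds.

```python
MIN_RATE, MAX_RATE = 0.0625, 1.0  # avg_rate bounds from the parameter table


def per_clip_rates(scores, avg_rate):
    """`elbo`-style: every clip receives the same target rate."""
    return [avg_rate for _ in scores]


def global_rates(scores, avg_rate):
    """`global_elbo`-style (assumed rule): share one budget across clips.

    `scores` are hypothetical per-clip complexity scores, where a higher
    score means the clip should receive more tokens.
    """
    total = sum(scores)
    if total == 0:
        # No clip stands out, so fall back to a uniform allocation.
        return per_clip_rates(scores, avg_rate)
    budget = avg_rate * len(scores)  # total rate to hand out
    raw = [budget * s / total for s in scores]
    # Clamping keeps each rate valid but can shift the mean slightly;
    # a real allocator would redistribute the clipped excess.
    return [min(MAX_RATE, max(MIN_RATE, r)) for r in raw]


if __name__ == "__main__":
    scores = [1.0, 2.0, 3.0]  # made-up example scores
    print("elbo:       ", per_clip_rates(scores, 0.5))
    print("global_elbo:", global_rates(scores, 0.5))
```

In this toy setting both strategies spend the same average budget, but only the global variant moves tokens from simple clips to complex ones.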

📊 Evaluation

Dataset Preparation

pip install -U "huggingface_hub[cli]"

# TokenBench 240p
huggingface-cli download --repo-type dataset qyoo/tokenbench_240p --local-dir ./tokenbench_240p

# DAVIS 240p
huggingface-cli download --repo-type dataset qyoo/davis_240p --local-dir ./davis_240p

Reconstruction

python3 -m cosmos_predict1.tokenizer.inference.video_cli \
    --video_pattern "tokenbench_240p/*.mp4" \
    --checkpoint infotok-flex/infotok_mse.pt \
    --output_dir infotok_tokenbench_240p \
    --tokenizer_type OURS4x8x8-mse-256p-88 \
    --temporal_window 81 \
    --overlap_window 3 \
    --strategy elbo \
    --avg_rate 0.5 \
    --mode torch
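The expected results in this README cover two token rates, so it can be convenient to prepare one reconstruction run per rate. The following is a minimal sketch that only assembles the commands from the flags documented above; the helper name `build_cmd` and the per-rate output directory naming are assumptions, not part of the repository.

```python
import shlex


def build_cmd(video_pattern, checkpoint, output_dir, avg_rate, strategy="elbo"):
    """Assemble the video_cli invocation shown above as an argv list."""
    return [
        "python3", "-m", "cosmos_predict1.tokenizer.inference.video_cli",
        "--video_pattern", video_pattern,
        "--checkpoint", checkpoint,
        "--output_dir", output_dir,
        "--tokenizer_type", "OURS4x8x8-mse-256p-88",
        "--temporal_window", "81",
        "--overlap_window", "3",
        "--strategy", strategy,
        "--avg_rate", str(avg_rate),
        "--mode", "torch",
    ]


if __name__ == "__main__":
    for rate in (0.5, 0.75):
        cmd = build_cmd(
            "tokenbench_240p/*.mp4",
            "infotok-flex/infotok_mse.pt",
            f"infotok_tokenbench_240p_rate{rate}",  # hypothetical naming scheme
            rate,
        )
        # Print a copy-pasteable command line. To execute instead, pass the
        # list to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Passing the arguments as a list keeps the glob pattern unexpanded, which matches the quoted `--video_pattern` in the commands above.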

Evaluation

We use the TokenBench repository for evaluation. Please follow its setup instructions.

git clone https://github.com/NVlabs/TokenBench.git
cd TokenBench

pip install -r requirements.txt

For example, when evaluating reconstruction on TokenBench-240p:

python3 -m token_bench.metrics_cli --mode=psnr \
        --gtpath ../tokenbench_240p --targetpath ../infotok_tokenbench_240p
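For reference, PSNR compares a reconstruction against the ground truth through the mean squared error: PSNR = 10 * log10(MAX^2 / MSE), in dB. The sketch below computes it for flat sequences of 8-bit pixel values. It illustrates the metric only and is not TokenBench's implementation, which may aggregate over frames and videos differently; the function name `psnr` is chosen here for illustration.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    if len(reference) == 0:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [52, 55, 61, 66]  # made-up pixel values
    decoded = [51, 55, 62, 64]
    print(f"PSNR: {psnr(original, decoded):.2f} dB")
```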

Expected Result

| Dataset | Token Rate | Temporal Window | PSNR | SSIM |
| --- | --- | --- | --- | --- |
| TokenBench-240p | 0.75 | 81 | 29.7088 | 0.8786 |
| TokenBench-240p | 0.5 | 81 | 28.9674 | 0.8522 |
| DAVIS-240p | 0.75 | 81 | 26.1951 | 0.7994 |
| DAVIS-240p | 0.5 | 81 | 25.1283 | 0.7529 |

🏃 Post-Training

Additional Dependencies (Training Only)

Post-training requires additional dependencies:

# Patch Transformer Engine linking
ln -sf $CONDA_PREFIX/lib/python3.10/site-packages/nvidia/*/include/* $CONDA_PREFIX/include/
ln -sf $CONDA_PREFIX/lib/python3.10/site-packages/nvidia/*/include/* $CONDA_PREFIX/include/python3.10

# Install Transformer Engine
pip install "transformer-engine[pytorch]==1.12.0"

# Install Apex
git clone https://github.com/NVIDIA/apex && cd apex
CUDA_HOME=$CONDA_PREFIX pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
    --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" .
cd ..

Prepare Dataset

Register your dataset in cosmos_predict1/tokenizer/training/datasets/dataset_provider.py:

_VIDEO_PATTERN_DICT = {
    "custom_video": "/path/to/videos/*.mp4",
}
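Before launching a run, it can help to confirm that each registered glob pattern actually matches files. The check below is a standalone sketch: `count_matches` is a hypothetical helper, not part of `dataset_provider.py`, and the dictionary simply mirrors the example entry above.

```python
import glob


def count_matches(pattern_dict):
    """Return {dataset_name: number of files matched by its glob pattern}."""
    return {name: len(glob.glob(pattern)) for name, pattern in pattern_dict.items()}


if __name__ == "__main__":
    # Mirrors the example registration; replace with your own patterns.
    patterns = {"custom_video": "/path/to/videos/*.mp4"}
    for name, n in count_matches(patterns).items():
        status = "ok" if n > 0 else "NO FILES MATCHED"
        print(f"{name}: {n} file(s) [{status}]")
```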

Run Training

Single GPU (Debug):

bash exp_scripts/infotok_posttrain.sh

Multi-GPU:

export PYTHONPATH=$(pwd)
export OUTPUT_ROOT="/path/to/checkpoints"

python -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint=localhost:29501 \
    -m cosmos_predict1.tokenizer.training.train \
    --config=cosmos_predict1/tokenizer/training/configs/config.py -- \
    experiment=ADV4x8x8_256p_CUSTOM_Posttrain \
    checkpoint.load_path=/path/to/infotok_mse.pt \
    checkpoint.strict_resume=False \
    checkpoint.load_training_state=False \
    dataloader_train.batch_size=1 \
    dataloader_train.dataset.num_video_frames=33

Output Structure

checkpoints/
└── infotok_posttraining/tokenizer/{NAME}/checkpoints/
    ├── iter_{N}.pt           # Full checkpoint
    ├── iter_{N}_enc.jit      # Encoder (JIT)
    ├── iter_{N}_dec.jit      # Decoder (JIT)
    └── iter_{N}_ema.jit      # EMA model
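When resuming or evaluating, the most recent full checkpoint can be located by parsing the iteration number out of the `iter_{N}.pt` file names. This is a minimal sketch, assuming `{N}` is a plain (optionally zero-padded) integer; `latest_checkpoint` is a hypothetical helper, not a repository function.

```python
import re
from pathlib import Path

# Matches full checkpoints only, not the *_enc.jit / *_dec.jit / *_ema.jit exports.
_ITER_RE = re.compile(r"^iter_(\d+)\.pt$")


def latest_checkpoint(ckpt_dir):
    """Return the Path of the iter_{N}.pt file with the largest N, or None."""
    best = None
    for path in Path(ckpt_dir).iterdir():
        match = _ITER_RE.match(path.name)
        if match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, path)
    return best[1] if best else None
```

The returned path could then be supplied as `checkpoint.load_path` in the training command or as `--checkpoint` for inference.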

📜 Citation

If you find InfoTok useful, please cite:

@misc{ye2025infotok,
      title={InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression}, 
      author={Haotian Ye and Qiyuan He and Jiaqi Han and Puheng Li and Jiaojiao Fan and Zekun Hao and Fitsum Reda and Yogesh Balaji and Huayu Chen and Sheng Liu and Angela Yao and James Zou and Stefano Ermon and Haoxiang Wang and Ming-Yu Liu},
      year={2025},
      eprint={2512.16975},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.16975}, 
}

🙏 Acknowledgments

InfoTok is built on NVIDIA Cosmos-Predict1. We thank the Cosmos team for their infrastructure and pre-trained models.
