This is the official repository for the ICCV 2025 paper:
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
Authors: Rundong Luo¹, Matthew Wallingford², Ali Farhadi², Noah Snavely¹, Wei-Chiu Ma¹
¹Cornell University, ²University of Washington
360° videos have emerged as a promising medium to represent our dynamic visual world. Compared to the "tunnel vision" of standard cameras, their borderless field of view offers a more holistic perspective of our surroundings. However, while existing video models excel at producing standard videos, their ability to generate full panoramic videos remains elusive. In this paper, we investigate the task of video-to-360° generation: given a perspective video as input, our goal is to generate a full panoramic video that is coherent with the input. Unlike conventional video generation tasks, the output's field of view is significantly larger, and the model is required to have a deep understanding of both the spatial layout of the scene and the dynamics of objects to maintain geometric and dynamic consistency with the input. To address these challenges, we first leverage the abundant 360° videos available online and develop a high-quality data filtering pipeline to curate pairwise training data. We then carefully design a series of geometry- and motion-aware modules to facilitate the learning process and improve the quality of 360° video generation. Experimental results demonstrate that our model can generate realistic and coherent 360° videos from arbitrary, in-the-wild perspective inputs. Additionally, we showcase its potential applications, including video stabilization, camera viewpoint control, and interactive visual question answering.
The framework is divided into three independent modules:
- 360° Video Generation
- Camera Trajectory Prediction
- Video Enhancement
Each module requires a separate Python environment due to potential package conflicts.
```
git clone --recurse-submodules https://github.com/Red-Fairy/argus-code
```

If you already cloned the repo without submodules:

```
git submodule update --init --recursive
```

- Create a conda environment named `360VG`.
- Install compatible versions of `torch` and `torchvision`.
- Install dependencies from `requirements.txt`.
Note: Before installing the other packages, install `numpy<2` to ensure compatibility with faiss.
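As an optional sanity check (not part of the official setup), the snippet below verifies that numpy is below 2.x and that faiss imports cleanly:

```python
# Optional sanity check: faiss builds against numpy 1.x can break under numpy >= 2.
import numpy as np
import faiss  # should import without errors once numpy<2 is installed

assert int(np.__version__.split(".")[0]) < 2, (
    f"numpy {np.__version__} detected; install numpy<2 for faiss compatibility"
)
print(f"faiss {faiss.__version__} works with numpy {np.__version__}")
```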
We support two calibration methods: MASt3R and MegaSaM.
```
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
git clone https://github.com/jenicek/asmk
cd asmk/cython/
cythonize *.pyx
cd ..
pip install .
cd ..
```

We recommend creating a new conda environment for MegaSaM due to package conflicts.
Follow the instructions in the MegaSaM repository.
Required checkpoints:
- DepthAnything → `mega-sam/Depth-Anything/checkpoints/depth_anything_vitl14.pth`
- RAFT → `mega-sam/cvd_opt/raft-things.pth`
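As an optional convenience (not part of the official pipeline), a short check that both checkpoints are where the paths above expect them:

```python
# Optional: confirm the MegaSaM checkpoints listed above are in place.
from pathlib import Path

REQUIRED_CHECKPOINTS = [
    "mega-sam/Depth-Anything/checkpoints/depth_anything_vitl14.pth",
    "mega-sam/cvd_opt/raft-things.pth",
]

missing = [p for p in REQUIRED_CHECKPOINTS if not Path(p).is_file()]
if missing:
    print("Missing checkpoints:", *missing, sep="\n  ")
else:
    print("All MegaSaM checkpoints found.")
```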
To visualize camera trajectories:
```
pip install viser pyliblzfse
```

We recommend a separate conda environment named `venhancer`. Refer to VEnhancer to set up the enhancement pipeline.
First, download our pretrained model from Google Drive and place it in the checkpoints folder.
Launch the Gradio demo:
```
python gradio_demo.py
```

To specify environments for depth and pose estimation:

```
python gradio_demo.py \
    --mono_depth_env mega_sam \
    --camera_pose_env mega_sam \
    --venhancer_env venhancer
```

You can configure generation parameters, select calibration methods, and enable enhancement through the UI.
Run:
```
bash scripts/test/inference.sh \
    [PATH_TO_UNET] \
    [PATH_TO_VIDEO_OR_VIDEO_FOLDER] \
    [PATH_TO_SAVE_FOLDER] \
    [GUIDANCE_SCALE] \
    [NUM_INFERENCE_STEPS]
```

Download the videos listed in process_data/clips_info.jsonl.
We recommend using the yt-dlp tool to download the videos from YouTube.
Split each downloaded video into 10-second clips.
The start and end frames of each clip are specified in process_data/clips_info.jsonl.
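A rough sketch of this download-and-split step is shown below; the JSONL field names (video_id, clip_id, start_frame, end_frame, fps) are assumptions for illustration only, so check the actual schema of process_data/clips_info.jsonl and adapt the keys accordingly.

```python
# Hypothetical download + clip-splitting sketch. The record keys below are
# assumptions, not the actual schema of process_data/clips_info.jsonl.
import json
import subprocess
from pathlib import Path

records = [json.loads(line) for line in
           Path("process_data/clips_info.jsonl").read_text().splitlines() if line.strip()]
raw_dir, clip_dir = Path("raw_videos"), Path("clips")
raw_dir.mkdir(exist_ok=True)
clip_dir.mkdir(exist_ok=True)

# 1) Download each unique video once with yt-dlp.
for vid in sorted({r["video_id"] for r in records}):
    subprocess.run(["yt-dlp", "-o", str(raw_dir / f"{vid}.%(ext)s"),
                    f"https://www.youtube.com/watch?v={vid}"], check=True)

# 2) Cut each 10-second clip with ffmpeg, converting frame indices to seconds via fps.
for r in records:
    src = next(raw_dir.glob(f"{r['video_id']}.*"))
    start, end = r["start_frame"] / r["fps"], r["end_frame"] / r["fps"]
    out = clip_dir / f"{r['video_id']}_{r['clip_id']}.mp4"
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-ss", f"{start:.3f}",
                    "-to", f"{end:.3f}", str(out)], check=True)
```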
As described in the supplementary material, our data processing pipeline consists of the following steps:
- Format Filtering. We first sample frames and compute the horizontal image gradient along the 180° boundary to verify whether the frame is 360°. We also compute the vertical gradient at the image center to remove videos that are split into two halves.
- Intra-frame Filtering. To detect improperly formatted videos, we compute LPIPS between the left and right halves to filter out 180° videos, and between the top and bottom halves to filter incorrectly formatted 360° videos.
- Inter-frame Filtering. To ensure temporal dynamics, we sample frames at random intervals and compute pixel variance. Static videos with minimal inter-frame variation are discarded. Videos with excessive black pixels are also removed.
- Clip-level Filtering. We divide videos into 10-second clips and further filter them by (i) low optical flow magnitude using RAFT, (ii) shot boundary detection with TransNet-v2, and (iii) text detection with DPText-DETR applied to unwrapped perspective views. All these operations are performed on six 90° FoV projections of each video.
We provide sample scripts in process_data/pipeline to run the data processing pipeline for the first three steps. You don't need to run the pipeline if you are using our provided clips.
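For intuition, here is a minimal, self-contained sketch of one plausible version of the step-1 seam check (the scripts in process_data/pipeline are the reference implementation; the threshold below is arbitrary):

```python
# Sketch of a 360° format check on a single equirectangular frame (illustrative only).
# A true 360° frame wraps horizontally, so the gradient across the left/right seam
# should be comparable to gradients elsewhere; a much larger seam gradient suggests
# the video is not a full panorama.
import numpy as np

def seam_gradient_ratio(frame: np.ndarray) -> float:
    """frame: HxWx3 float array in [0, 1], assumed to be in equirectangular layout."""
    gray = frame.mean(axis=2)
    seam = np.abs(gray[:, -1] - gray[:, 0]).mean()    # gradient across the wrap-around boundary
    interior = np.abs(np.diff(gray, axis=1)).mean()   # typical interior horizontal gradient
    return seam / (interior + 1e-8)

frame = np.random.rand(512, 1024, 3)  # stand-in for a decoded video frame
THRESHOLD = 5.0  # arbitrary; tune on real data
print("suspicious format" if seam_gradient_ratio(frame) > THRESHOLD else "plausible 360° frame")
```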
The clips used for training are listed in two files:
- `process_data/clips_filtered.txt`: used for Stage 1 training.
- `process_data/clips_filtered_high_quality.txt`: used for Stage 2 training with stricter filtering criteria.

Each line in these files is formatted as `video_category\tvideo_id\tclip_id`.
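For reference, a tiny parser for these tab-separated lists might look like this (a sketch, not part of the released code):

```python
# Read a clip list: each line is "video_category<TAB>video_id<TAB>clip_id".
from pathlib import Path

def load_clip_list(path: str) -> list[tuple[str, str, str]]:
    clips = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            category, video_id, clip_id = line.split("\t")
            clips.append((category, video_id, clip_id))
    return clips

stage1_clips = load_clip_list("process_data/clips_filtered.txt")
print(len(stage1_clips), "clips for Stage 1 training")
```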
Begin by training the model using video clips at a resolution of 384×768.
Before running the script, update the following environment variables in scripts/train/train.sh:
- TRAIN_DATASET_PATH and VAL_DATASET_PATH: directories containing the training and validation video clips.
- TRAIN_CLIP_FILE_PATH and VAL_CLIP_FILE_PATH: text files listing the clip metadata for training and validation, formatted as:
`video_category\tvideo_id\tclip_id`.
To launch the training process:
```
bash scripts/train/train.sh [EXPERIMENT_NAME] 384 100000
```

Next, fine-tune the model using high-quality clips at a resolution of 512×1024.
Again, modify the same four paths in scripts/train/train.sh as described above.
Additionally, provide the path to the pretrained UNet model from Stage 1.
To start Stage 2 training:
```
bash scripts/train/train.sh [EXPERIMENT_NAME] 512 20000 [PRETRAIN_UNET_PATH_OF_STAGE_1]
```

The 101 evaluation clips and the real/simulated camera trajectories can be found here: Google Drive. For the trajectories, the FOVs are in degrees and the roll/pitch/yaw angles are in radians. In our convention, positive yaw rotates rightward, positive pitch looks upward, and positive roll tilts rightward.
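If you need these angles as rotation matrices, a sketch like the one below may help; the axis assignments and composition order (a Y-up camera frame with intrinsic yaw → pitch → roll rotations) are assumptions and should be verified against the released trajectory files.

```python
# Sketch: convert one trajectory sample (radians) into a 3x3 rotation matrix.
# Assumption: Y-up camera frame, intrinsic yaw (about Y) -> pitch (about X) -> roll (about Z);
# verify axis signs and order against the released trajectories before relying on this.
import numpy as np
from scipy.spatial.transform import Rotation as R

def trajectory_rotation(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """+yaw looks right, +pitch looks up, +roll tilts right (per the convention above)."""
    return R.from_euler("YXZ", [yaw, pitch, roll]).as_matrix()

fov_rad = np.deg2rad(90.0)  # FOVs in the trajectory files are given in degrees
print(trajectory_rotation(np.pi / 6, 0.1, 0.0), fov_rad)
```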
If you find our work useful for your research or projects, please cite our paper:
@inproceedings{luo2025beyond,
title = {Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos},
author = {Luo, Rundong and Wallingford, Matthew and Farhadi, Ali and Snavely, Noah and Ma, Wei-Chiu},
booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025},
url = {https://arxiv.org/abs/2504.07940}
}