MARS-Sep

Multimodal-Aligned Reinforced Sound Separation

Zihan Zhang 1,* Xize Cheng 1,* Zhennan Jiang 2 Dongjie Fu 1 Jingyuan Chen 1 Zhou Zhao 1 Tao Jin 1,†

1Zhejiang University     2CASIA
†Corresponding author. *Equal contribution

ICLR 2026
Code Paper 🤗 Model

Abstract

Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. We take a preference alignment perspective on this problem, analogous to aligning LLMs with human intent. To address the misalignment, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is steered by a preference reward model and optimized with a stable, clipped trust-region surrogate. The reward, derived from a progressively aligned audio-text-vision encoder, directly incentivizes semantic consistency with query prompts. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in both signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
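To make the abstract's terminology concrete, the following is a minimal sketch of a factorized Beta mask policy scored by a clipped surrogate objective. It is an illustration under stated assumptions, not the released MARS-Sep implementation: the clipping follows the common PPO-style form, the function names (`beta_log_prob`, `sample_mask`, `mask_log_prob`, `clipped_surrogate`), the `clip_eps` value, and the toy Beta parameters are hypothetical, and the reward and baseline are placeholder numbers standing in for the score from the audio-text-vision reward model.

```python
import math
import random


def beta_log_prob(m, a, b):
    """Log-density of a Beta(a, b) distribution at m, with 0 < m < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(m) + (b - 1.0) * math.log(1.0 - m)


def sample_mask(alphas, betas, rng, eps=1e-6):
    """Draw one mask value per time-frequency bin, independently per bin."""
    # Clamp away from 0 and 1 so the log-density stays finite.
    return [min(max(rng.betavariate(a, b), eps), 1.0 - eps)
            for a, b in zip(alphas, betas)]


def mask_log_prob(mask, alphas, betas):
    """Factorized policy: the joint log-probability is a sum over bins."""
    return sum(beta_log_prob(m, a, b) for m, a, b in zip(mask, alphas, betas))


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped objective (to be maximized) for one sampled mask."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)


# Toy example with three time-frequency bins (all numbers are made up).
rng = random.Random(0)
old_a, old_b = [2.0, 2.0, 5.0], [2.0, 5.0, 2.0]   # behaviour policy parameters
new_a, new_b = [2.2, 2.0, 5.5], [2.0, 5.5, 2.0]   # updated policy parameters

mask = sample_mask(old_a, old_b, rng)
logp_old = mask_log_prob(mask, old_a, old_b)
logp_new = mask_log_prob(mask, new_a, new_b)

reward, baseline = 0.8, 0.5   # placeholder for the reward-model score
objective = clipped_surrogate(logp_new, logp_old, reward - baseline)
```

In a real system the Beta parameters would be predicted per bin by the separation network from the mixture and the query, and the objective would be averaged over a batch and maximized by gradient ascent. The clipping keeps each update close to the policy that sampled the mask, which is the stabilizing role the abstract attributes to the trust-region surrogate.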

Qualitative samples in the MARS-Sep paper.

Columns: Query, Mixture, Interference, Target, OmniSep, MARS-Sep (four sample rows; images not reproduced here).

More samples with Queries of Different Modalities.

1. Text-Query

Columns: Query, Mixture, Interference, Target, Prediction (five sample rows; images not reproduced here).

2. Image-Query

Columns: Query, Mixture, Interference, Target, Prediction (five sample rows; images not reproduced here).

3. Audio-Query

Columns: Query, Mixture, Interference, Target, Prediction (five sample rows; images not reproduced here).

More samples in the wild.

Columns: Video, Query (Text/Image/Audio), Separated Audio (media not reproduced here).

Queries shown: motorcycle running; male commentary; elephant trumpeting.