
Introducing SAM-Audio: Segment Anything in Audio
Meta AI releases SAM-Audio, the first foundation model for promptable audio segmentation using text, visual, and temporal prompts.
Foundation model for isolating any sound using text, visual, or temporal prompts • Meta AI Research • Fully open source

SAM-Audio combines advanced audio processing with Meta's Perception-Encoder Audio-Visual (PE-AV) technology to enable precise sound segmentation.
Multi-modal input: accepts audio waveforms together with text descriptions, video frames, or time spans as prompts
Audio-visual encoder: the Perception-Encoder Audio-Visual (PE-AV) model extracts features via large-scale multimodal correspondence learning
Prompt encoder: encodes text, visual, or temporal prompts into a unified embedding representation
Separation network: a large-scale neural network that performs source separation conditioned on the prompt
Output: generates the separated target audio and the residual (remainder) as high-fidelity waveforms
First model to support text, visual, and temporal prompts in a single unified framework for audio separation
Built on large-scale multimodal correspondence learning between audio and visual modalities
Includes SAM-Audio Judge model for evaluating separation quality with multiple metrics
pip install sam-audio transformers torch torchaudio && huggingface-cli login

from sam_audio import SAMAudio, SAMAudioProcessor
model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval()
processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')  # matching processor (from_pretrained pattern assumed)
inputs = processor(audios=['audio.wav'], descriptions=['A person speaking'])
result = model.separate(inputs)


Learn how to write effective text prompts for precise audio segmentation with SAM-Audio.

Discover how SAM-Audio's visual prompting leverages PE-AV to map video objects to their corresponding sounds.

Learn how the SAM-Audio Judge model evaluates separation quality with four key metrics.

Use time-based prompts to teach SAM-Audio exactly what sound to isolate.

Deep dive into Perception-Encoder Audio-Visual technology that powers SAM-Audio's multimodal capabilities.
"SAM-Audio is a game-changer for audio post-production. Being able to isolate sounds with simple text descriptions saves hours of manual editing. The quality of separation is impressive, especially for dialogue extraction."
"The unified approach to audio segmentation is brilliant. Supporting text, visual, and temporal prompts in one model is exactly what the field needed. PE-AV integration makes visual prompting incredibly powerful."
"I use SAM-Audio to clean up interview footage and extract specific sounds from B-roll. The visual prompting feature is amazing - I can point at a person in the video and get just their audio. Massive time-saver!"
"The ability to isolate individual instruments from a mix is incredible. I've used it for remixing, creating stems, and learning from recordings. The Judge model helps me verify I'm getting clean separations."
"We're using SAM-Audio to improve audio accessibility in our applications. Being able to separate and enhance speech while removing background noise has significantly improved comprehension for hearing-impaired users."
"SAM-Audio handles overlapping speakers beautifully. The temporal anchor feature lets me train it on clean sections, then it separates the rest automatically. Best audio separation tool I've used."
SAM-Audio is a foundation model developed by Meta AI for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans. It extends the promptable-segmentation idea behind the Segment Anything Model (SAM) to the audio domain and represents the first unified approach to promptable audio segmentation.
SAM-Audio supports three types of prompting: (1) Text Prompting - use natural language descriptions like 'a person coughing' or 'piano playing a melody', (2) Visual Prompting - isolate sounds associated with specific visual objects in video using masked video frames, and (3) Span Prompting (Temporal Anchors) - specify time ranges where the target sound occurs or doesn't occur to provide examples.
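As a rough illustration, the snippet below sketches how the three prompt types might be passed to the processor from the quick start. Only the text-prompt call mirrors the documented example; the 'videos' and 'anchor_spans' argument names for visual and span prompting are hypothetical placeholders, not confirmed parameter names.

    from sam_audio import SAMAudio, SAMAudioProcessor

    model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval()
    processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')

    # (1) Text prompting: describe the target sound in natural language.
    text_inputs = processor(audios=['mix.wav'], descriptions=['a person coughing'])

    # (2) Visual prompting: point at the sounding object with masked video frames.
    #     'videos' is a hypothetical argument name used only for illustration.
    # visual_inputs = processor(audios=['mix.wav'], videos=['masked_frames.mp4'])

    # (3) Span prompting: mark time ranges (in seconds) where the target sound
    #     does or doesn't occur. 'anchor_spans' is likewise hypothetical.
    # span_inputs = processor(audios=['mix.wav'], anchor_spans=[[(0.0, 2.5)]])

    result = model.separate(text_inputs)  # returns the separated target and the residual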
SAM-Audio is available in multiple sizes: sam-audio-small, sam-audio-base, and sam-audio-large, each with a corresponding -tv variant. The large model (facebook/sam-audio-large) offers the best performance and is recommended for most use cases.
You need to request access to the checkpoints on the SAM-Audio Hugging Face repository and authenticate with 'huggingface-cli login'. The models are gated and require accepting Meta's terms of use. After approval, you can use the models via the HuggingFace Transformers library.
SAM-Audio Judge (facebook/sam-audio-judge) is a companion model for evaluating the quality of audio separation results. It provides four quality metrics: overall quality, recall, precision, and faithfulness. This helps you assess how well the separated audio matches your text description.
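As a sketch only, the snippet below shows what scoring a separation with the Judge model could look like. The SAMAudioJudge class name, its score() method, and attribute access on the separation result are hypothetical placeholders; only the facebook/sam-audio-judge checkpoint id and the four metric names come from the description above.

    from sam_audio import SAMAudioJudge  # hypothetical class name / import path

    judge = SAMAudioJudge.from_pretrained('facebook/sam-audio-judge').eval()

    # Score the separated waveform against the text prompt used to produce it.
    # The score() call and its return format are assumptions for illustration.
    scores = judge.score(audio=result.target, description='A person speaking')
    print(scores)  # expected to report overall quality, recall, precision, and faithfulness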
Yes! You can enable candidate re-ranking by setting 'reranking_candidates=8' when calling model.separate(). This improves separation quality at the cost of increased latency: the model evaluates multiple separation candidates and returns the best one based on the Judge model's scoring.
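Concretely, this is just an extra keyword argument on the separation call from the quick start:

    # Evaluate 8 candidate separations and keep the best-scoring one;
    # larger values trade extra latency for higher-quality output.
    result = model.separate(inputs, reranking_candidates=8)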
PE-AV (Perception-Encoder Audio-Visual) is a large-scale multimodal model that SAM-Audio relies on for understanding correspondence between audio and visual modalities. It enables the visual prompting capability by learning relationships between what you see and what you hear, making it possible to isolate sounds based on visual cues.
SAM-Audio can be used for: extracting dialogue from noisy videos, isolating individual instruments in music recordings, removing background noise from recordings, separating overlapping voices in podcasts, creating karaoke tracks by removing vocals, extracting sound effects from movies, and cleaning up audio for accessibility purposes.
The model.separate() method returns two outputs: 'target' (the isolated sound you asked for) and 'residual' (everything else / the remainder). Both are torch.Tensor waveforms that you can save as audio files using libraries like torchaudio.
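For example, saving both outputs with torchaudio might look like the sketch below. Whether the outputs are exposed as attributes or dictionary entries, and the output sample rate, are assumptions to check against the model's documentation.

    import torchaudio

    sample_rate = 16000                 # placeholder: use the model's actual output sample rate
    target = result.target.cpu()        # attribute access assumed; may instead be result['target']
    residual = result.residual.cpu()

    # torchaudio.save expects a (channels, time) tensor, so add a channel dim if needed.
    if target.dim() == 1:
        target = target.unsqueeze(0)
    if residual.dim() == 1:
        residual = residual.unsqueeze(0)

    torchaudio.save('target.wav', target, sample_rate)
    torchaudio.save('residual.wav', residual, sample_rate)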
Yes, SAM-Audio is fully open source and licensed under the SAM License. The model weights, code, and research paper are all publicly available. You can find the official implementation at github.com/facebookresearch/sam-audio and models on Hugging Face.