SAM-Audio: Segment Anything in Audio

Foundation model for isolating any sound using text, visual, or temporal prompts • Meta AI Research • Fully open source

SAM-Audio architecture diagram showing audio separation process

What is SAM-Audio?

SAM-Audio is a groundbreaking foundation model from Meta AI that revolutionizes audio separation. Unlike traditional audio editing tools that require complex manual processing, SAM-Audio can isolate any sound from complex audio mixtures using simple natural language descriptions, visual cues from video, or time spans. Built on Meta's Segment Anything Model (SAM) technology, it brings the same promptable segmentation capabilities that transformed computer vision to the audio domain.

Whether you need to extract dialogue from a noisy recording, isolate a specific instrument from a musical performance, or separate overlapping sounds in a video, SAM-Audio makes it as simple as describing what you want. What sets it apart is its versatility: it supports text prompting (describe the sound), visual prompting (point to objects in video), and span prompting (specify time ranges). This unified approach to audio segmentation makes professional-grade audio editing accessible to everyone.

- Text-based sound isolation
- Visual object-to-audio mapping
- Temporal span anchoring
- Fully open source

SAM-Audio Architecture

SAM-Audio combines advanced audio processing with Meta's Perception-Encoder Audio-Visual (PE-AV) technology to enable precise sound segmentation.

Input Processing (input stage)

Multi-modal input: audio waveforms, text descriptions, video frames, or time spans.

PE-AV Encoder (processing stage)

Perception-Encoder Audio-Visual model for feature extraction, built on multimodal correspondence learning.

Prompt Encoder (core stage)

Encodes text, visual, or temporal prompts into a unified embedding representation: text prompt encoding, visual mask processing, temporal anchor encoding, and cross-modal alignment.

Separation Network (processing stage)

Large-scale neural network that performs source separation conditioned on the prompt embeddings.

Output Generation (output stage)

Generates the separated target audio and the residual (remainder) as high-fidelity audio output.

Key Innovations

Unified Prompting

First model to support text, visual, and temporal prompts in a single unified framework for audio separation

PE-AV Foundation

Built on large-scale multimodal correspondence learning between audio and visual modalities

Quality Assessment

Includes SAM-Audio Judge model for evaluating separation quality with multiple metrics

Get Started in 3 Steps

1. Install & Authenticate

pip install sam-audio transformers torch torchaudio && huggingface-cli login

2. Load the Model

from sam_audio import SAMAudio, SAMAudioProcessor; model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval(); processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')

3. Separate Audio

inputs = processor(audios=['audio.wav'], descriptions=['A person speaking']); result = model.separate(inputs)
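
Putting the three steps together, here is a minimal end-to-end sketch. The processor loader, the .target / .residual attributes on the result, and the sample-rate handling are assumptions inferred from the snippets and FAQ on this page, not confirmed API details.

import torch
import torchaudio

from sam_audio import SAMAudio, SAMAudioProcessor

model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval()
processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')  # assumed loader

# Text prompt: describe the sound to isolate.
inputs = processor(audios=['audio.wav'], descriptions=['A person speaking'])

with torch.inference_mode():
    result = model.separate(inputs)

# 'target' is the isolated sound, 'residual' is everything else (see FAQ below).
sample_rate = getattr(processor, 'sampling_rate', 16000)  # assumed attribute; 16 kHz fallback is a guess
for path, wav in [('target.wav', result.target), ('residual.wav', result.residual)]:
    wav = wav.detach().cpu()
    if wav.dim() == 1:      # torchaudio.save expects [channels, frames]
        wav = wav.unsqueeze(0)
    elif wav.dim() == 3:    # assume [batch, channels, frames]; keep the first item
        wav = wav[0]
    torchaudio.save(path, wav, sample_rate)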

Latest Updates

Introducing SAM-Audio: Segment Anything in Audio

Meta AI releases SAM-Audio, the first foundation model for promptable audio segmentation using text, visual, and temporal prompts.

Meta AI Research
Jan 15, 2025
Mastering Text Prompting in SAM-Audio

Learn how to write effective text prompts for precise audio segmentation with SAM-Audio.

SAM-Audio Team
Jan 16, 2025
Visual-to-Audio Mapping: Isolate Sounds by Pointing at Objects

Discover how SAM-Audio's visual prompting leverages PE-AV to map video objects to their corresponding sounds.

Research Team
Jan 17, 2025
Understanding SAM-Audio Judge: Quality Metrics for Audio Separation

Learn how the SAM-Audio Judge model evaluates separation quality with four key metrics.

SAM-Audio Team
Jan 18, 2025
Span Prompting with Temporal Anchors

Use time-based prompts to teach SAM-Audio exactly what sound to isolate.

Tutorial Team
Jan 19, 2025
The PE-AV Foundation: Multimodal Audio-Visual Learning

Deep dive into Perception-Encoder Audio-Visual technology that powers SAM-Audio's multimodal capabilities.

Meta AI Research
Jan 20, 2025

What the Community Says

"SAM-Audio is a game-changer for audio post-production. Being able to isolate sounds with simple text descriptions saves hours of manual editing. The quality of separation is impressive, especially for dialogue extraction."
Sarah Mitchell, Professional Sound Designer
"The unified approach to audio segmentation is brilliant. Supporting text, visual, and temporal prompts in one model is exactly what the field needed. PE-AV integration makes visual prompting incredibly powerful."
Dr. James Chen, AI Research Scientist
"I use SAM-Audio to clean up interview footage and extract specific sounds from B-roll. The visual prompting feature is amazing - I can point at a person in the video and get just their audio. Massive time-saver!"
Alex Rivera, YouTube Video Editor
"The ability to isolate individual instruments from a mix is incredible. I've used it for remixing, creating stems, and learning from recordings. The Judge model helps me verify I'm getting clean separations."
Marcus Thompson, Independent Musician
"We're using SAM-Audio to improve audio accessibility in our applications. Being able to separate and enhance speech while removing background noise has significantly improved comprehension for hearing-impaired users."
Emily Rodriguez, Assistive Technology Specialist
"SAM-Audio handles overlapping speakers beautifully. The temporal anchor feature lets me train it on clean sections, then it separates the rest automatically. Best audio separation tool I've used."
David Park, Podcast Editor

Frequently Asked Questions

What is SAM-Audio?

SAM-Audio is a foundation model developed by Meta AI for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans. It's built on the Segment Anything Model technology and represents the first unified approach to promptable audio segmentation.

What types of prompts does SAM-Audio support?

SAM-Audio supports three types of prompting: (1) Text Prompting - use natural language descriptions like 'a person coughing' or 'piano playing a melody', (2) Visual Prompting - isolate sounds associated with specific visual objects in video using masked video frames, and (3) Span Prompting (Temporal Anchors) - specify time ranges where the target sound occurs or doesn't occur, giving the model examples of what to isolate.
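
As a rough illustration of how the three modes map to code: the text-prompt call below mirrors the quick-start snippet, while the keyword arguments for visual and span prompting are hypothetical placeholders (shown commented out), since their exact names are not documented on this page.

from sam_audio import SAMAudioProcessor

processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')  # assumed loader

# (1) Text prompting: natural-language description of the target sound.
text_inputs = processor(audios=['mix.wav'], descriptions=['piano playing a melody'])

# (2) Visual prompting: masked video frames that point at the sounding object.
#     'video_frames' and 'masks' are hypothetical keyword names.
# visual_inputs = processor(audios=['mix.wav'], video_frames=frames, masks=object_masks)

# (3) Span prompting: time ranges (in seconds) where the target sound occurs.
#     'anchors' is a hypothetical keyword name.
# span_inputs = processor(audios=['mix.wav'], anchors=[(3.0, 5.5)])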

What model sizes are available?

SAM-Audio is available in multiple sizes: sam-audio-small, sam-audio-base, and sam-audio-large. There are also -tv (TorchVision) variants of each size. The large model (facebook/sam-audio-large) offers the best performance and is recommended for most use cases.

How do I get access to the models?

You need to request access to the checkpoints on the SAM-Audio Hugging Face repository and authenticate with 'huggingface-cli login'. The models are gated and require accepting Meta's terms of use. After approval, you can use the models via the Hugging Face Transformers library.

What is SAM-Audio Judge?

SAM-Audio Judge (facebook/sam-audio-judge) is a companion model for evaluating the quality of audio separation results. It provides four quality metrics: overall quality, recall, precision, and faithfulness. This helps you assess how well the separated audio matches your text description.

Can I improve separation quality with re-ranking?

Yes! You can enable candidate re-ranking by setting 'reranking_candidates=8' when calling model.separate(). This improves quality at the expense of increased latency: the model evaluates multiple separation candidates and returns the best one based on the Judge model's scoring.
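
For example, a sketch reusing the model and processor from the quick start (the reranking_candidates argument name comes from the answer above; everything else follows the earlier snippets):

# Generate several candidates and keep the Judge's highest-scoring separation.
inputs = processor(audios=['noisy_podcast.wav'], descriptions=['A person speaking'])
result = model.separate(inputs, reranking_candidates=8)  # more candidates to score, higher latency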

What is PE-AV?

PE-AV (Perception-Encoder Audio-Visual) is a large-scale multimodal model that SAM-Audio relies on for understanding the correspondence between audio and visual modalities. It enables the visual prompting capability by learning relationships between what you see and what you hear, making it possible to isolate sounds based on visual cues.

What can I use SAM-Audio for?

SAM-Audio can be used for: extracting dialogue from noisy videos, isolating individual instruments in music recordings, removing background noise from recordings, separating overlapping voices in podcasts, creating karaoke tracks by removing vocals, extracting sound effects from movies, and cleaning up audio for accessibility purposes.

What does model.separate() return?

The model.separate() method returns two outputs: 'target' (the isolated sound you asked for) and 'residual' (everything else, i.e. the remainder). Both are torch.Tensor waveforms that you can save as audio files using libraries like torchaudio.

Is SAM-Audio open source?

Yes, SAM-Audio is fully open source and licensed under the SAM License. The model weights, code, and research paper are all publicly available. You can find the official implementation at github.com/facebookresearch/sam-audio and the models on Hugging Face.