SAM-Audio: Segment Anything in Audio

Foundation model for isolating any sound using text, visual, or temporal prompts • Meta AI Research • Fully open source

SAM-Audio architecture diagram showing audio separation process

What is SAM-Audio?

SAM-Audio is a groundbreaking foundation model from Meta AI that revolutionizes audio separation. Unlike traditional audio editing tools that require complex manual processing, SAM-Audio can isolate any sound from complex audio mixtures using simple natural language descriptions, visual cues from video, or time spans. Built on Meta's Segment Anything Model (SAM) technology, it brings the same promptable segmentation capabilities that transformed computer vision to the audio domain.

Whether you need to extract dialogue from a noisy recording, isolate a specific instrument from a musical performance, or separate overlapping sounds in a video, SAM-Audio makes it as simple as describing what you want. What sets it apart is its versatility: it supports text prompting (describe the sound), visual prompting (point to objects in video), and span prompting (specify time ranges). This unified approach to audio segmentation makes professional-grade audio editing accessible to everyone.

- Text-based sound isolation
- Visual object-to-audio mapping
- Temporal span anchoring
- Fully open source

SAM-Audio Architecture

SAM-Audio combines advanced audio processing with Meta's Perception-Encoder Audio-Visual (PE-AV) technology to enable precise sound segmentation.

Input Processing (input stage)

Multi-modal input: audio waveforms, text descriptions, video frames, or time spans.

PE-AV Encoder (processing stage)

Perception-Encoder Audio-Visual model for feature extraction, built on multimodal correspondence learning.

Prompt Encoder (core stage)

Encodes text, visual, or temporal prompts into a unified embedding representation: text prompt encoding, visual mask processing, temporal anchor encoding, and cross-modal alignment.

Separation Network (processing stage)

Large-scale neural network that performs source separation conditioned on the prompt embeddings.

Output Generation (output stage)

Generates the separated target audio and the residual (remainder) as high-fidelity audio output.

Key Innovations

Unified Prompting

First model to support text, visual, and temporal prompts in a single unified framework for audio separation

PE-AV Foundation

Built on large-scale multimodal correspondence learning between audio and visual modalities

Quality Assessment

Includes SAM-Audio Judge model for evaluating separation quality with multiple metrics

Get Started in 3 Steps

1. Install & Authenticate

pip install sam-audio transformers torch torchaudio && huggingface-cli login

2. Load the Model

from sam_audio import SAMAudio, SAMAudioProcessor; model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval(); processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')

3. Separate Audio

inputs = processor(audios=['audio.wav'], descriptions=['A person speaking']); result = model.separate(inputs)
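
Putting the three steps together, here is a minimal end-to-end sketch. The processor loader, the .target / .residual attributes on the result, and the sample-rate handling are assumptions inferred from the snippets and FAQ on this page, not confirmed API details.

import torch
import torchaudio

from sam_audio import SAMAudio, SAMAudioProcessor

model = SAMAudio.from_pretrained('facebook/sam-audio-large').eval()
processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')  # assumed loader

# Text prompt: describe the sound to isolate.
inputs = processor(audios=['audio.wav'], descriptions=['A person speaking'])

with torch.inference_mode():
    result = model.separate(inputs)

# 'target' is the isolated sound, 'residual' is everything else (see FAQ below).
sample_rate = getattr(processor, 'sampling_rate', 16000)  # assumed attribute; 16 kHz fallback is a guess
for path, wav in [('target.wav', result.target), ('residual.wav', result.residual)]:
    wav = wav.detach().cpu()
    if wav.dim() == 1:      # torchaudio.save expects [channels, frames]
        wav = wav.unsqueeze(0)
    elif wav.dim() == 3:    # assume [batch, channels, frames]; keep the first item
        wav = wav[0]
    torchaudio.save(path, wav, sample_rate)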

Latest Updates

Introducing SAM-Audio: Segment Anything in Audio

Meta AI releases SAM-Audio, the first foundation model for promptable audio segmentation using text, visual, and temporal prompts.

Meta AI Research
Jan 15, 2025
Mastering Text Prompting in SAM-Audio

Learn how to write effective text prompts for precise audio segmentation with SAM-Audio.

SAM-Audio Team
Jan 16, 2025
Visual-to-Audio Mapping: Isolate Sounds by Pointing at Objects

Discover how SAM-Audio's visual prompting leverages PE-AV to map video objects to their corresponding sounds.

Research Team
Jan 17, 2025
Understanding SAM-Audio Judge: Quality Metrics for Audio Separation

Learn how the SAM-Audio Judge model evaluates separation quality with four key metrics.

SAM-Audio Team
Jan 18, 2025
Span Prompting with Temporal Anchors

Use time-based prompts to teach SAM-Audio exactly what sound to isolate.

Tutorial Team
Jan 19, 2025
The PE-AV Foundation: Multimodal Audio-Visual Learning

Deep dive into Perception-Encoder Audio-Visual technology that powers SAM-Audio's multimodal capabilities.

Meta AI Research
Jan 20, 2025

What the Community Says

"SAM-Audio is a game-changer for audio post-production. Being able to isolate sounds with simple text descriptions saves hours of manual editing. The quality of separation is impressive, especially for dialogue extraction."
Sarah Mitchell, Professional Sound Designer
"The unified approach to audio segmentation is brilliant. Supporting text, visual, and temporal prompts in one model is exactly what the field needed. PE-AV integration makes visual prompting incredibly powerful."
Dr. James Chen, AI Research Scientist
"I use SAM-Audio to clean up interview footage and extract specific sounds from B-roll. The visual prompting feature is amazing - I can point at a person in the video and get just their audio. Massive time-saver!"
Alex Rivera, YouTube Video Editor
"The ability to isolate individual instruments from a mix is incredible. I've used it for remixing, creating stems, and learning from recordings. The Judge model helps me verify I'm getting clean separations."
Marcus Thompson, Independent Musician
"We're using SAM-Audio to improve audio accessibility in our applications. Being able to separate and enhance speech while removing background noise has significantly improved comprehension for hearing-impaired users."
Emily Rodriguez, Assistive Technology Specialist
"SAM-Audio handles overlapping speakers beautifully. The temporal anchor feature lets me train it on clean sections, then it separates the rest automatically. Best audio separation tool I've used."
David Park, Podcast Editor

Frequently Asked Questions

What is SAM-Audio?

SAM-Audio is a foundation model developed by Meta AI for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans. It's built on the Segment Anything Model technology and represents the first unified approach to promptable audio segmentation.

What types of prompts does SAM-Audio support?

SAM-Audio supports three types of prompting: (1) Text Prompting - use natural language descriptions like 'a person coughing' or 'piano playing a melody', (2) Visual Prompting - isolate sounds associated with specific visual objects in video using masked video frames, and (3) Span Prompting (Temporal Anchors) - specify time ranges where the target sound occurs or doesn't occur, giving the model examples of what to isolate.
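
As a rough illustration of how the three modes map to code: the text-prompt call below mirrors the quick-start snippet, while the keyword arguments for visual and span prompting are hypothetical placeholders (shown commented out), since their exact names are not documented on this page.

from sam_audio import SAMAudioProcessor

processor = SAMAudioProcessor.from_pretrained('facebook/sam-audio-large')  # assumed loader

# (1) Text prompting: natural-language description of the target sound.
text_inputs = processor(audios=['mix.wav'], descriptions=['piano playing a melody'])

# (2) Visual prompting: masked video frames that point at the sounding object.
#     'video_frames' and 'masks' are hypothetical keyword names.
# visual_inputs = processor(audios=['mix.wav'], video_frames=frames, masks=object_masks)

# (3) Span prompting: time ranges (in seconds) where the target sound occurs.
#     'anchors' is a hypothetical keyword name.
# span_inputs = processor(audios=['mix.wav'], anchors=[(3.0, 5.5)])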

What model sizes are available?

SAM-Audio is available in multiple sizes: sam-audio-small, sam-audio-base, and sam-audio-large. There are also -tv (TorchVision) variants of each size. The large model (facebook/sam-audio-large) offers the best performance and is recommended for most use cases.

How do I get access to the models?

You need to request access to the checkpoints on the SAM-Audio Hugging Face repository and authenticate with 'huggingface-cli login'. The models are gated and require accepting Meta's terms of use. After approval, you can use the models via the Hugging Face Transformers library.

What is SAM-Audio Judge?

SAM-Audio Judge (facebook/sam-audio-judge) is a companion model for evaluating the quality of audio separation results. It provides four quality metrics: overall quality, recall, precision, and faithfulness. This helps you assess how well the separated audio matches your text description.

Can I improve separation quality with re-ranking?

Yes! You can enable candidate re-ranking by setting 'reranking_candidates=8' when calling model.separate(). This improves quality at the expense of increased latency: the model evaluates multiple separation candidates and returns the best one based on the Judge model's scoring.
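
For example, a sketch reusing the model and processor from the quick start (the reranking_candidates argument name comes from the answer above; everything else follows the earlier snippets):

# Generate several candidates and keep the Judge's highest-scoring separation.
inputs = processor(audios=['noisy_podcast.wav'], descriptions=['A person speaking'])
result = model.separate(inputs, reranking_candidates=8)  # more candidates to score, higher latency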

What is PE-AV?

PE-AV (Perception-Encoder Audio-Visual) is a large-scale multimodal model that SAM-Audio relies on for understanding the correspondence between audio and visual modalities. It enables the visual prompting capability by learning relationships between what you see and what you hear, making it possible to isolate sounds based on visual cues.

What can I use SAM-Audio for?

SAM-Audio can be used for: extracting dialogue from noisy videos, isolating individual instruments in music recordings, removing background noise from recordings, separating overlapping voices in podcasts, creating karaoke tracks by removing vocals, extracting sound effects from movies, and cleaning up audio for accessibility purposes.

What does model.separate() return?

The model.separate() method returns two outputs: 'target' (the isolated sound you asked for) and 'residual' (everything else, i.e. the remainder). Both are torch.Tensor waveforms that you can save as audio files using libraries like torchaudio.

Is SAM-Audio open source?

Yes, SAM-Audio is fully open source and licensed under the SAM License. The model weights, code, and research paper are all publicly available. You can find the official implementation at github.com/facebookresearch/sam-audio and the models on Hugging Face.