Fish Audio S2

What makes S2 different

Built from the ground up for expressiveness, speed, and openness.

Ultra-Low Latency

A response time under 150 ms enables real-time conversational AI, live dubbing, and interactive voice applications, with production-ready performance that does not compromise quality.

Open Domain Control & Multi-Speaker

Control emotions, paralanguage, and more with natural text instructions. Add laughter, whispers, sighs, and any expressive element. Seamless multi-speaker conversations — switch between speakers naturally within a single generation.

<|speaker:1|> [giggles]
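
A multi-speaker prompt in this style can be assembled programmatically. The following is a minimal sketch, assuming each turn is prefixed with a <|speaker:N|> marker as in the sample and that turns are separated by newlines (the separator is an assumption, not something this page specifies). build_dialogue is a hypothetical helper, not part of the Fish Audio SDK.

```python
def build_dialogue(turns):
    """Join (speaker_id, text) pairs into one multi-speaker prompt string.

    Assumption: every turn starts with a <|speaker:N|> marker and turns
    are separated by newlines. Hypothetical helper, not an SDK function.
    """
    parts = []
    for speaker_id, text in turns:
        parts.append(f"<|speaker:{speaker_id}|> {text.strip()}")
    return "\n".join(parts)


script = build_dialogue([
    (1, "[giggles] You really thought that would work?"),
    (2, "[sighs] It was worth a try."),
])
print(script)
```

The resulting string would then be passed as the text of a single generation request, so that both voices are produced in one pass.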

Fully Open-Source

Both inference code and model weights are fully open-source. Run S2 on your own infrastructure, fine-tune on your data, and integrate without vendor lock-in. Built for transparency and community-driven innovation.

Built with SGLang

Build with the Fish Audio S2 API

Generate lifelike speech in 80+ languages with emotion, direction, and multi-speaker control.

from fishaudio import FishAudio
from fishaudio.utils import save

# Initialize with your API key
client = FishAudio(api_key="your_api_key_here")

# Generate speech
audio = client.tts.convert(text="Fish Audio S2 is the best voice AI model.", model="s2-pro")
save(audio, "welcome.mp3")

Frequently asked questions

What is Fish Audio S2 Pro?

Fish Audio S2 Pro is a leading text-to-speech model with fine-grained inline control of prosody and emotion. Trained on more than 10 million hours of audio across 80+ languages, it combines reinforcement learning alignment with a Dual-Autoregressive (Dual-AR) architecture: a 4B-parameter Slow AR for semantic prediction and a 400M-parameter Fast AR for acoustic detail. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.

How does fine-grained inline control work?

S2 Pro enables localized control over speech generation by embedding natural-language instructions directly within the text using [tag] syntax. Rather than relying on a fixed set of predefined tags, S2 Pro accepts free-form textual descriptions, such as [whisper in small voice], [professional broadcast tone], or [pitch up], allowing open-ended expression control at the word level. Over 15,000 unique tags are supported, including [pause], [emphasis], [laughing], [excited], [whisper], [singing], and many more.
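
Because the instructions live inside the text itself, it can be useful to check which tags a script uses, or what the spoken text looks like without them, before sending it for synthesis. The sketch below does this with a regular expression. It assumes tags are never nested and that literal square brackets do not occur in the spoken text; split_tags and TAG_PATTERN are illustrative names, not part of any Fish Audio library.

```python
import re

# Matches one bracketed instruction such as [whisper] or [pitch up].
# Assumption: tags are not nested and contain no brackets themselves.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_tags(text):
    """Return (tags, plain_text) for a string with inline [tag] instructions."""
    tags = TAG_PATTERN.findall(text)
    plain = TAG_PATTERN.sub("", text)
    # Collapse the extra whitespace left behind where tags were removed.
    plain = " ".join(plain.split())
    return tags, plain


tags, plain = split_tags(
    "[excited] We shipped it! [pause] Finally, [whisper] we can rest."
)
print(tags)   # ['excited', 'pause', 'whisper']
print(plain)  # We shipped it! Finally, we can rest.
```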

What is the streaming performance of S2 Pro?

On a single NVIDIA H200 GPU, S2 Pro achieves a Real-Time Factor (RTF) of 0.195, time-to-first-audio of ~100ms, and throughput of 3,000+ acoustic tokens per second while maintaining RTF below 0.5. The SGLang-based inference engine inherits all LLM-native serving optimizations, including continuous batching, paged KV cache, CUDA graph replay, and RadixAttention-based prefix caching.
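
To put the RTF figure in concrete terms, the short sketch below assumes the usual definition of Real-Time Factor (generation time divided by the duration of the audio produced), which this page does not spell out. Under that assumption, any value below 1.0 means audio is generated faster than it plays back.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    return generation_seconds / audio_seconds


def generation_time(audio_seconds, rtf):
    """Estimated time to generate audio_seconds of speech at a given RTF."""
    return audio_seconds * rtf


# At the quoted RTF of 0.195, ten seconds of speech takes roughly 1.95 s.
estimate = generation_time(10.0, 0.195)
print(f"{estimate:.2f} s to generate 10 s of audio")

# Equivalent speed-up over real-time playback (1 / RTF), a little over 5x.
speedup = 1.0 / 0.195
print(f"{speedup:.1f}x faster than real time")
```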

How many languages does S2 Pro support?

S2 Pro supports 80+ languages. Tier 1 languages (highest quality) include Japanese, English, and Chinese. Tier 2 languages include Korean, Spanish, Portuguese, Arabic, Russian, French, and German. Many additional languages are supported, including Swedish, Italian, Turkish, Dutch, Hindi, Thai, Vietnamese, and more.

What is the license for S2 Pro?

S2 Pro is licensed under the Fish Audio Research License. Research and non-commercial use is permitted free of charge. Commercial use requires a separate license from Fish Audio; contact business@fish.audio for details.
