How do I get an API key?

Sign up for a free TTS.ai account, then navigate to your account dashboard and click "Generate API Key." Your key will be prefixed with sk-tts- and can be used immediately. Free accounts receive 15,000 characters to get started.

Is the API compatible with OpenAI's format?

Yes, our API follows OpenAI-compatible request and response formats. If you have existing code that uses OpenAI's TTS API, you can switch to TTS.ai by changing the base URL and API key with minimal code changes.

What programming languages are supported?

The REST API works with any language that can make HTTP requests. We provide code examples in Python, JavaScript (Node.js and browser), cURL, and more. Any language with an HTTP client library (Go, Ruby, Java, C#, PHP, etc.) can use the API.

What are the API rate limits?

Free accounts are limited to 3 requests per hour. Paid plans have higher limits based on your subscription tier: Starter (60/hour), Professional (300/hour), Enterprise (unlimited). Rate limit headers are included in every API response.

How does API pricing work?

API usage consumes characters based on the model tier and text length. Free models use 0 characters, standard models use 2x characters, and premium models use 4x characters. Characters are included in all paid plans and can also be purchased separately as character packs.

What endpoints are available?

The API provides endpoints for text-to-speech (POST /v1/tts/), speech-to-text (POST /v1/transcribe/), voice cloning (POST /v1/voice-clone/), voice conversion (POST /v1/voice-convert/), speech translation (POST /v1/speech-translate/), audio enhancement (POST /v1/audio-enhance/), vocal removal, stem splitting, key and BPM analysis, and more.

What audio formats does the API return?

The API returns audio in WAV format by default. You can specify the output format (mp3, wav, ogg, flac) using the response_format parameter. MP3 is recommended for web applications, WAV for further audio processing.

Is there a streaming API for real-time TTS?

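Yes. For supported models (Kokoro, MeloTTS), POST /v1/tts/stream/ streams audio as Server-Sent Events (SSE), so no polling is needed. The standard POST /v1/tts/ endpoint is asynchronous: it returns a job UUID, and you poll GET /v1/speech/results/ until the job completes and the response includes the audio URL. Fast models like Kokoro generate audio quickly enough for near-real-time applications even through the polling flow.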

How do I handle errors in the API?

The API returns standard HTTP status codes (400 for bad requests, 401 for auth errors, 429 for rate limits, 500 for server errors) with JSON error messages. Always check the status code and error field in responses for proper error handling.
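A minimal sketch of that check in Python, using only the standard library. It assumes error bodies are JSON objects with an `error` field, as described above; the `TTSAPIError` class and the rule that 429 and 5xx responses are retryable are illustrative choices, not part of an official SDK.

```python
import json


class TTSAPIError(Exception):
    """Illustrative exception carrying the HTTP status and the server's error message."""

    def __init__(self, status: int, message: str):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message

    @property
    def retryable(self) -> bool:
        # 429 (rate limited) and 5xx (server error) are usually transient;
        # 400 (bad request) and 401 (auth) need a fix on the client side.
        return self.status == 429 or self.status >= 500


def check_response(status: int, body: bytes):
    """Return the decoded JSON body for 2xx responses; raise TTSAPIError otherwise."""
    try:
        payload = json.loads(body) if body else None
    except ValueError:  # covers JSONDecodeError and non-UTF-8 bodies such as raw audio
        payload = None
    if 200 <= status < 300:
        return payload
    message = "unknown error"
    if isinstance(payload, dict) and "error" in payload:
        message = str(payload["error"])
    raise TTSAPIError(status, message)
```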

Can I use the API for commercial applications?

Yes, the API is designed for commercial use. Audio generated through the API can be used in your products, applications, and services. All models use open-source licenses, and there are no additional royalties on generated audio.

Is there a sandbox or testing environment?

Free-tier models (Kokoro, Piper, VITS, MeloTTS) serve as an excellent sandbox — they use zero characters and are available to all accounts. Test your integration with free models before switching to premium models for production use.

How do I list available voices and models via the API?

Use GET /v1/voices to list all available voices with filtering options (model, language, gender). Use GET /v1/models to list all available TTS models with their capabilities and tier information. Both endpoints return JSON responses.


API Documentation

Integrate TTS.ai into your applications with our REST API. OpenAI-compatible format for easy migration.


Overview

The TTS.ai API provides programmatic access to all platform features: text-to-speech synthesis, speech-to-text transcription, voice cloning, audio enhancement, and more. The API uses standard REST conventions with JSON request/response bodies.

API Key

Get your API key from Account Settings. Available on all plans, including free accounts.

Base URL

https://api.tts.ai/v1/

Auth

Bearer token via Authorization header

Authentication

Free tier — no key required. Anonymous POSTs to /v1/tts/ work without any auth, up to 5,000 characters/day per IP, using any of our free models (piper, vits, melotts, kokoro). Sign up for a free account to get 15,000 bonus characters and access to premium models.

For premium models and higher rate limits, authenticate with a Bearer token in the Authorization header.

HTTP Header

Authorization: Bearer sk-tts-your-api-key-here

Keep your API key secret. Do not share it in client-side code, public repositories, or logs. Rotate keys regularly from your account settings.
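A sketch of attaching the header with the standard library. It reads the key from an environment variable (`TTSAI_API_KEY` is a hypothetical name) so the key never appears in source code, and it omits the header when no key is set, which matches the anonymous free tier described above. The function only builds the request; send it with `urllib.request.urlopen(req, timeout=30)`.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.tts.ai/v1/"


def build_tts_request(payload: dict, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/tts/ request with optional Bearer auth."""
    # Fall back to a hypothetical environment variable so keys stay out of source control.
    key = api_key if api_key is not None else os.environ.get("TTSAI_API_KEY")
    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    return urllib.request.Request(
        BASE_URL + "tts/",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```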

SDKs

Official SDKs make it easy to integrate TTS.ai into your application. Both are open source and available on GitHub.

Python

pip install ttsai

from tts_ai import TTSClient

client = TTSClient(api_key="sk-tts-...")
audio = client.generate(
    text="Hello world!",
    model="kokoro"
)
client.save(audio, "output.wav")


JavaScript / Node.js

npm install @ttsainpm/ttsai

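const { TTSClient } = require('@ttsainpm/ttsai');

// CommonJS modules (require) do not allow top-level await, so run the calls inside an async function.
(async () => {
  const client = new TTSClient({
    apiKey: 'sk-tts-...'
  });
  const audio = await client.generate({
    input: 'Hello world!',
    model: 'kokoro'
  });
  await client.saveToFile(audio, 'output.wav');
})().catch(console.error);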


Base URL

Base URL: https://api.tts.ai/v1/

All endpoints are relative to this base URL. For example, the TTS endpoint is:

POST https://api.tts.ai/v1/tts/

Rate Limits

API rate limits vary by plan:

Plan	Requests/min	Concurrent Requests	Max Text Length
Free	10	2	500 chars
Starter	30	3	1,000,000 chars
Pro	60	5	1,000,000 chars
Business+	300	20	50,000 chars

Rate limit headers are included in every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
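A sketch of using these headers to decide how long to wait before the next request. This page does not say how `X-RateLimit-Reset` is encoded; the code assumes a Unix timestamp in seconds, a common convention that is worth confirming against a real response.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return 0 if requests remain (or the header is absent), else seconds until reset.

    Assumes X-RateLimit-Reset is a Unix epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    remaining = lowered.get("x-ratelimit-remaining")
    if remaining is None or int(remaining) > 0:
        return 0.0
    reset = lowered.get("x-ratelimit-reset")
    if reset is None:
        return 60.0  # arbitrary fallback when the reset time is missing
    return max(0.0, float(reset) - now)
```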

Character Usage

Service	Cost	Unit
TTS (Free models: Piper, VITS, MeloTTS)	1,000 characters	per 1,000 characters
TTS (Standard models: Kokoro, CosyVoice 2, etc.)	2,000 characters	per 1,000 characters
TTS (Premium models: Tortoise, Chatterbox, etc.)	4,000 characters	per 1,000 characters
Speech to Text	2,000 characters	per minute of audio
Voice Cloning	4,000 characters	per 1,000 characters
Voice Changer	3,000 characters	per minute of audio
Audio Enhancement	2,000 characters	per minute of audio
Vocal Removal / Stem Splitting	3,000-4,000 characters	per minute of audio
Speech Translation	5,000 characters	per minute of audio
Voice Chat	3,000 characters	per turn
Key & BPM Finder	Free	--
Audio Converter	Free	--
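The table reduces to simple arithmetic. This sketch estimates the characters a job consumes using the rates above. It assumes costs scale linearly and round up to a whole character; the table does not state a rounding rule, so treat the result as an estimate rather than the billed amount. For the vocal removal / stem splitting range it uses the upper bound.

```python
import math

# Rates from the table above: service -> (cost in characters, unit size).
# Unit size is 1,000 input characters for TTS/cloning and 1 minute for audio services.
RATES = {
    "tts_free": (1000, 1000),
    "tts_standard": (2000, 1000),
    "tts_premium": (4000, 1000),
    "voice_cloning": (4000, 1000),
    "speech_to_text": (2000, 1),
    "voice_changer": (3000, 1),
    "audio_enhancement": (2000, 1),
    "stem_splitting": (4000, 1),  # upper bound of the 3,000-4,000 range
    "speech_translation": (5000, 1),
}


def estimate_cost(service: str, amount: float) -> int:
    """Estimate characters consumed for `amount` input characters or minutes of audio."""
    cost, unit_size = RATES[service]
    # Linear scaling, rounded up to a whole character (the rounding policy is an assumption).
    return math.ceil(amount * cost / unit_size)
```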

Text to Speech

POST /v1/tts/

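Convert text to speech audio in the requested format. The request is processed asynchronously: the endpoint returns JSON containing a job UUID, which you poll to retrieve the audio file.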

Request Body

Parameter	Type	Required	Description
model	string	No	Model ID (e.g., `kokoro`, `chatterbox`, `piper`). If omitted, we auto-pick a model that supports the requested `language` — `kokoro` for en/ja/zh/ko/fr/de/it/pt/es/hi/ru, `piper` for other supported languages (ar/pl/nl/cs/da/fi/el/hu/tr/uk/vi/etc.).
text	string	Yes	Text to convert to speech. Per-request cap: 500 chars (anonymous), 5,000 (free account), 1,000,000 (paid plan). Long inputs are auto-chunked server-side.
voice	string	Yes	Voice ID (use `/v1/voices/` to list available voices)
format	string	No	Output format: `mp3` (default), `wav`, `flac`, `ogg`
speed	float	No	Speaking speed multiplier. Default: `1.0`. Range: `0.5` to `2.0`
language	string	No	Language code (e.g., `en`, `es`). Auto-detected if omitted.
instructions	string	No	Acting / delivery cues (≤500 chars).
pronunciations	object | array	No	Per-request pronunciation overrides, given as an object or an array.
stream	boolean	No	Enable streaming response. Default: `false`
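Validating a body on the client side catches mistakes before they cost a request. A minimal sketch that enforces the constraints in this table (required `text` and `voice`, the four output formats, speed between 0.5 and 2.0); the function name is hypothetical and the server remains the authority on what it accepts.

```python
from typing import Optional

ALLOWED_FORMATS = {"mp3", "wav", "flac", "ogg"}


def build_tts_payload(text: str, voice: str, *, model: Optional[str] = None,
                      fmt: str = "mp3", speed: float = 1.0,
                      language: Optional[str] = None) -> dict:
    """Return a JSON-ready body for POST /v1/tts/, rejecting values outside the documented ranges."""
    if not text:
        raise ValueError("text is required")
    if not voice:
        raise ValueError("voice is required")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    body = {"text": text, "voice": voice, "format": fmt, "speed": speed}
    # Optional fields are omitted so the server applies its own defaults
    # (model auto-selection, language auto-detection).
    if model is not None:
        body["model"] = model
    if language is not None:
        body["language"] = language
    return body
```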

Example Request

cURL

curl -X POST https://api.tts.ai/v1/tts/ \
  -H "Authorization: Bearer sk-tts-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "text": "Hello from TTS.ai! This is a test.",
    "voice": "af_bella",
    "format": "mp3"
  }'

SSML tags

Wrap numbers, dates, currency, phone numbers, and acronyms in `<say-as interpret-as="...">` tags to control how they are spoken. Supported interpret-as values:

interpret-as	Input	Spoken as
`cardinal`	`1234`	one thousand two hundred thirty-four
`ordinal`	`21`	twenty-first
`date`	`1999-12-31`	December thirty-first, nineteen ninety-nine
`time`	`14:30`	two thirty PM
`telephone`	`+1-555-867-5309`	plus one five five five eight six seven…
`currency`	`$1,234.56`	one thousand two hundred thirty-four dollars and fifty-six cents
`spell-out`	`NASA`	N A S A


Date format defaults to mdy for English and dmy elsewhere; override it with a `format` attribute on the `<say-as>` tag.

Example
                        
                    
{
  "model": "kokoro",
  "voice": "af_bella",
  "text": "Your appointment is on <say-as interpret-as=\"date\">2026-04-26</say-as> at <say-as interpret-as=\"time\">14:30</say-as>. Please call <say-as interpret-as=\"telephone\">+1-555-867-5309</say-as> if you need to reschedule."
}
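When text is assembled programmatically, a small helper can generate the `<say-as>` markup and escape user-supplied values so characters like `&` or `<` do not break the tags. A sketch, assuming the server parses the markup as XML (so `&amp;` is read back as `&`); the helper name is illustrative, and it accepts only the interpret-as values listed above.

```python
from xml.sax.saxutils import escape, quoteattr

INTERPRET_AS = {"cardinal", "ordinal", "date", "time", "telephone", "currency", "spell-out"}


def say_as(value: str, interpret_as: str) -> str:
    """Wrap `value` in a <say-as> tag, escaping XML special characters in the value."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unsupported interpret-as value: {interpret_as!r}")
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(value)}</say-as>"
```

The returned string goes into the `text` field; serializing the body with `json.dumps` escapes the inner quotes, as in the example above.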
                

Response

The TTS endpoint queues your request and returns a JSON response with a job UUID. You then poll for the result.

Step 1: Submit request

Response (JSON)

{
  "uuid": "77b71db532874ce98e84a69a2d740d4c",
  "job_id": "f21316bb-aefa-480d-8523-701d1e3184ce",
  "status": "queued",
  "credits_used": 11,
  "credits_remaining": 15000
}

Step 2: Poll for result

GET /v1/speech/results/?uuid=<job_uuid>

Pass the uuid value from step 1 and poll this endpoint every 1-2 seconds until status is completed or failed.

Polling response (completed)

{
  "status": "completed",
  "result_url": "https://api.tts.ai/static/downloads/77b71db5.../output.mp3"
}

Polling response (still processing)

{
  "status": "processing"
}

Step 3: Download audio

Fetch the result_url from the completed response to download the audio file.

Full example

Python
                        
                    
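import time

import requests

API_KEY = "sk-tts-your-key"
BASE = "https://api.tts.ai"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# 1. Submit TTS request (raise on 4xx/5xx instead of failing later with a KeyError)
resp = requests.post(f"{BASE}/v1/tts/", json={
    "model": "kokoro",
    "text": "Hello from TTS.ai!",
    "voice": "af_bella"
}, headers=HEADERS, timeout=30)
resp.raise_for_status()
job_uuid = resp.json()["uuid"]

# 2. Poll for result every 1.5 s, giving up after 5 minutes
deadline = time.monotonic() + 300
while time.monotonic() < deadline:
    poll = requests.get(f"{BASE}/v1/speech/results/",
        params={"uuid": job_uuid}, headers=HEADERS, timeout=30)
    poll.raise_for_status()
    result = poll.json()
    if result["status"] == "completed":
        # 3. Download audio (mp3 is the default output format)
        audio = requests.get(result["result_url"], timeout=60)
        audio.raise_for_status()
        with open("output.mp3", "wb") as f:
            f.write(audio.content)
        break
    elif result["status"] == "failed":
        raise RuntimeError(result.get("error", "Generation failed"))
    time.sleep(1.5)
else:
    # The while loop ran out of time without hitting `break`
    raise TimeoutError("TTS job did not finish within 5 minutes")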
                

Streaming alternative: For supported models (Kokoro, MeloTTS), use POST /v1/tts/stream/ for real-time Server-Sent Events (SSE) streaming; no polling is needed.
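Server-Sent Events arrive as lines of `field: value` text, with a blank line ending each event. This page does not document what each TTS.ai stream event contains (audio chunks, URLs, or status messages), so the sketch below only handles the SSE framing defined in the WHATWG HTML specification: it joins the `data:` lines of each event and leaves decoding the payload to you.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete Server-Sent Event in a stream of text lines."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event accumulated so far.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # the spec strips a single leading space
        if field == "data":
            data_lines.append(value)
        # Other fields (event, id, retry) are ignored in this sketch.
    # Per the spec, an event without a trailing blank line is discarded at end of stream.
```

To read a live stream, open the POST /v1/tts/stream/ request with `urllib.request.urlopen` and pass `(line.decode("utf-8") for line in resp)`, since iterating an HTTP response yields raw byte lines.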



            
            
Speech to Text

POST /v1/stt/

Transcribe audio to text. Supports 99 languages with auto-detection.

Request Body (multipart/form-data)

Parameter	Type	Required	Description
file	file	Yes	Audio file (MP3, WAV, FLAC, OGG, M4A, MP4, WebM). Max 100MB.
model	string	No	STT model: whisper (default), faster-whisper, sensevoice
language	string	No	Language code, or auto for auto-detection (default).
timestamps	boolean	No	Include word-level timestamps. Default: false
diarize	boolean	No	Enable speaker diarization. Default: false

Response
                    
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "start": 0.0,
      "end": 1.8,
      "text": "Hello, this is",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 1.8,
      "end": 3.5,
      "text": "a transcription test.",
      "speaker": "SPEAKER_00"
    }
  ]
}
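When `diarize` is enabled, each segment carries a `speaker` label. A short sketch that merges consecutive segments from the same speaker into turns, using only the fields shown in the response above; the `UNKNOWN` fallback for segments without a speaker label is an assumption about responses with diarization turned off.

```python
from typing import Dict, List


def speaker_turns(segments: List[Dict]) -> List[Dict]:
    """Merge consecutive segments with the same speaker into {speaker, start, end, text} turns."""
    turns: List[Dict] = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")  # assumed absent when diarization is off
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"].strip()
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": seg["text"].strip()})
    return turns
```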
                
            

            
            
Voice Cloning

POST /v1/tts/clone/

Generate speech in a cloned voice. Upload a reference audio clip and the text to speak.

Request Body (multipart/form-data)

Parameter	Type	Required	Description
reference_audio	file	Yes	Reference voice audio (10-30 seconds recommended). Max 20MB.
text	string	Yes	Text to speak in the cloned voice.
model	string	No	Clone model: chatterbox (default), cosyvoice2, gpt-sovits
format	string	No	Output format: mp3 (default), wav, flac
language	string	No	Target language code. Must be supported by the chosen model.

Response

Returns the audio file as binary data, same as the TTS endpoint.
            

            
            
Voice Changer

POST /v1/voice-convert/

Convert audio to sound like a different voice. Upload source audio and choose a target voice.

Request Body (multipart/form-data)

Parameter	Type	Required	Description
file	file	Yes	Source audio file (MP3, WAV, FLAC). Max 50MB.
target_voice	string	Yes	Target voice ID to convert to (use /v1/voices/ to list available voices)
model	string	No	Voice conversion model: openvoice (default), knn-vc
format	string	No	Output format: wav (default), mp3, flac

Example Request

cURL
                        
                    
curl -X POST https://api.tts.ai/v1/voice-convert/ \
  -H "Authorization: Bearer sk-tts-your-key" \
  -F "file=@source_audio.mp3" \
  -F "target_voice=af_bella" \
  -F "model=openvoice" \
  -o converted.wav
                

Response

Returns the converted audio file as binary data.
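The upload endpoints on this page (speech to text, voice cloning, voice changer, and the audio tools) take multipart/form-data, which cURL's `-F` flag builds automatically. Without a third-party HTTP client, the body can be assembled with the Python standard library. A sketch following the multipart/form-data layout of RFC 7578; the helper name is illustrative, and it does not escape quote characters in field or file names.

```python
import mimetypes
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, str],
                     files: Dict[str, Tuple[str, bytes]]) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request.

    `files` maps a field name to (filename, file_bytes).
    """
    boundary = uuid.uuid4().hex  # random, so it will not collide with file content in practice
    parts = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode() + value.encode() + b"\r\n")
    for name, (filename, data) in files.items():
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        head = (f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
                f'filename="{filename}"\r\nContent-Type: {ctype}\r\n\r\n')
        parts.append(head.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())  # closing delimiter
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

Pass `body` as `data=` to `urllib.request.Request(..., method="POST")` together with a `Content-Type: ctype` header and your `Authorization` header.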
            

            
            
Speech Translation

POST /v1/speech-translate/

Translate spoken audio from one language to another. Combines speech-to-text, translation, and text-to-speech in a single call.

Request Body (multipart/form-data)

Parameter	Type	Required	Description
file	file	Yes	Source audio file in the original language. Max 100MB.
target_language	string	Yes	Target language code (e.g., es, fr, de, ja)
voice	string	No	Voice for translated output. Auto-selected if omitted.
preserve_voice	boolean	No	Attempt to preserve the original speaker's voice characteristics. Default: false

Response

{
  "original_text": "Hello, how are you?",
  "translated_text": "Hola, ¿cómo estás?",
  "source_language": "en",
  "target_language": "es",
  "audio_url": "https://api.tts.ai/v1/results/translate_abc123.mp3",
  "credits_used": 5
}
                
            

            
            
Speech to Speech

POST /v1/speech-to-speech/

Transform speech style, emotion, or delivery while keeping the content. Useful for adjusting tone, pacing, and expressiveness.

Request Body (multipart/form-data)

Parameter	Type	Required	Description
file	file	Yes	Source speech audio file. Max 50MB.
voice	string	Yes	Target voice ID for the output speech
model	string	No	Model: openvoice (default), chatterbox
emotion	string	No	Target emotion: neutral, happy, sad, angry, excited
speed	float	No	Speed adjustment. Default: 1.0. Range: 0.5 to 2.0

Response

Returns the transformed audio file as binary data.
            

            
            
Audio Tools

Audio processing endpoints for enhancement, vocal removal, stem splitting, and more.

POST /v1/audio/enhance/

Enhance audio quality: denoise, improve clarity, super resolution.

Parameter	Type	Description
file	file	Audio file to enhance
denoise	boolean	Enable denoising (default: true)
enhance_clarity	boolean	Enhance speech clarity (default: true)
super_resolution	boolean	Upscale audio quality (default: false)
strength	integer	1-3 (light, medium, strong). Default: 2

POST /v1/audio/separate/

Separate vocals from instrumentals (vocal removal) or split into stems.

Parameter	Type	Description
file	file	Audio file to separate
model	string	demucs (default) or spleeter
stems	integer	Number of stems: 2, 4, 5, or 6 (default: 2)
format	string	Output format: wav, mp3, flac

POST /v1/audio/dereverb/

Remove echo and reverb from audio recordings.

Parameter	Type	Description
file	file	Audio file to process
type	string	echo or reverb (default: both)
intensity	integer	1-5 (default: 3)

POST /v1/audio/analyze/ (Free)

Analyze audio to detect key, BPM, and time signature.

Parameter	Type	Description
file	file	Audio file to analyze

Response

{
  "key": "C",
  "scale": "Major",
  "bpm": 120.0,
  "time_signature": "4/4",
  "camelot": "8B",
  "compatible_keys": ["C Major", "G Major", "F Major", "A Minor"]
}
                        
                    
                

                
                
                    
                        
POST /v1/audio/convert/ (Free)

Convert audio between formats.

Parameter	Type	Description
file	file	Audio file to convert
format	string	Target format: mp3, wav, flac, ogg, m4a, aac
bitrate	integer	Output bitrate in kbps: 64, 128, 192, 256, 320
sample_rate	integer	Sample rate: 22050, 44100, 48000
channels	string	mono or stereo
                                
                            
                        
                    
                
            

            
            
Voice Chat

POST /v1/voice-chat/

Send audio or text and receive an AI response with synthesized speech.

Request Body (multipart/form-data or JSON)

Parameter	Type	Required	Description
audio	file	No*	Audio input (either audio or text required)
text	string	No*	Text input (either audio or text required)
voice	string	No	Voice for AI response. Default: af_bella
tts_model	string	No	TTS model for response. Default: kokoro
system_prompt	string	No	Custom system prompt for the AI
conversation_id	string	No	Continue an existing conversation

Response
{
  "conversation_id": "conv_abc123",
  "user_text": "What is the capital of France?",
  "ai_text": "The capital of France is Paris.",
  "audio_url": "https://api.tts.ai/v1/audio/tmp/resp_xyz.mp3",
  "credits_used": 3
}
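To hold a multi-turn conversation, send the `conversation_id` from each response back with the next request. A minimal sketch of that bookkeeping for text turns; the class is hypothetical, and sending the request is left to your HTTP client.

```python
from typing import Optional


class VoiceChatSession:
    """Tracks conversation_id across POST /v1/voice-chat/ turns (illustrative helper)."""

    def __init__(self, voice: str = "af_bella", system_prompt: Optional[str] = None):
        self.voice = voice
        self.system_prompt = system_prompt
        self.conversation_id: Optional[str] = None

    def next_payload(self, text: str) -> dict:
        """Build the JSON body for the next text turn."""
        body = {"text": text, "voice": self.voice}
        if self.system_prompt:
            body["system_prompt"] = self.system_prompt
        if self.conversation_id is not None:
            # Continue the existing conversation instead of starting a new one.
            body["conversation_id"] = self.conversation_id
        return body

    def record_response(self, response: dict) -> str:
        """Store the conversation_id from a response and return the AI's text reply."""
        self.conversation_id = response["conversation_id"]
        return response["ai_text"]
```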
                
            

            
            
Batch TTS

POST /v1/tts/batch/

Submit multiple texts for parallel TTS generation. Optionally receive a webhook callback when all jobs complete.

Parameters

Parameter	Type	Description
texts	array	Array of objects: {text, model, voice}. Max 50 items.
webhook_url	string	Optional URL to POST results when batch completes.

Response
{
  "batch_id": "abc123",
  "total": 3,
  "completed": 0,
  "status": "processing"
}
                
Poll progress with GET /v1/tts/batch/result/?batch_id=abc123
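Because a batch accepts at most 50 items, a longer list has to be split across several requests. A sketch that builds the request bodies, assuming every item uses the same model and voice; the function name is illustrative.

```python
from typing import List, Optional


def build_batches(texts: List[str], model: str, voice: str,
                  webhook_url: Optional[str] = None, max_items: int = 50) -> List[dict]:
    """Split texts into POST /v1/tts/batch/ bodies of at most max_items entries each."""
    batches = []
    for i in range(0, len(texts), max_items):
        body = {"texts": [{"text": t, "model": model, "voice": voice}
                          for t in texts[i:i + max_items]]}
        if webhook_url:
            body["webhook_url"] = webhook_url
        batches.append(body)
    return batches
```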
            

            
            
Voice Embedding

POST /v1/voice-embed/

Pre-compute a voice embedding from reference audio. Use the returned embed_id in subsequent voice cloning requests for near-instant generation.

Parameters

Parameter	Type	Description
file	file	Reference audio file (WAV, MP3, FLAC).
model	string	Cloning model (default: chatterbox). Supported: chatterbox, cosyvoice2, openvoice, gpt-sovits, spark, indextts2, qwen3-tts.

Response
{
  "embed_id": "emb_abc123",
  "model": "chatterbox",
  "duration_ms": 450
}
                
            

            
            
Health Check

GET /v1/health/

Check GPU server status, loaded models, and queue size. No authentication required. Cached for 30 seconds.

Response
{
  "status": "online",
  "latency_ms": 45,
  "queue_size": 3,
  "models_loaded": ["kokoro", "chatterbox", "cosyvoice2"]
}
                
            

            
            
List Models

GET /v1/models/

Returns a list of all available models with their capabilities.

Response
{
  "models": [
    {
      "id": "kokoro",
      "name": "Kokoro",
      "type": "tts",
      "tier": "standard",
      "languages": ["en", "ja", "ko", "zh", "fr"],
      "supports_cloning": false,
      "supports_streaming": true,
      "credits_per_1k_chars": 2
    },
    {
      "id": "chatterbox",
      "name": "Chatterbox",
      "type": "tts",
      "tier": "premium",
      "languages": ["en"],
      "supports_cloning": true,
      "supports_streaming": true,
      "credits_per_1k_chars": 4
    }
  ]
}
                
            

            
            
List Voices

GET /v1/voices/

Returns a list of all available voices, optionally filtered by model, language, or gender.

Query Parameters

Parameter	Type	Description
model	string	Filter by model ID (e.g., kokoro)
language	string	Filter by language code (e.g., en)
gender	string	Filter by gender: male, female, neutral

Response
{
  "voices": [
    {
      "id": "af_bella",
      "name": "Bella",
      "model": "kokoro",
      "language": "en",
      "gender": "female",
      "preview_url": "https://api.tts.ai/v1/voices/preview/af_bella.mp3"
    }
  ],
  "total": 142
}
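A small Python sketch that builds the `/v1/voices/` query string from the three documented filters, sending only the ones you set, and pulls voice ids out of a decoded response. The base URL https://api.tts.ai is an assumption inferred from the `preview_url` in the sample; `voices_url` and `voice_ids` are hypothetical helpers.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL (the preview_url in the sample response uses api.tts.ai).
VOICES_URL = "https://api.tts.ai/v1/voices/"
GENDERS = {"male", "female", "neutral"}


def voices_url(model: Optional[str] = None, language: Optional[str] = None,
               gender: Optional[str] = None) -> str:
    """Build a GET /v1/voices/ URL that includes only the filters you set."""
    if gender is not None and gender not in GENDERS:
        raise ValueError(f"gender must be one of {sorted(GENDERS)}")
    params = {key: value for key, value in
              (("model", model), ("language", language), ("gender", gender))
              if value is not None}
    return VOICES_URL + ("?" + urlencode(params) if params else "")


def voice_ids(response: dict) -> list:
    """Extract voice ids from a decoded /v1/voices/ response."""
    return [voice["id"] for voice in response.get("voices", [])]


print(voices_url(model="kokoro", gender="female"))
# https://api.tts.ai/v1/voices/?model=kokoro&gender=female
```

Rejecting an unknown `gender` locally avoids spending a rate-limited request on a filter the server would not recognise.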
                
            

            
            
                Subtitles (SRT / VTT)
                
                    GET /v1/speech/subtitles/?uuid=<job_uuid>&format=srt|vtt&download=1
                
                Generates synchronised subtitles for any completed TTS job. The server runs Whisper alignment over the job's audio and returns the result as SRT or WebVTT. The result is cached on disk, so a repeat call for the same uuid is served from that cache instead of re-running alignment.
                Query Parameters
                
                    
                        Parameter Required Description
                        
                            uuid Yes Job UUID returned by /v1/tts/ or /v1/voice-clone/.
                            format No srt (default) or vtt.
                            download No Set to 1 to send Content-Disposition: attachment, so the browser saves the file instead of displaying it.
                            language No Hint to the alignment model (auto-detected if omitted).
                        
                    
                
                
                    cURL
curl "https://api.tts.ai/v1/speech/subtitles/?uuid=$UUID&format=srt&download=1" -o subtitles.srt
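A Python sketch for requesting and consuming the subtitles. `subtitles_url` builds the documented query string (same base URL as the cURL example), and `parse_srt` turns standard SRT cues into millisecond timings. Both are hypothetical helpers, not an official client; the parser handles SRT only (WebVTT uses `.` before the milliseconds and starts with a `WEBVTT` header) and assumes ordinary cue blocks separated by blank lines.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import urlencode

SUBTITLES_URL = "https://api.tts.ai/v1/speech/subtitles/"

# One SRT timing line, e.g. "00:00:01,500 --> 00:00:03,250".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2}),(\d{3})"
)


def subtitles_url(job_uuid: str, fmt: str = "srt", download: bool = False,
                  language: Optional[str] = None) -> str:
    """Build the GET URL for a completed job's subtitles."""
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")
    params = {"uuid": job_uuid, "format": fmt}
    if download:
        params["download"] = "1"
    if language:
        params["language"] = language
    return SUBTITLES_URL + "?" + urlencode(params)


def parse_srt(srt_text: str) -> List[Tuple[int, int, str]]:
    """Parse SRT text into (start_ms, end_ms, caption) tuples."""
    cues = []
    text = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", text):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            hit = _TIMING.match(line.strip())
            if hit:
                h1, m1, s1, f1, h2, m2, s2, f2 = map(int, hit.groups())
                start = ((h1 * 60 + m1) * 60 + s1) * 1000 + f1
                end = ((h2 * 60 + m2) * 60 + s2) * 1000 + f2
                # Everything after the timing line is the caption text.
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues
```

Working in integer milliseconds keeps cue arithmetic (offsets, merging, drift correction) exact.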
                
            

            
            
                Pronunciation Dictionary
                
                    GET /api/v1/pronunciations/
                    POST /api/v1/pronunciations/
                    DELETE /api/v1/pronunciations/?id=<entry_id>
                
                Tells the TTS engine how to pronounce specific words. Saved entries are applied automatically to every TTS request you make. Each account can store up to 200 entries.
                Request Body (POST)
                
                    
                        Parameter Type Description
                        
                            word string Word to override (e.g. GIF, Anthropic). Word-boundary matched.
                            replacement string How to spell it for the model (e.g. jiff, ann THROP ick).
                            language string Optional ISO language code. If empty, the entry applies to all languages.
                            case_sensitive boolean Default false. Match case exactly when true.
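To preview what an entry will do before spending credits, the documented rules (word-boundary match, case-insensitive unless `case_sensitive` is true, empty `language` meaning all languages) can be approximated locally. `apply_pronunciations` is a hypothetical helper, not the server's implementation: the order in which entries are applied, and how a language-specific entry behaves when a request has no language, are assumptions.

```python
import re
from typing import Dict, List, Optional


def apply_pronunciations(text: str, entries: List[Dict],
                         language: Optional[str] = None) -> str:
    """Approximate the dictionary rules locally.

    Assumptions: entries are applied in list order, and an entry with a
    non-empty language only applies when it equals `language`.
    """
    for entry in entries:
        entry_lang = entry.get("language", "")
        if entry_lang and entry_lang != language:
            continue
        flags = 0 if entry.get("case_sensitive", False) else re.IGNORECASE
        # \b boundaries assume the word starts and ends with a letter or digit.
        pattern = r"\b" + re.escape(entry["word"]) + r"\b"
        replacement = entry["replacement"]
        # A function replacement inserts the text literally (no \1 escapes).
        text = re.sub(pattern, lambda _m, r=replacement: r, text, flags=flags)
    return text


entries = [
    {"word": "GIF", "replacement": "jiff"},
    {"word": "Anthropic", "replacement": "ann THROP ick", "language": "en"},
]
print(apply_pronunciations("A GIF, not gifts.", entries))  # A jiff, not gifts.
```

The word boundary is what keeps "GIF" from rewriting the start of "gifts".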
                        
                    
                
                
                    cURL
# Save an entry
curl -X POST https://tts.ai/api/v1/pronunciations/ \
  -H "Authorization: Bearer sk-tts-..." \
  -H "Content-Type: application/json" \
  -d '{"word": "GIF", "replacement": "jiff"}'

# List your entries
curl https://tts.ai/api/v1/pronunciations/ -H "Authorization: Bearer sk-tts-..."

# Delete entry by id
curl -X DELETE "https://tts.ai/api/v1/pronunciations/?id=42" -H "Authorization: Bearer sk-tts-..."
                
                You can also pass per-request overrides without saving them: include pronunciations on any /v1/tts/ call, as either an object or an array (see the TTS endpoint parameters).
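The three cURL calls in this section map directly onto Python's standard library. This sketch builds `urllib.request.Request` objects without sending them, so they can be inspected offline; `send` assumes the endpoint replies with a JSON body, which this reference does not specify for these calls. The helper names are hypothetical and `sk-tts-...` is a placeholder key.

```python
import json
import urllib.request
from typing import Optional

PRONUNCIATIONS_URL = "https://tts.ai/api/v1/pronunciations/"


def _build(method: str, api_key: str, url: str = PRONUNCIATIONS_URL,
           body: Optional[dict] = None) -> urllib.request.Request:
    """Create an authenticated request; JSON-encode `body` when given."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def save_entry(api_key: str, word: str, replacement: str,
               language: str = "", case_sensitive: bool = False):
    """POST a new entry; optional fields are sent only when set."""
    body = {"word": word, "replacement": replacement}
    if language:
        body["language"] = language
    if case_sensitive:
        body["case_sensitive"] = True
    return _build("POST", api_key, body=body)


def list_entries(api_key: str):
    return _build("GET", api_key)


def delete_entry(api_key: str, entry_id: int):
    return _build("DELETE", api_key,
                  url=f"{PRONUNCIATIONS_URL}?id={int(entry_id)}")


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send a request and decode a JSON response body (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None


# Usage (sends a real request, so it needs a valid key):
# send(save_entry("sk-tts-...", "GIF", "jiff"))
```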
            

            
            
                Article Narrator
                Drop a single 

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    


    
    
    


Parameter	Type	Required	Description
file	file	Yes	Audio file (MP3, WAV, FLAC, OGG, M4A, MP4, WebM). Max 100MB.
model	string	No	STT model: `whisper` (default), `faster-whisper`, `sensevoice`
language	string	No	Language code. `auto` for auto-detection (default).
timestamps	boolean	No	Include word-level timestamps. Default: `false`
diarize	boolean	No	Enable speaker diarization. Default: `false`

Parameter	Type	Required	Description
reference_audio	file	Yes	Reference voice audio (10-30 seconds recommended). Max 20MB.
text	string	Yes	Text to speak in the cloned voice.
model	string	No	Clone model: `chatterbox` (default), `cosyvoice2`, `gpt-sovits`
format	string	No	Output format: `mp3` (default), `wav`, `flac`
language	string	No	Target language code. Must be supported by the chosen model.

Parameter	Type	Required	Description
file	file	Yes	Source audio file (MP3, WAV, FLAC). Max 50MB.
target_voice	string	Yes	Target voice ID to convert to (use `/v1/voices/` to list available voices)
model	string	No	Voice conversion model: `openvoice` (default), `knn-vc`
format	string	No	Output format: `wav` (default), `mp3`, `flac`

Parameter	Type	Required	Description
file	file	Yes	Source audio file in the original language. Max 100MB.
target_language	string	Yes	Target language code (e.g., `es`, `fr`, `de`, `ja`)
voice	string	No	Voice for translated output. Auto-selected if omitted.
preserve_voice	boolean	No	Attempt to preserve the original speaker's voice characteristics. Default: `false`

Parameter	Type	Required	Description
file	file	Yes	Source speech audio file. Max 50MB.
voice	string	Yes	Target voice ID for the output speech
model	string	No	Model: `openvoice` (default), `chatterbox`
emotion	string	No	Target emotion: `neutral`, `happy`, `sad`, `angry`, `excited`
speed	float	No	Speed adjustment. Default: `1.0`. Range: `0.5` to `2.0`

Parameter	Type	Description
file	file	Audio file to enhance
denoise	boolean	Enable denoising (default: true)
enhance_clarity	boolean	Enhance speech clarity (default: true)
super_resolution	boolean	Upscale audio quality (default: false)
strength	integer	1-3 (light, medium, strong). Default: 2

Parameter	Type	Description
file	file	Audio file to separate
model	string	`demucs` (default) or `spleeter`
stems	integer	Number of stems: 2, 4, 5, or 6 (default: 2)
format	string	Output format: `wav`, `mp3`, `flac`

Parameter	Type	Description
file	file	Audio file to process
type	string	`echo` or `reverb` (default: both)
intensity	integer	1-5 (default: 3)

Parameter	Type	Description
file	file	Audio file to convert
format	string	Target format: `mp3`, `wav`, `flac`, `ogg`, `m4a`, `aac`
bitrate	integer	Output bitrate in kbps: 64, 128, 192, 256, 320
sample_rate	integer	Sample rate in Hz: 22050, 44100, 48000
channels	string	`mono` or `stereo`

Parameter	Type	Required	Description
audio	file	No*	Audio input (either `audio` or `text` required)
text	string	No*	Text input (either `audio` or `text` required)
voice	string	No	Voice for AI response. Default: `af_bella`
tts_model	string	No	TTS model for response. Default: `kokoro`
system_prompt	string	No	Custom system prompt for the AI
conversation_id	string	No	Continue an existing conversation

Parameter	Type	Description
texts	array	Array of objects: `{text, model, voice}`. Max 50 items.
webhook_url	string	Optional URL to POST results when batch completes.
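The batch table above caps `texts` at 50 items per request. When you have more lines than that, one option is to split them on the client. This sketch only builds the request payloads, because the batch endpoint's path is not given in this section; `batch_payloads` is a hypothetical helper.

```python
from typing import Dict, List, Optional

MAX_BATCH_ITEMS = 50  # documented limit for the texts array


def batch_payloads(items: List[Dict], webhook_url: Optional[str] = None,
                   size: int = MAX_BATCH_ITEMS) -> List[Dict]:
    """Split {text, model, voice} items into payloads of at most `size` items."""
    if not 1 <= size <= MAX_BATCH_ITEMS:
        raise ValueError(f"size must be between 1 and {MAX_BATCH_ITEMS}")
    for item in items:
        if not item.get("text"):
            raise ValueError("every item needs a non-empty 'text'")
    payloads = []
    for start in range(0, len(items), size):
        payload = {"texts": items[start:start + size]}
        if webhook_url:
            payload["webhook_url"] = webhook_url
        payloads.append(payload)
    return payloads


lines = [{"text": f"Line {i}", "model": "kokoro", "voice": "af_bella"}
         for i in range(120)]
print([len(p["texts"]) for p in batch_payloads(lines)])  # [50, 50, 20]
```

Each payload can carry the same `webhook_url`, so completion notifications arrive per batch.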
