API Documentation

Integrate TTS.ai into your applications with our REST API. OpenAI-compatible format for easy migration.

REST API · OpenAI Compatible · JSON Responses · Streaming Support

Overview

The TTS.ai API provides programmatic access to all platform features: text-to-speech synthesis, speech-to-text transcription, voice cloning, audio enhancement, and more. The API uses standard REST conventions with JSON request/response bodies.

API Key

Get your API key from Account Settings. Available on all plans, including free accounts.

Base URL

https://api.tts.ai/v1/

Auth

Bearer token via Authorization header

Authentication

Free tier — no key required. Anonymous POSTs to /v1/tts/ work without any auth, up to 5,000 characters/day per IP, using any of our free models (piper, vits, melotts, kokoro). Sign up for a free account to get 15,000 bonus characters and access to premium models.

For premium models and higher rate limits, authenticate with a Bearer token in the Authorization header.

HTTP Header
Authorization: Bearer sk-tts-your-api-key-here
Keep your API key secret. Do not share it in client-side code, public repositories, or logs. Rotate keys regularly from your account settings.
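
As a minimal sketch of the two authentication modes described above, the helper below builds request headers with or without a Bearer token. The function name `build_headers` and the environment variable `TTS_AI_API_KEY` are assumptions made for illustration; they are not part of the TTS.ai API or SDKs.

```python
import os
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return JSON request headers; omit Authorization for anonymous free-tier calls."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Bearer scheme as documented: "Authorization: Bearer sk-tts-..."
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Reading the key from the environment keeps it out of source control.
# TTS_AI_API_KEY is a hypothetical variable name chosen for this sketch.
headers = build_headers(os.environ.get("TTS_AI_API_KEY"))
```

Passing no key yields headers suitable for an anonymous request to /v1/tts/; passing a key adds the Authorization header needed for premium models.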

SDKs

Official SDKs make it easy to integrate TTS.ai into your application. Both are open source and available on GitHub.

Python

pip install ttsai

from tts_ai import TTSClient

client = TTSClient(api_key="sk-tts-...")
audio = client.generate(
    text="Hello world!",
    model="kokoro"
)
client.save(audio, "output.wav")

JavaScript / Node.js

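npm install @ttsainpm/ttsai

const { TTSClient } = require('@ttsainpm/ttsai');

const client = new TTSClient({
  apiKey: 'sk-tts-...'
});

// Top-level await is not available in CommonJS modules,
// so the calls are wrapped in an async function.
async function main() {
  const audio = await client.generate({
    input: 'Hello world!',
    model: 'kokoro'
  });
  await client.saveToFile(audio, 'output.wav');
}

main().catch(console.error);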

Base URL

Base URL: https://api.tts.ai/v1/

All endpoints are relative to this base URL. For example, the TTS endpoint is:

POST https://api.tts.ai/v1/tts/

Rate Limits

API rate limits vary by plan:

Plan | Requests/min | Concurrent | Max Text Length
Free | 10 | 2 | 500 chars
Starter | 30 | 3 | 1,000,000 chars
Pro | 60 | 5 | 1,000,000 chars
Business+ | 300 | 20 | 50,000 chars

Rate limit headers are included in every response: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.
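
The rate limit headers can drive a simple client-side pause. The sketch below assumes that X-RateLimit-Reset carries a Unix timestamp in seconds; this page does not state the format, so check a real response before relying on it. The function name `seconds_to_wait` is hypothetical.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None) -> float:
    """Return how long to pause before the next request, or 0.0 if quota remains.

    `headers` is a plain mapping here; real HTTP clients usually expose a
    case-insensitive mapping, which works the same way.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    # ASSUMPTION: the reset value is a Unix timestamp in seconds.
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)
```

A caller would sleep for the returned number of seconds before retrying, rather than retrying immediately against an exhausted quota.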

Character Usage

Service | Cost | Unit
TTS (Free models: Piper, VITS, MeloTTS) | 1,000 characters | per 1,000 characters
TTS (Standard models: Kokoro, CosyVoice 2, etc.) | 2,000 characters | per 1,000 characters
TTS (Premium models: Tortoise, Chatterbox, etc.) | 4,000 characters | per 1,000 characters
Speech to Text | 2,000 characters | per minute of audio
Voice Cloning | 4,000 characters | per 1,000 characters
Voice Changer | 3,000 characters | per minute of audio
Audio Enhancement | 2,000 characters | per minute of audio
Vocal Removal / Stem Splitting | 3,000-4,000 characters | per minute of audio
Speech Translation | 5,000 characters | per minute of audio
Voice Chat | 3,000 characters | per turn
Key & BPM Finder | Free | --
Audio Converter | Free | --
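
To make the arithmetic concrete, here is a rough estimator built from the rates in the table. It assumes usage is billed proportionally with no rounding or minimum charge, which this page does not confirm; the dictionary keys and function names are made up for the sketch.

```python
# Character cost per 1,000 input characters, copied from the table above.
TTS_COST_PER_1K = {"free": 1000, "standard": 2000, "premium": 4000}

# Character cost per minute of audio, copied from the table above.
AUDIO_COST_PER_MIN = {
    "speech_to_text": 2000,
    "voice_changer": 3000,
    "audio_enhancement": 2000,
    "speech_translation": 5000,
}


def tts_cost(num_chars: int, tier: str) -> float:
    """Estimate the character cost of synthesizing `num_chars` of text."""
    # ASSUMPTION: proportional billing, no rounding.
    return num_chars * TTS_COST_PER_1K[tier] / 1000


def audio_cost(seconds: float, service: str) -> float:
    """Estimate the character cost of processing `seconds` of audio."""
    # ASSUMPTION: proportional billing, no rounding.
    return seconds / 60 * AUDIO_COST_PER_MIN[service]
```

Under these assumptions, 1,500 characters on a standard model cost about 3,000 characters of quota, and a 90-second transcription costs about 3,000 as well.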

Text to Speech

POST /v1/tts/

Convert text to speech audio. Returns audio file in the requested format.

Request Body

Parameter | Type | Required | Description
model | string | No | Model ID (e.g., kokoro, chatterbox, piper). If omitted, a model that supports the requested language is picked automatically: kokoro for en/ja/zh/ko/fr/de/it/pt/es/hi/ru, piper for other supported languages (ar/pl/nl/cs/da/fi/el/hu/tr/uk/vi/etc.).
text | string | Yes | Text to convert to speech. Per-request cap: 500 chars (anonymous), 5,000 (free account), 1,000,000 (paid plan). Long inputs are auto-chunked server-side.
voice | string | Yes | Voice ID (use /v1/voices/ to list available voices)
format | string | No | Output format: mp3 (default), wav, flac, ogg
speed | float | No | Speaking speed multiplier. Default: 1.0. Range: 0.5 to 2.0
language | string | No | Language code (e.g., en, es). Auto-detected if omitted.
instructions | string | No | Acting / delivery cues (≤500 chars). e.g. \
pronunciations | object or array | No | Per-request pronunciation overrides. Either {\
stream | boolean | No | Enable streaming response. Default: false
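
The constraints in the parameter table can be checked on the client before a request is sent. The following is a sketch only: `build_tts_payload` is a hypothetical helper, and the server remains the authority on what it accepts.

```python
from typing import Any, Dict, Optional

ALLOWED_FORMATS = {"mp3", "wav", "flac", "ogg"}


def build_tts_payload(text: str, voice: str, model: Optional[str] = None,
                      audio_format: str = "mp3", speed: float = 1.0,
                      instructions: Optional[str] = None) -> Dict[str, Any]:
    """Validate the documented limits and return a JSON-ready request body."""
    if not text:
        raise ValueError("text is required")
    if not voice:
        raise ValueError("voice is required")
    if audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if instructions is not None and len(instructions) > 500:
        raise ValueError("instructions must be 500 characters or fewer")

    payload: Dict[str, Any] = {
        "text": text, "voice": voice, "format": audio_format, "speed": speed,
    }
    # Optional fields are left out so the server applies its own defaults.
    if model is not None:
        payload["model"] = model
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```

Failing fast on an out-of-range speed or an unknown format avoids spending a rate-limited request on a call that would be rejected.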

Example Request

cURL
curl -X POST https://api.tts.ai/v1/tts/ \
  -H "Authorization: Bearer sk-tts-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "text": "Hello from TTS.ai! This is a test.",
    "voice": "af_bella",
    "format": "mp3"
  }' \
  --output output.mp3

SSML tags

Wrap numbers, dates, currency, phone numbers, and acronyms in <say-as interpret-as="..."> tags.

interpret-as | Input | Spoken as
cardinal | 1234 | one thousand two hundred thirty-four
ordinal | 21 | twenty-first
date | 1999-12-31 | December thirty-first, nineteen ninety-nine
time | 14:30 | two thirty PM
telephone | +1-555-867-5309 | plus one five five five eight six seven…
currency | $1,234.56 | one thousand two hundred thirty-four dollars and fifty-six cents
spell-out | NASA | N A S A

Date format defaults to mdy for English and dmy elsewhere; override with format=\

Example
{
  "model": "kokoro",
  "voice": "af_bella",
  "text": "Your appointment is on <say-as interpret-as=\"date\">2026-04-26</say-as> at <say-as interpret-as=\"time\">14:30</say-as>. Please call <say-as interpret-as=\"telephone\">+1-555-867-5309</say-as> if you need to reschedule."
}
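
Hand-writing the escaped quotes in the JSON above is error-prone, so a small helper can build the tags and let the JSON encoder handle the quoting. This is a sketch: `say_as` is a hypothetical function, and it assumes the server decodes standard XML entities such as `&amp;` inside the tag, which this page does not confirm.

```python
import json
from xml.sax.saxutils import escape, quoteattr


def say_as(interpret_as: str, value: str) -> str:
    """Wrap `value` in a say-as tag with the given interpret-as type."""
    # quoteattr adds the surrounding quotes; escape protects &, < and >.
    # ASSUMPTION: the server treats the text as XML and decodes entities.
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(value)}</say-as>"


text = (
    f"Your appointment is on {say_as('date', '2026-04-26')} "
    f"at {say_as('time', '14:30')}."
)

# json.dumps escapes the double quotes inside the tags automatically.
body = json.dumps({"model": "kokoro", "voice": "af_bella", "text": text})
```

The resulting `body` string can be sent as the request body of a /v1/tts/ call.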

Response

The TTS endpoint queues your request and returns a JSON response with a job UUID. You then poll for the result.

Step 1: Submit request

Response (JSON)
{
  "uuid": "77b71db532874ce98e84a69a2d740d4c",
  "job_id": "f21316bb-aefa-480d-8523-701d1e3184ce",
  "status": "queued",
  "credits_used": 11,
  "credits_remaining": 15000
}

Step 2: Poll for result

GET /v1/speech/results/?uuid=<job_uuid>

Poll this endpoint every 1-2 seconds until status is completed or failed.

Polling response (completed)
{
  "status": "completed",
  "result_url": "https://api.tts.ai/static/downloads/77b71db5.../output.mp3"
}
Polling response (still processing)
{
  "status": "processing"
}

Step 3: Download audio

Fetch the result_url from the completed response to download the audio file.

Full example

Python
import time

import requests

API_KEY = "sk-tts-your-key"
BASE = "https://api.tts.ai"

# 1. Submit TTS request
resp = requests.post(f"{BASE}/v1/tts/", json={
    "model": "kokoro",
    "text": "Hello from TTS.ai!",
    "voice": "af_bella"
}, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=30)
resp.raise_for_status()  # surface HTTP errors (bad key, rate limit) early
job_uuid = resp.json()["uuid"]

# 2. Poll for result
while True:
    poll = requests.get(f"{BASE}/v1/speech/results/",
        params={"uuid": job_uuid}, timeout=30)
    poll.raise_for_status()
    result = poll.json()
    if result["status"] == "completed":
        # 3. Download audio
        audio = requests.get(result["result_url"], timeout=30)
        audio.raise_for_status()
        with open("output.mp3", "wb") as f:
            f.write(audio.content)
        break
    elif result["status"] == "failed":
        raise RuntimeError(result.get("error", "Generation failed"))
    time.sleep(1.5)
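
The loop above polls forever if a job never leaves the queue. One way to bound it, sketched here, is a helper that takes the status fetch as a callable and gives up after a deadline. `poll_until_done` and its 120-second default timeout are illustrative choices, not part of the API.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(fetch_status: Callable[[], Dict[str, Any]],
                    interval: float = 1.5, timeout: float = 120.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    """Call `fetch_status` until the job completes, fails, or the timeout passes."""
    deadline = clock() + timeout
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(result.get("error", "Generation failed"))
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        sleep(interval)
```

In real use, `fetch_status` would be a function that performs the GET on /v1/speech/results/ and returns the parsed JSON. The `sleep` and `clock` parameters exist so the helper can be exercised without real delays.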

Streaming alternative: For supported models (Kokoro, MeloTTS), use POST /v1/tts/stream/ for real-time Server-Sent Events (SSE) streaming — no polling needed.
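
This page does not describe the event names or payloads that /v1/tts/stream/ emits, so the sketch below covers only the generic Server-Sent Events framing: `field: value` lines, with a blank line ending each event. The `iter_sse_events` helper and the sample event contents in the usage note are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group raw SSE lines into events; multiple data lines join with newlines."""
    fields: Dict[str, str] = {}
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; skip events with no data.
            if data_parts:
                fields["data"] = "\n".join(data_parts)
                yield fields
            fields, data_parts = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        name, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is dropped
        if name == "data":
            data_parts.append(value)
        else:
            fields[name] = value
```

For example, feeding it the lines `event: audio`, `data: AAAA`, and a blank line yields one event with `event` set to `audio` and `data` set to `AAAA`. How the data field encodes audio is something to confirm against a real stream.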

Speech to Text

POST /v1/stt/

Transcribe audio to text. Supports 99 languages with auto-detection.

Request Body (multipart/form-data)

Parameter | Type | Required | Description
file | file | Yes | Audio file (MP3, WAV, FLAC, OGG, M4A, MP4, WebM). Max 100MB.
model | string | No | STT model: whisper (default), faster-whisper, sensevoice
language | string | No | Language code. auto for auto-detection (default).
timestamps | boolean | No | Include word-level timestamps. Default: false
diarize | boolean | No | Enable speaker diarization. Default: false

Response

JSON Response
{
  "text": "Hello, this is a transcription test.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "start": 0.0,
      "end": 1.8,
      "text": "Hello, this is",
      "speaker": "SPEAKER_00"
    },
    {
      "start": 1.8,
      "end": 3.5,
      "text": "a transcription test.",
      "speaker": "SPEAKER_00"
    }
  ]
}
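
When diarization is enabled, consecutive segments from the same speaker are often easier to read as a single turn. The sketch below merges them, using only the fields shown in the response above; `merge_speaker_turns` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def merge_speaker_turns(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Merge consecutive segments that share a speaker into one turn each."""
    turns: List[Dict[str, Any]] = []
    for seg in response.get("segments", []):
        speaker = seg.get("speaker")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
            })
    return turns
```

Applied to the sample response, the two SPEAKER_00 segments collapse into one turn spanning 0.0 to 3.5 seconds.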

Voice Cloning

POST /v1/tts/clone/

Generate speech in a cloned voice. Upload a reference audio and text.

Request Body (multipart/form-data)

Parameter | Type | Required | Description
reference_audio | file | Yes | Reference voice audio (10-30 seconds recommended). Max 20MB.
text | string | Yes | Text to speak in the cloned voice.
model | string | No | Clone model: chatterbox (default), cosyvoice2, gpt-sovits
format | string | No | Output format: mp3 (default), wav, flac
language | string | No | Target language code. Must be supported by the chosen model.

Response

Returns the audio file as binary data, same as the TTS endpoint.

Voice Changer

POST /v1/voice-convert/

Convert audio to sound like a different voice. Upload source audio and choose a target voice.

Request Body (multipart/form-data)

Parameter | Type | Required | Description
file | file | Yes | Source audio file (MP3, WAV, FLAC). Max 50MB.
target_voice | string | Yes | Target voice ID to convert to (use /v1/voices/ to list available voices)
model | string | No | Voice conversion model: openvoice (default), knn-vc
format | string | No | Output format: wav (default), mp3, flac

Example Request

cURL
curl -X POST https://api.tts.ai/v1/voice-convert/ \
  -H "Authorization: Bearer sk-tts-your-key" \
  -F "file=@source_audio.mp3" \
  -F "target_voice=af_bella" \
  -F "model=openvoice" \
  -o converted.wav

Response

Returns the converted audio file as binary data.

Speech Translation

POST /v1/speech-translate/

Translate spoken audio from one language to another. Combines speech-to-text, translation, and text-to-speech in a single call.

Request Body (multipart/form-data)

Parameter | Type | Required | Description
file | file | Yes | Source audio file in the original language. Max 100MB.
target_language | string | Yes | Target language code (e.g., es, fr, de, ja)
voice | string | No | Voice for translated output. Auto-selected if omitted.
preserve_voice | boolean | No | Attempt to preserve the original speaker's voice characteristics. Default: false

Response

JSON Response
{
  "original_text": "Hello, how are you?",
  "translated_text": "Hola, como estas?",
  "source_language": "en",
  "target_language": "es",
  "audio_url": "https://api.tts.ai/v1/results/translate_abc123.mp3",
  "credits_used": 5
}

Speech to Speech

POST /v1/speech-to-speech/

Transform speech style, emotion, or delivery while keeping the content. Useful for adjusting tone, pacing, and expressiveness.

Request Body (multipart/form-data)

Parameter | Type | Required | Description
file | file | Yes | Source speech audio file. Max 50MB.
voice | string | Yes | Target voice ID for the output speech
model | string | No | Model: openvoice (default), chatterbox
emotion | string | No | Target emotion: neutral, happy, sad, angry, excited
speed | float | No | Speed adjustment. Default: 1.0. Range: 0.5 to 2.0

Response

Returns the transformed audio file as binary data.

Audio Tools

Audio processing endpoints for enhancement, vocal removal, stem splitting, and more.

POST /v1/audio/enhance/

Enhance audio quality: denoise, improve clarity, super resolution.

Parameter | Type | Description
file | file | Audio file to enhance
denoise | boolean | Enable denoising (default: true)
enhance_clarity | boolean | Enhance speech clarity (default: true)
super_resolution | boolean | Upscale audio quality (default: false)
strength | integer | 1-3 (light, medium, strong). Default: 2

POST /v1/audio/separate/

Separate vocals from instrumentals (vocal removal) or split into stems.

Parameter | Type | Description
file | file | Audio file to separate
model | string | demucs (default) or spleeter
stems | integer | Number of stems: 2, 4, 5, or 6 (default: 2)
format | string | Output format: wav, mp3, flac

POST /v1/audio/dereverb/

Remove echo and reverb from audio recordings.

Parameter | Type | Description
file | file | Audio file to process
type | string | echo or reverb (default: both)
intensity | integer | 1-5 (default: 3)

POST /v1/audio/analyze/ (Free)

Analyze audio to detect key, BPM, and time signature.

Parameter | Type | Description
file | file | Audio file to analyze

Response
{
  "key": "C",
  "scale": "Major",
  "bpm": 120.0,
  "time_signature": "4/4",
  "camelot": "8B",
  "compatible_keys": ["C Major", "G Major", "F Major", "A Minor"]
}
POST /v1/audio/convert/ (Free)

Convert audio between formats.

Parameter | Type | Description
file | file | Audio file to convert
format | string | Target format: mp3, wav, flac, ogg, m4a, aac
bitrate | integer | Output bitrate in kbps: 64, 128, 192, 256, 320
sample_rate | integer | Sample rate: 22050, 44100, 48000
channels | string | mono or stereo

Voice Chat

POST /v1/voice-chat/

Send audio or text and receive an AI response with synthesized speech.

Request Body (multipart/form-data or JSON)

Parameter | Type | Required | Description
audio | file | No* | Audio input (either audio or text required)
text | string | No* | Text input (either audio or text required)
voice | string | No | Voice for AI response. Default: af_bella
tts_model | string | No | TTS model for response. Default: kokoro
system_prompt | string | No | Custom system prompt for the AI
conversation_id | string | No | Continue an existing conversation

Response

JSON Response
{
  "conversation_id": "conv_abc123",
  "user_text": "What is the capital of France?",
  "ai_text": "The capital of France is Paris.",
  "audio_url": "https://api.tts.ai/v1/audio/tmp/resp_xyz.mp3",
  "credits_used": 3
}
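
A multi-turn exchange appears to work by passing the conversation_id from one response into the next request. The sketch below builds JSON bodies for text-only turns under that reading; `next_turn` is a hypothetical helper, and audio input would need multipart/form-data instead.

```python
from typing import Any, Dict, Optional


def next_turn(text: str, previous_response: Optional[Dict[str, Any]] = None,
              voice: str = "af_bella", tts_model: str = "kokoro") -> Dict[str, Any]:
    """Build the JSON body for a text turn, continuing a conversation if one exists."""
    payload: Dict[str, Any] = {"text": text, "voice": voice, "tts_model": tts_model}
    if previous_response is not None:
        # Reuse the ID returned by the previous call to keep the context.
        payload["conversation_id"] = previous_response["conversation_id"]
    return payload
```

The first call omits conversation_id. Each later call passes the previous response, so the server can associate the turns.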

Batch TTS

POST /v1/tts/batch/

Submit multiple texts for parallel TTS generation. Optionally receive a webhook callback when all jobs complete.

Parameters

Parameter | Type | Description
texts | array | Array of objects: {text, model, voice}. Max 50 items.
webhook_url | string | Optional URL to POST results when batch completes.

Response

JSON Response
{
  "batch_id": "abc123",
  "total": 3,
  "completed": 0,
  "status": "processing"
}

Poll progress with GET /v1/tts/batch/result/?batch_id=abc123
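
Because a batch accepts at most 50 items, longer job lists need to be split across several requests. A minimal sketch follows; the helper name `split_into_batches` is an assumption, and each returned dictionary is meant to be sent as the JSON body of one POST to /v1/tts/batch/.

```python
from typing import Any, Dict, List, Optional

MAX_BATCH_ITEMS = 50  # documented per-request limit


def split_into_batches(items: List[Dict[str, str]],
                       webhook_url: Optional[str] = None,
                       max_items: int = MAX_BATCH_ITEMS) -> List[Dict[str, Any]]:
    """Split {text, model, voice} items into request bodies of at most `max_items`."""
    bodies: List[Dict[str, Any]] = []
    for start in range(0, len(items), max_items):
        body: Dict[str, Any] = {"texts": items[start:start + max_items]}
        if webhook_url is not None:
            body["webhook_url"] = webhook_url
        bodies.append(body)
    return bodies
```

Each request returns its own batch_id, so progress has to be tracked per batch.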

Voice Embedding

POST /v1/voice-embed/

Pre-compute a voice embedding from reference audio. Use the returned embed_id in subsequent voice cloning requests for near-instant generation.

Parameters

Parameter | Type | Description
file | file | Reference audio file (WAV, MP3, FLAC).
model | string | Cloning model (default: chatterbox). Supported: chatterbox, cosyvoice2, openvoice, gpt-sovits, spark, indextts2, qwen3-tts.

Response

JSON Response
{
  "embed_id": "emb_abc123",
  "model": "chatterbox",
  "duration_ms": 450
}

Health Check

GET /v1/health/

Check GPU server status, loaded models, and queue size. No authentication required. Cached for 30 seconds.

Response

JSON Response
{
  "status": "online",
  "latency_ms": 45,
  "queue_size": 3,
  "models_loaded": ["kokoro", "chatterbox", "cosyvoice2"]
}
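
A client can use the health response to decide whether to submit work now or back off. The sketch below is one possible reading; the `is_ready` helper and its queue threshold of 10 are arbitrary illustrative choices, not documented limits.

```python
from typing import Any, Dict


def is_ready(health: Dict[str, Any], model: str, max_queue: int = 10) -> bool:
    """Return True if the server is online, the model is loaded, and the queue is short."""
    return (
        health.get("status") == "online"
        and model in health.get("models_loaded", [])
        # ASSUMPTION: max_queue is a client-side preference, not an API limit.
        and health.get("queue_size", 0) <= max_queue
    )
```

With the sample response, `is_ready(health, "kokoro")` is true, while a model that is not in models_loaded returns false. A model that is not loaded may still be usable after a load delay, which this check does not account for.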

List Models

GET /v1/models/

Returns a list of all available models with their capabilities.

Response

JSON Response
{
  "models": [
    {
      "id": "kokoro",
      "name": "Kokoro",
      "type": "tts",
      "tier": "standard",
      "languages": ["en", "ja", "ko", "zh", "fr"],
      "supports_cloning": false,
      "supports_streaming": true,
      "credits_per_1k_chars": 2
    },
    {
      "id": "chatterbox",
      "name": "Chatterbox",
      "type": "tts",
      "tier": "premium",
      "languages": ["en"],
      "supports_cloning": true,
      "supports_streaming": true,
      "credits_per_1k_chars": 4
    }
  ]
}
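
The capability flags in this response make it straightforward to select a model programmatically. A short sketch, using only the fields shown above; `find_models` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional


def find_models(catalog: Dict[str, Any], language: Optional[str] = None,
                streaming: Optional[bool] = None,
                cloning: Optional[bool] = None) -> List[str]:
    """Return IDs of models matching every filter that is not None."""
    matches: List[str] = []
    for model in catalog.get("models", []):
        if language is not None and language not in model.get("languages", []):
            continue
        if streaming is not None and model.get("supports_streaming") != streaming:
            continue
        if cloning is not None and model.get("supports_cloning") != cloning:
            continue
        matches.append(model["id"])
    return matches
```

Against the sample response, filtering by `language="ja"` returns only kokoro, and filtering by `cloning=True` returns only chatterbox.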

List Voices

GET /v1/voices/

Returns a list of all available voices, optionally filtered by model or language.

Query Parameters

Parameter | Type | Description
model | string | Filter by model ID (e.g., kokoro)
language | string | Filter by language code (e.g., en)
gender | string | Filter by gender: male, female, neutral

Response

JSON Response
{
  "voices": [
    {
      "id": "af_bella",
      "name": "Bella",
      "model": "kokoro",
      "language": "en",
      "gender": "female",
      "preview_url": "https://api.tts.ai/v1/voices/preview/af_bella.mp3"
    }
  ],
  "total": 142
}
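
The query parameters combine into an ordinary URL query string. The sketch below builds the request URL with the standard library; `voices_url` is a hypothetical helper, and unset filters are simply left out.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.tts.ai/v1/"


def voices_url(model: Optional[str] = None, language: Optional[str] = None,
               gender: Optional[str] = None) -> str:
    """Build the /v1/voices/ URL, including only the filters that were given."""
    filters = (("model", model), ("language", language), ("gender", gender))
    query = urlencode({name: value for name, value in filters if value is not None})
    return f"{BASE_URL}voices/" + (f"?{query}" if query else "")
```

Calling `voices_url(model="kokoro", language="en")` produces a URL that lists only English Kokoro voices.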

Subtitles (SRT / VTT) new

GET /v1/speech/subtitles/?uuid=<job_uuid>&format=srt|vtt&download=1

Generate synchronised subtitles for any completed TTS job. The endpoint runs Whisper alignment over the audio and returns SRT or WebVTT. The result is cached on disk, so a repeat call for the same uuid is served from disk.

Query Parameters

Parameter | Required | Description
uuid | Yes | Job UUID returned by /v1/tts/ or /v1/voice-clone/.
format | No | srt (default) or vtt.
download | No | 1 to send Content-Disposition: attachment so the browser saves rather than displays.
language | No | Hint to the alignment model (auto-detected if omitted).

cURL
curl "https://api.tts.ai/v1/speech/subtitles/?uuid=$UUID&format=srt&download=1" -o subtitles.srt

Pronunciation Dictionary new

GET, POST, DELETE /api/v1/pronunciations/

Tell the TTS engine how to pronounce specific words. Saved entries apply automatically to every TTS request you make. Each account can store up to 200 entries.

Request Body (POST)

Parameter | Type | Description
word | string | Word to override (e.g. GIF, Anthropic). Word-boundary matched.
replacement | string | How to spell it for the model (e.g. jiff, ann THROP ick).
language | string | Optional ISO code. Empty = applies to all languages.
case_sensitive | boolean | Default false. Match case exactly when true.

cURL
# Save an entry
curl -X POST https://tts.ai/api/v1/pronunciations/ \
  -H "Authorization: Bearer sk-tts-..." \
  -H "Content-Type: application/json" \
  -d '{"word": "GIF", "replacement": "jiff"}'

# List your entries
curl https://tts.ai/api/v1/pronunciations/ -H "Authorization: Bearer sk-tts-..."

# Delete entry by id
curl -X DELETE "https://tts.ai/api/v1/pronunciations/?id=42" -H "Authorization: Bearer sk-tts-..."

You can also pass per-request overrides without saving them — include pronunciations on any /v1/tts/ call as either an object or an array (see the TTS endpoint params).
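
To preview how a set of entries might rewrite a text before saving them, the matching rules from the table (word-boundary match, case-insensitive unless case_sensitive is true) can be approximated locally. This is a client-side sketch, not the server's implementation; `apply_pronunciations` is a hypothetical helper and the real engine may tokenize differently.

```python
import re
from typing import Any, Dict, Iterable


def apply_pronunciations(text: str, entries: Iterable[Dict[str, Any]]) -> str:
    """Replace whole-word matches of each entry's word with its replacement."""
    for entry in entries:
        flags = 0 if entry.get("case_sensitive", False) else re.IGNORECASE
        pattern = r"\b" + re.escape(entry["word"]) + r"\b"
        replacement = entry["replacement"]
        # A function replacement avoids backslash interpretation in the spelling.
        text = re.sub(pattern, lambda match, r=replacement: r, text, flags=flags)
    return text
```

With the entry `{"word": "GIF", "replacement": "jiff"}`, the word "gif" is rewritten while "gift" is left alone, because the match stops at word boundaries.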

Article Narrator new

Drop a single