Text-to-Speech
Generate natural-sounding speech from text using AI voice models
Generate high-quality speech audio from text using our collection of state-of-the-art text-to-speech models from leading AI providers.
Create Speech
POST /audio/speech
Convert text to natural-sounding speech audio.
Request Body
The TTS model to use for speech generation (e.g., “tts-1”, “tts-1-hd”, “elevenlabs”)
Text Input Parameters
The text to convert to speech (1-4096 characters)
Voice Parameters
Voice name or ID for speech generation
Common Parameters
The audio format for the generated speech (“mp3”, “opus”, “aac”, “flac”, “wav”, “pcm”)
The speed of the generated audio (0.25 to 4.0 for most models, default: 1.0)
Temperature for randomness in speech generation (0.0 to 2.0)
Advanced Parameters
instructions (GPT-4o Mini TTS): Additional instructions to control voice characteristics
speaker_transcript (Dia model): Speaker transcript for enhanced voice control (max 1000 characters)
cfg_filter_top_k (Dia model): CFG filter top-k value (15-50)
cfg_scale (Dia model): CFG scale value for generation control (1-5)
speech_rate (Microsoft TTS): Speech rate adjustment (-100 to 100, default: 0)
pitch_adjustment (Microsoft TTS): Pitch adjustment (-100 to 100, default: 0)
emotional_style (Microsoft TTS): Emotional style (e.g., “cheerful”, “sad”, “angry”)
Response
Returns an audio file in the specified format.
Basic Example
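A minimal sketch of a basic request, using only the Python standard library. The base URL is a placeholder, and the Bearer authorization scheme and the field names `model`, `input`, `voice`, and `response_format` are assumptions based on the common OpenAI-style convention; this page does not confirm them.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the real API host.
BASE_URL = "https://api.example.com/v1"


def build_speech_request(api_key: str, text: str, model: str = "tts-1",
                         voice: str = "alloy", response_format: str = "mp3",
                         speed: float = 1.0) -> urllib.request.Request:
    """Build (but do not send) a POST /audio/speech request."""
    if not 1 <= len(text) <= 4096:
        raise ValueError("text must be 1-4096 characters")
    payload = {
        # Assumption: OpenAI-style field names, not confirmed by this page.
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{BASE_URL}/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer scheme; the page only says keys start with 'ek-'.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the returned audio bytes to `path`."""
    with urllib.request.urlopen(request) as response:
        audio = response.read()
    with open(path, "wb") as f:
        f.write(audio)
```

With a real host and key, `synthesize_to_file(build_speech_request("ek-YOUR_KEY", "Hello, world!"), "hello.mp3")` would save the generated audio locally.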
Provider-Specific Examples
OpenAI Models (TTS-1, TTS-1 HD, GPT-4o Mini TTS)
ElevenLabs Models
Kokoro 82M Model
NariLabs Dia Model (Advanced)
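A hedged sketch of assembling a Dia request body, checking the documented limits (speaker transcript up to 1000 characters, `cfg_filter_top_k` 15-50, `cfg_scale` 1-5) before anything is sent. The Dia model identifier is not given on this page, so it is passed in by the caller; the `model` and `input` field names are assumptions.

```python
from typing import Any, Dict, Optional


def build_dia_payload(model: str, text: str,
                      speaker_transcript: Optional[str] = None,
                      cfg_scale: Optional[float] = None,
                      cfg_filter_top_k: Optional[int] = None) -> Dict[str, Any]:
    """Return a request body for the Dia model, validating documented ranges."""
    # Assumption: "model" and "input" field names are not confirmed by this page.
    payload: Dict[str, Any] = {"model": model, "input": text}
    if speaker_transcript is not None:
        if len(speaker_transcript) > 1000:
            raise ValueError("speaker_transcript must be at most 1000 characters")
        payload["speaker_transcript"] = speaker_transcript
    if cfg_scale is not None:
        if not 1 <= cfg_scale <= 5:
            raise ValueError("cfg_scale must be between 1 and 5")
        payload["cfg_scale"] = cfg_scale
    if cfg_filter_top_k is not None:
        if not 15 <= cfg_filter_top_k <= 50:
            raise ValueError("cfg_filter_top_k must be between 15 and 50")
        payload["cfg_filter_top_k"] = cfg_filter_top_k
    return payload
```

Optional parameters are left out of the body when unset, so the server applies its own defaults.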
MeloTTS Multilingual
PlayAI Dialog Models
Microsoft TTS
Available Models
OpenAI Models
TTS-1 (tts-1)
- Optimized for real-time text-to-speech
- Cost-effective for most applications
- 11 available voices: alloy, ash, ballad, coral, echo, fable, onyx, nova, sage, shimmer, verse
TTS-1 HD (tts-1-hd)
- High-quality text-to-speech
- Best audio quality with slower generation
- Same 11 voices as TTS-1
GPT-4o Mini TTS (gpt-4o-mini-tts)
- Advanced TTS with instruction following
- Supports voice control via instructions
- Same 11 voices as TTS-1
Premium Models
ElevenLabs (elevenlabs)
- Ultra-realistic human-like voices
- 40+ multilingual voices
- Supports English, Spanish, French, German, Arabic, Chinese, Hindi, Polish
PlayAI Dialog (playai-tts)
- Specialized for conversational content
- 27 expressive voices
- Optimized for dialogue and storytelling
Specialized Models
Kokoro 82M (kokoro-82m)
- Lightweight but high-quality
- 80+ voices in multiple languages
- Open-source Apache-licensed model
Microsoft TTS (microsoft-tts)
- Enterprise-grade quality
- 100+ neural voices
- Extensive language support
Voice Examples
OpenAI Voices
ElevenLabs Voices
Microsoft TTS Voices
Model-Specific Parameters
Provider | Special Parameters | Usage |
---|---|---|
GPT-4o Mini TTS | instructions | Natural language voice control |
Dia 1.6B | speaker_transcript, cfg_scale, cfg_filter_top_k | Advanced voice conditioning |
Microsoft TTS | speech_rate, pitch_adjustment, emotional_style | Voice modulation and emotions |
MeloTTS | lang | Language selection (en, fr, es, etc.) |
All Models | speed, temperature, top_p | Common generation controls |
Advanced Features
Speed Control
Adjust playback speed for different use cases:
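One way to handle this on the client side is to resolve a named preset or a raw number to a `speed` value and check it against the documented 0.25-4.0 range. The preset names and values below are illustrative assumptions, not part of the API.

```python
from typing import Union

MIN_SPEED, MAX_SPEED = 0.25, 4.0

# Illustrative presets only; these names and values are not from the API.
SPEED_PRESETS = {
    "slow_narration": 0.75,
    "normal": 1.0,
    "quick_summary": 1.5,
}


def resolve_speed(value: Union[str, float]) -> float:
    """Turn a preset name or a number into a validated speed value."""
    speed = SPEED_PRESETS[value] if isinstance(value, str) else float(value)
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
    return speed
```

The result can be placed in the `speed` field of the request body. Since the range applies to "most models", a particular model may accept a narrower one.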
Audio Formats
Choose the optimal format for your application:
- MP3: Standard, widely compatible
- WAV: Uncompressed, highest quality
- OGG/Opus: Efficient compression
- FLAC: Lossless compression
- AAC: Good balance of quality and size
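A small sketch of choosing an output filename from the requested format, so the saved file's extension matches the audio returned. The extension mapping follows common conventions and is an assumption; in particular, Opus audio is sometimes saved as `.ogg` instead of `.opus`.

```python
from pathlib import Path

# Format values accepted by the endpoint, mapped to conventional extensions.
# Assumption: the extensions are common practice, not specified by the API.
FORMAT_EXTENSIONS = {
    "mp3": ".mp3",
    "opus": ".opus",
    "aac": ".aac",
    "flac": ".flac",
    "wav": ".wav",
    "pcm": ".pcm",
}


def output_path(stem: str, response_format: str = "mp3") -> Path:
    """Return a file path whose extension matches the requested format."""
    try:
        extension = FORMAT_EXTENSIONS[response_format]
    except KeyError:
        raise ValueError(f"unsupported format: {response_format!r}") from None
    return Path(stem + extension)
```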
Instruction-Based Control (GPT-4o Mini TTS)
Control voice characteristics with natural language:
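A sketch of a request body that uses the `instructions` parameter with `gpt-4o-mini-tts`. The `model`, `input`, and `voice` field names are assumed from the OpenAI-style convention, and the instruction text is only an example.

```python
import json
from typing import Dict


def build_instructed_payload(text: str, instructions: str,
                             voice: str = "coral") -> Dict[str, str]:
    """Return a request body that steers delivery with natural language."""
    return {
        # Assumption: OpenAI-style field names, not confirmed by this page.
        "model": "gpt-4o-mini-tts",
        "input": text,
        "voice": voice,
        "instructions": instructions,
    }


payload = build_instructed_payload(
    "Your order has shipped.",
    "Speak in a warm, reassuring tone at a relaxed pace.",
)
print(json.dumps(payload, indent=2))
```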
Microsoft TTS Advanced Controls
Microsoft TTS offers fine-grained control over speech characteristics:
Speech Rate Control
Pitch Adjustment
Emotional Styles
Different voices support different emotional styles:
Available Emotional Styles (voice-dependent):
- cheerful: Happy and upbeat
- sad: Melancholic tone
- angry: Frustrated or upset
- fearful: Nervous or scared
- calm: Relaxed and peaceful
- gentle: Soft and caring
- newscast: Professional news anchor
- customerservice: Helpful and polite
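The three Microsoft TTS controls can be combined in one request body. The sketch below checks the documented ranges (-100 to 100) and the style names listed above before building the body. The `model`, `input`, and `voice` field names are assumptions, and the voice name is left to the caller because this page does not list Microsoft voices.

```python
from typing import Any, Dict, Optional

EMOTIONAL_STYLES = {
    "cheerful", "sad", "angry", "fearful",
    "calm", "gentle", "newscast", "customerservice",
}


def build_microsoft_payload(text: str, voice: str, speech_rate: int = 0,
                            pitch_adjustment: int = 0,
                            emotional_style: Optional[str] = None) -> Dict[str, Any]:
    """Return a Microsoft TTS request body with validated voice controls."""
    for name, value in (("speech_rate", speech_rate),
                        ("pitch_adjustment", pitch_adjustment)):
        if not -100 <= value <= 100:
            raise ValueError(f"{name} must be between -100 and 100")
    payload: Dict[str, Any] = {
        # Assumption: "model", "input", "voice" field names are not confirmed.
        "model": "microsoft-tts",
        "input": text,
        "voice": voice,
        "speech_rate": speech_rate,
        "pitch_adjustment": pitch_adjustment,
    }
    if emotional_style is not None:
        if emotional_style not in EMOTIONAL_STYLES:
            raise ValueError(f"unknown emotional style: {emotional_style!r}")
        payload["emotional_style"] = emotional_style
    return payload
```

Because style support is voice-dependent, a style that passes this check may still be rejected or ignored for a given voice.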
Best Practices
Text Optimization
- Use clear punctuation for natural pauses
- Spell out numbers and abbreviations
- Use SSML tags for fine-grained control (model-dependent)
Voice Selection
- Customer Service: Professional voices (echo, George)
- Storytelling: Warm voices (fable, nova)
- Educational: Clear voices (alloy, shimmer)
- Gaming: Character voices (onyx, sage)
Performance Tips
- Cache generated audio when possible
- Use appropriate audio formats for your platform
- Consider real-time vs. high-quality models based on use case
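The caching tip can be sketched as a small on-disk cache keyed by a hash of the request body, so identical requests reuse stored audio. The `synthesize` callable is a hypothetical stand-in for whatever function actually calls the API.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict


def cache_key(payload: Dict[str, Any]) -> str:
    """Hash the request body; sort_keys makes the key order-independent."""
    canonical = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def cached_speech(payload: Dict[str, Any],
                  synthesize: Callable[[Dict[str, Any]], bytes],
                  cache_dir: Path) -> bytes:
    """Return cached audio for `payload`, calling `synthesize` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(payload)}.bin"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(payload)
    path.write_bytes(audio)
    return audio
```

Requests with a nonzero `temperature` may produce different audio each time, so caching them pins whichever take was generated first.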
Error Handling
Common error scenarios:
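A hedged sketch of client-side error handling with the Python standard library. The status-code interpretations follow general HTTP conventions; the specific codes and error bodies this API returns are not documented on this page, so treat the messages as assumptions.

```python
import urllib.error
import urllib.request


def describe_http_error(status: int) -> str:
    """Map an HTTP status code to a likely cause (general HTTP semantics)."""
    if status == 400:
        return "bad request: check input length, voice name, and parameter ranges"
    if status in (401, 403):
        return "authentication failed: check the API key"
    if status == 404:
        return "not found: check the endpoint path and model name"
    if status == 429:
        return "rate limited: retry after a delay"
    if 500 <= status < 600:
        return "server error: retry later"
    return f"unexpected HTTP status {status}"


def send(request: urllib.request.Request) -> bytes:
    """Send a prepared request and return the audio bytes, or raise RuntimeError."""
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # HTTPError subclasses URLError, so it must be caught first.
        raise RuntimeError(f"TTS request failed: {describe_http_error(exc.code)}") from exc
    except urllib.error.URLError as exc:
        raise RuntimeError(f"could not reach the API: {exc.reason}") from exc
```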
Use Cases
- Voice Assistants: Natural conversation interfaces
- Audiobooks: Long-form content narration
- E-learning: Educational content delivery
- Accessibility: Screen reader alternatives
- Gaming: Character voice generation
- Customer Service: Automated phone systems
- Content Creation: Podcast and video narration
Authorizations
Enter your API key (starts with 'ek-')
Response
Audio file
The response is of type file.