AI Text-to-Speech turns written text into expressive spoken audio across 11 production-grade models — xAI, ElevenLabs, MiniMax, Inworld, Alibaba's Qwen3 (with voice cloning), and Nari Labs' Dia. The model picker matters: each one has its own voice catalog, language coverage, character limit, and per-1k-character price, and the first audible difference is usually intonation more than accent.
Pick the right model
Speech models
- xAI TTS — 6 voices, 20+ languages, supports inline <pause>/<emphasis> tags
- ElevenLabs v3 — studio-grade prosody, audio tags inside text, 70+ languages, 3,000-character limit
- ElevenLabs Multilingual v2 — workhorse for 29 languages, up to 10,000 characters
- ElevenLabs Flash / Turbo v2.5 — sub-second latency for chatbots, 32 languages, up to 40,000 characters
- Inworld Max / Mini — 75 named voices, expressive narration; Mini is cheaper and faster
Specialty models
- MiniMax Speech 2.8 — 300+ voices, language strongly biased via the language boost setting
- Qwen3 CustomVoice — 9 preset Alibaba voices with style control
- Qwen3 Base — voice cloning from a reference clip as short as 3 seconds
- Dia 1.6B — multi-speaker English dialogue with [laugh], [sigh], and speaker tags
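Because the character limits differ so much between models, long text often has to be split before submission. The following is a minimal sketch, assuming the limits quoted in the list above; the dictionary keys and the `split_for_model` helper are hypothetical names for illustration, not identifiers from any provider's API.

```python
# Character limits quoted in the model list; the keys are illustrative
# labels, not official model IDs.
CHAR_LIMITS = {
    "elevenlabs-v3": 3_000,
    "elevenlabs-multilingual-v2": 10_000,
    "elevenlabs-flash-v2.5": 40_000,
    "elevenlabs-turbo-v2.5": 40_000,
}


def split_for_model(text: str, model: str) -> list[str]:
    """Split text into chunks that each fit the model's character limit.

    Prefers to break after sentence-ending punctuation followed by a space,
    and falls back to a hard cut when no such boundary exists in the window.
    """
    limit = CHAR_LIMITS[model]
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        window = rest[:limit]
        # Index of the last sentence boundary inside the window, or -1.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        end = cut + 1 if cut != -1 else limit
        chunks.append(rest[:end].strip())
        rest = rest[end:].strip()
    if rest:
        chunks.append(rest)
    return chunks
```

Each chunk can then be submitted as a separate request. Splitting on sentence boundaries is a design choice to avoid cutting a word or phrase mid-utterance; the source does not say how any provider handles text that exceeds its limit.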
Voice cloning with Qwen3 Base
Qwen3 Base needs a 3–30 second reference clip. Two operating modes:
ICL mode (with transcript)
Provide both the audio clip and the exact transcript of what is said in it. Higher similarity, more natural prosody. Best for production work where the source clip is clean and you have the script handy.
x-vector mode (audio only)
Leave the transcript field empty. The model relies on a speaker embedding only — quicker to set up, but the clone is less faithful and may drift on long outputs. Good for quick experiments.
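The choice between the two modes comes down to whether a transcript is supplied. A minimal sketch of that decision, assuming the 3 to 30 second clip range stated above; `choose_clone_mode` and the returned mode labels are hypothetical names, not part of the Qwen3 API.

```python
from typing import Optional

MIN_CLIP_SECONDS = 3.0   # lower bound stated for the reference clip
MAX_CLIP_SECONDS = 30.0  # upper bound stated for the reference clip


def choose_clone_mode(clip_seconds: float, transcript: Optional[str]) -> str:
    """Return "icl" when a transcript is provided, otherwise "x-vector"."""
    if not (MIN_CLIP_SECONDS <= clip_seconds <= MAX_CLIP_SECONDS):
        raise ValueError(
            f"reference clip must be {MIN_CLIP_SECONDS:g}-{MAX_CLIP_SECONDS:g} s, "
            f"got {clip_seconds:g} s"
        )
    # A blank or whitespace-only transcript counts as "left empty".
    if transcript and transcript.strip():
        return "icl"
    return "x-vector"
```

For example, `choose_clone_mode(12.0, "Good evening, everyone.")` selects ICL mode, while the same clip with an empty transcript falls back to x-vector mode.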
Why ElevenLabs has no voice picker
ElevenLabs models in this catalog use the platform's default voice for each model — the per-voice ID parameter is not exposed on this provider. You can still tune the result with the four sliders under "Advanced":
- Stability — lower = more emotional range, more variation between runs; higher = consistent monotone narration
- Similarity — how closely the output adheres to the underlying voice; raise it on Multilingual v2 if the voice drifts mid-paragraph
- Style — exaggerates the voice's natural style; bumps latency at high values
- Speaker boost — slight clarity bump at a small latency cost (Flash/Turbo do not expose this)
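As an illustration of how the four sliders might be assembled into a request payload, here is a minimal sketch. The 0.0 to 1.0 range, the default values, the model labels, and the `build_advanced_settings` helper are all assumptions made for this example; only the rule that Flash/Turbo do not expose speaker boost comes from the list above.

```python
# Model labels are illustrative; per the list above, these models do not
# expose the speaker boost setting.
NO_SPEAKER_BOOST = {"flash-v2.5", "turbo-v2.5"}


def _clamp01(value: float) -> float:
    """Clamp a slider value to the assumed 0.0-1.0 range."""
    return max(0.0, min(1.0, value))


def build_advanced_settings(
    model: str,
    stability: float = 0.5,    # assumed default
    similarity: float = 0.75,  # assumed default
    style: float = 0.0,        # assumed default
    speaker_boost: bool = False,
) -> dict:
    """Build a settings dict, omitting speaker boost where it is unavailable."""
    settings = {
        "stability": _clamp01(stability),
        "similarity": _clamp01(similarity),
        "style": _clamp01(style),
    }
    if model not in NO_SPEAKER_BOOST:
        settings["speaker_boost"] = bool(speaker_boost)
    return settings
```

Dropping the unsupported key, rather than sending it and hoping it is ignored, keeps the payload honest about what the selected model can actually do.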
Inline tags worth remembering
xAI TTS and Dia both honor inline tags inside the text. ElevenLabs v3 supports a richer set of audio tags. Some examples that survive across providers:
[Captain] (laughs) Tell me that was the last drone.
[Navigator] Last drone? No. Last polite warning? Absolutely.
Welcome to the observatory. <pause time="600ms"/> The comet streaks across the sky like a silver flame, <emphasis level="strong">brilliant</emphasis> and brief.
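Tagged text like the examples above is easy to mistype by hand, so it can help to build it programmatically. A minimal sketch, assuming the tag syntax shown in those examples; the helper names are hypothetical, and the code does not check which provider accepts which tag.

```python
def pause(ms: int) -> str:
    """Return a self-closing pause tag with the duration in milliseconds."""
    return f'<pause time="{int(ms)}ms"/>'


def emphasis(text: str, level: str = "strong") -> str:
    """Wrap text in an emphasis tag. No escaping of the text is performed."""
    return f'<emphasis level="{level}">{text}</emphasis>'


def dialogue_line(speaker: str, text: str) -> str:
    """Prefix a line of dialogue with a bracketed speaker tag."""
    return f"[{speaker}] {text}"


narration = (
    "Welcome to the observatory. "
    + pause(600)
    + " The comet streaks across the sky like a silver flame, "
    + emphasis("brilliant")
    + " and brief."
)
script = "\n".join([
    dialogue_line("Captain", "(laughs) Tell me that was the last drone."),
    dialogue_line("Navigator", "Last drone? No. Last polite warning? Absolutely."),
])
```

Here `narration` and `script` reproduce the two examples above character for character.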
What drives the bill
All TTS models in this catalog charge per 1,000 input characters. The displayed price tag on the model picker is the per-1k rate; total cost scales linearly with text.length. A few practical implications:
- Pasting a 20,000-character chapter into ElevenLabs Flash costs ~20× as much as a 1,000-character passage.
- The credit hold is sized to the text length you submit — short prompts hold a small reservation; long ones reserve more, and the final settlement matches what the provider actually charged.
- Dia's listed price is for production use; in this catalog it is billed at roughly the same rate as Qwen3.
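The linear scaling described above can be sketched as a one-line estimate. The rate used in the usage example is a made-up placeholder, not a real price, and the sketch assumes every character counts and nothing is rounded up to a whole 1,000-character block; the source does not state either detail.

```python
def estimate_cost(text: str, rate_per_1k: float) -> float:
    """Estimate cost as (characters / 1000) * per-1k rate, with no rounding."""
    return len(text) / 1000 * rate_per_1k


# Hypothetical rate, for illustration only.
PLACEHOLDER_RATE = 0.05

chapter = "x" * 20_000
passage = "x" * 1_000
ratio = estimate_cost(chapter, PLACEHOLDER_RATE) / estimate_cost(passage, PLACEHOLDER_RATE)
```

Under these assumptions `ratio` comes out at about 20, whatever rate is plugged in: the rate cancels, so only text length drives the relative cost.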
Reading back the output
The download button on each result respects the format you selected (MP3 / WAV / FLAC / OGG) and tags the filename accordingly. History entries also remember the format that produced them, so re-downloading an older clip will not silently change the extension.