AI Text-to-Speech turns written text into expressive spoken audio using 11 production-grade models from xAI, ElevenLabs, MiniMax, Inworld, Alibaba (Qwen3, with voice cloning), and Nari Labs (Dia). The choice of model matters: each one has its own voice catalog, language coverage, character limit, and per-1k-character price, and the first audible difference is usually intonation rather than accent.

Pick the right model

Speech models

xAI TTS: 6 voices, 20+ languages, supports inline <pause> / <emphasis> tags
ElevenLabs v3: studio-grade prosody, audio tags inside text, 70+ languages, 3,000-character limit
ElevenLabs Multilingual v2: workhorse for 29 languages, up to 10,000 characters
ElevenLabs Flash / Turbo v2.5: sub-second latency for chatbots, 32 languages, up to 40,000 characters
Inworld Max / Mini: 75 named voices, expressive narration; Mini is cheaper and faster

Specialty models

MiniMax Speech 2.8: 300+ voices; the language boost setting strongly biases the output language
Qwen3 CustomVoice: 9 preset Alibaba voices with style control
Qwen3 Base: voice cloning from a reference clip as short as 3 seconds
Dia 1.6B: multi-speaker English dialogue with [laugh], [sigh], speaker tags
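
Because the character limits differ so much between models, long text may need to be split before submission. The following is a minimal sketch of sentence-aware chunking, not the catalog's own behavior: the dictionary keys and the function name `split_for_model` are hypothetical labels, and only the numeric limits come from the list above.

```python
import re

# Character limits quoted in the model list; the keys are illustrative labels,
# not official model identifiers.
CHAR_LIMITS = {
    "elevenlabs-v3": 3_000,
    "elevenlabs-multilingual-v2": 10_000,
    "elevenlabs-flash-v2.5": 40_000,
}


def split_for_model(text: str, model: str) -> list[str]:
    """Split text into chunks within the model's limit, preferring sentence ends."""
    limit = CHAR_LIMITS[model]
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted as a separate request. Splitting at sentence ends keeps prosody from being cut mid-phrase, at the cost of slightly uneven chunk sizes.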

Voice cloning with Qwen3 Base

Qwen3 Base needs a 3–30 second reference clip and runs in one of two modes:

ICL mode (with transcript)

Provide both the audio clip and the exact transcript of what is said in it. This gives higher similarity and more natural prosody, and is best for production work where the source clip is clean and you have the script handy.

x-vector mode (audio only)

Leave the transcript field empty. The model relies on a speaker embedding only, which is quicker to set up, but the clone is less faithful and may drift on long outputs. Good for quick experiments.
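
The choice between the two modes comes down to whether a transcript accompanies the clip. A minimal sketch of that decision, assuming a hypothetical `CloneRequest` structure and hypothetical mode labels (`with_transcript`, `audio_only`); only the 3–30 second bounds come from the text above:

```python
from dataclasses import dataclass
from typing import Optional

MIN_CLIP_SECONDS = 3.0   # lower bound stated for the reference clip
MAX_CLIP_SECONDS = 30.0  # upper bound stated for the reference clip


@dataclass
class CloneRequest:
    """Hypothetical container for the inputs to a cloning run."""
    clip_seconds: float
    transcript: Optional[str] = None


def cloning_mode(request: CloneRequest) -> str:
    """Validate the clip length and report which mode the inputs select."""
    if not MIN_CLIP_SECONDS <= request.clip_seconds <= MAX_CLIP_SECONDS:
        raise ValueError(
            f"reference clip must be {MIN_CLIP_SECONDS:g}-{MAX_CLIP_SECONDS:g} seconds, "
            f"got {request.clip_seconds:g}"
        )
    # A blank or whitespace-only transcript counts as "no transcript".
    if request.transcript and request.transcript.strip():
        return "with_transcript"
    return "audio_only"
```

Checking the clip length up front avoids submitting a reference that falls outside the supported range.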

Why ElevenLabs has no voice picker

ElevenLabs models in this catalog use the platform's default voice for each model — the per-voice ID parameter is not exposed on this provider. You can still tune the result with the four sliders under "Advanced":

Stability — lower = more emotional range, more variation between runs; higher = consistent monotone narration
Similarity — how closely the output adheres to the underlying voice; raise it on Multilingual v2 if the voice drifts mid-paragraph
Style — exaggerates the voice's natural style; bumps latency at high values
Speaker boost — slight clarity bump at a small latency cost (Flash/Turbo do not expose this)
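
As a rough illustration of how these controls combine, the sketch below assembles a settings dictionary and leaves out speaker boost for Flash and Turbo. Everything named here is an assumption: the function `build_voice_settings`, the model labels, the dictionary keys, and in particular the 0.0–1.0 slider range, which this page does not state.

```python
# Hypothetical labels for the models that do not expose speaker boost.
MODELS_WITHOUT_SPEAKER_BOOST = {"elevenlabs-flash-v2.5", "elevenlabs-turbo-v2.5"}


def build_voice_settings(
    model: str,
    stability: float,
    similarity: float,
    style: float,
    speaker_boost: bool = False,
) -> dict:
    """Collect the slider values, assuming each slider runs from 0.0 to 1.0."""
    sliders = {"stability": stability, "similarity": similarity, "style": style}
    for name, value in sliders.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    settings: dict = dict(sliders)
    # Flash and Turbo do not expose speaker boost, so the key is left out.
    if model not in MODELS_WITHOUT_SPEAKER_BOOST:
        settings["speaker_boost"] = speaker_boost
    return settings
```

For example, a low stability with a raised similarity on Multilingual v2 would trade consistency between runs for more emotional range while keeping the voice from drifting.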

Inline tags worth remembering

xAI TTS and Dia both honor inline tags inside the text. ElevenLabs v3 supports a richer set of audio tags. Some examples that survive across providers:

[Captain] (laughs) Tell me that was the last drone.
[Navigator] Last drone? No. Last polite warning? Absolutely.

Welcome to the observatory. <pause time="600ms"/> The comet streaks across the sky like a silver flame, <emphasis level="strong">brilliant</emphasis> and brief.
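
When text is generated programmatically, small helpers can keep the tag syntax consistent. This is a sketch that only mirrors the tag shapes in the examples above; the helper names `pause`, `emphasis`, and `dialogue` are hypothetical, and the set of attributes each provider accepts is not confirmed here.

```python
def pause(milliseconds: int) -> str:
    """Format a pause tag in the shape used in the example above."""
    return f'<pause time="{milliseconds}ms"/>'


def emphasis(text: str, level: str = "strong") -> str:
    """Wrap text in an emphasis tag in the shape used in the example above."""
    return f'<emphasis level="{level}">{text}</emphasis>'


def dialogue(lines: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into bracketed speaker-tag lines."""
    return "\n".join(f"[{speaker}] {line}" for speaker, line in lines)


narration = (
    "Welcome to the observatory. "
    + pause(600)
    + " The comet streaks across the sky like a silver flame, "
    + emphasis("brilliant")
    + " and brief."
)
```

Here `narration` reproduces the observatory example character for character, and `dialogue` builds lines in the same form as the Captain and Navigator exchange.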

What drives the bill

All TTS models in this catalog charge per 1,000 input characters. The displayed price tag on the model picker is the per-1k rate; total cost scales linearly with text.length. A few practical implications:

Pasting a 20,000-character chapter into ElevenLabs Flash costs ~20× as much as a 1,000-character passage, and ~200× as much as a 100-character caption.
The credit hold is sized to the text length you submit — short prompts hold a small reservation; long ones reserve more, and the final settlement matches what the provider actually charged.
Dia's listed price is for production use; in this catalog it is billed at roughly the same rate as Qwen3.
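
The linear pricing described above can be sketched as a one-line estimate. The function name `estimate_cost` and the rate used here are hypothetical placeholders; the real per-1k rate is whatever the model picker displays.

```python
def estimate_cost(text: str, rate_per_1k: float) -> float:
    """Estimated charge: character count divided by 1,000, times the per-1k rate."""
    return len(text) / 1000 * rate_per_1k


RATE = 0.10  # hypothetical per-1k rate, for illustration only

chapter = "x" * 20_000  # stands in for a 20,000-character chapter
passage = "x" * 1_000   # stands in for a 1,000-character passage

# Under linear pricing the ratio of costs equals the ratio of lengths.
ratio = estimate_cost(chapter, RATE) / estimate_cost(passage, RATE)
```

Because the rate cancels out, `ratio` depends only on the two text lengths, which is why the chapter costs about 20 times as much as the passage on any model.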

Reading back the output

The download button on each result respects the format you selected (MP3 / WAV / FLAC / OGG) and tags the filename accordingly. History entries also remember the format that produced them, so re-downloading an older clip will not silently change the extension.
