The text-to-speech tool converts written text into natural-sounding audio, supporting 9 languages with multiple voices per language. Type or paste up to 10,000 characters, choose a voice, and download the result as MP3, WAV, FLAC, AAC, OGG, or Opus.
Picking a Voice Before You Generate
Every voice has a preview button that plays a short sample (about 3 seconds) without consuming credits. Use it before committing to a full synthesis, especially for Chinese and Japanese where male and female voices have very different character. Generating and then re-generating because the voice wasn't right costs double the credits.
Audio Format by Use Case
MP3 / AAC / OGG
- Small file size, easy to share or embed
- MP3 has the broadest device compatibility
- AAC offers slightly better quality at the same bitrate
- OGG is an open, royalty-free format but is unsupported on some older devices
WAV / FLAC / PCM
- Lossless or uncompressed, so files are large
- WAV and FLAC are the right choice when you need to edit the audio afterward
- FLAC is about 50% smaller than WAV at the same quality
- PCM is raw sample data; most media players cannot play it directly
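If you download PCM and later need a playable file, the raw samples can be wrapped in a WAV container. The sketch below uses Python's standard wave module. The sample rate (24,000 Hz), mono channel, and 16-bit sample width are assumptions for illustration only; this page does not state the PCM parameters the tool uses, so check them before relying on the result.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container and return the WAV bytes.

    ASSUMPTION: the defaults (24 kHz, mono, 16-bit) are placeholders,
    not documented values for this tool.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)       # 1 = mono
        wav.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        wav.setframerate(sample_rate)    # samples per second
        wav.writeframes(pcm)             # header sizes are filled in on close
    return buf.getvalue()
```

If the wrong sample rate is supplied, the file still plays, but at the wrong pitch and speed, which is a quick way to spot a mismatched setting.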
Speed Range and Clarity
Speed ranges from 0.5x (very slow) to 4.0x (extremely fast). For narration meant for listening, 1.3–1.5x is usually the comfortable upper limit. Above 2.0x, articulation degrades noticeably across all voices and languages. If a project needs a faster pace, test at 1.8x before committing to higher values.
Word Timestamps (English Only)
Enabling word timestamps returns the start and end time of every word alongside the audio. During playback, the transcript highlights the current word in sync. This is useful for creating follow-along captions, language-learning players, or embedding the audio in a page that needs text-audio synchronization. The option is grayed out for all non-English languages.
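As a rough illustration of how a player could use the timestamps to highlight the current word, the sketch below looks up which word is being spoken at a given playback position. The data shape (a list of entries with "word", "start", and "end" keys, times in seconds, sorted by start time) is an assumption; this page does not describe the format the tool actually returns.

```python
from bisect import bisect_right


def current_word_index(timestamps, position):
    """Return the index of the word spoken at `position` seconds, or None.

    ASSUMPTION: `timestamps` is a list of dicts with "start" and "end"
    in seconds, sorted by "start". The real response format may differ.
    """
    starts = [w["start"] for w in timestamps]
    # Last word whose start time is at or before the playback position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < timestamps[i]["end"]:
        return i
    return None  # before the first word, in a pause, or after the last word


words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
]
```

With this sample data, a position of 0.2 seconds falls inside "Hello", while 0.45 seconds falls in the pause between the two words and returns None, so nothing is highlighted.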
Character Count and Credits
The limit is 10,000 characters per generation. Every character counts as one: individual Chinese or Japanese characters, individual Latin letters, and digits all count the same. Credits are charged per 1,000 characters. A 10,000-character synthesis therefore holds up to 10,000 Chinese characters, or about 1,800 English words.
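To estimate usage before generating, the counting rule above can be sketched in a few lines. Two things here are assumptions rather than documented behavior: that a partial block of 1,000 characters is rounded up, and that Python's len() (which counts Unicode code points) matches the tool's own counter. The helper names are hypothetical.

```python
import math

MAX_CHARS = 10_000   # per-generation limit stated on this page
BLOCK_SIZE = 1_000   # credits are charged per 1,000 characters


def billable_blocks(text: str) -> int:
    """Number of 1,000-character blocks in `text`.

    ASSUMPTION: partial blocks are rounded up; the page does not say.
    """
    return math.ceil(len(text) / BLOCK_SIZE)


def split_for_synthesis(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks at the last space or newline inside the window when one
    exists; otherwise (e.g. Chinese or Japanese text) cuts at the limit.
    """
    chunks = []
    while len(text) > limit:
        cut = max(text.rfind(" ", 0, limit), text.rfind("\n", 0, limit))
        if cut <= 0:
            cut = limit  # no usable whitespace: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

A 25,000-character Chinese text would be split into chunks of 10,000, 10,000, and 5,000 characters, each of which fits in a single generation.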