Text to Speech

10 credits per 1,000 characters

Overview

The text-to-speech tool converts written text into natural-sounding audio, supporting 9 languages with multiple voices per language. Type or paste up to 10,000 characters, choose a voice, and download the result as MP3, WAV, FLAC, AAC, OGG, or Opus.

Picking a Voice Before You Generate

Every voice has a preview button that plays a short sample (about 3 seconds) without consuming credits. Use it before committing to a full synthesis, especially for Chinese and Japanese where male and female voices have very different character. Generating and then re-generating because the voice wasn't right costs double the credits.

Audio Format by Use Case

Small file size, easy to share or embed:
MP3 has the broadest device compatibility
AAC offers slightly better quality than MP3 at the same bitrate
OGG is an open format but is unsupported on some older devices

Lossless or uncompressed, with large file size:
WAV and FLAC are the right choice when you need to edit the audio afterward
FLAC files are about 50% smaller than WAV at the same quality
PCM is raw sample data; most media players cannot play it directly

Speed Range and Clarity

Speed ranges from 0.5x (very slow) to 4.0x (extremely fast). For narration, 1.3–1.5x is usually the comfortable upper limit. Above 2.0x, articulation degrades noticeably across all voices and languages. If a project needs a faster pace, test at 1.8x before committing to higher values.
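To get a feel for what a speed setting does to running time, the sketch below estimates the length of a clip at a given multiplier. It assumes duration scales inversely with speed, which this page does not state; the function name and the range check are illustrative only.

```python
def estimated_duration(base_seconds: float, speed: float) -> float:
    """Estimate clip length at a speed multiplier.

    Assumption: length scales as 1 / speed. The real tool may treat
    pauses differently, so use this as a rough guide only.
    """
    # The speed range documented above is 0.5x to 4.0x.
    if not 0.5 <= speed <= 4.0:
        raise ValueError("speed must be between 0.5 and 4.0")
    return base_seconds / speed


# A clip that runs 60 seconds at 1.0x, played at 1.5x and at 0.5x.
print(estimated_duration(60.0, 1.5))
print(estimated_duration(60.0, 0.5))
```

Under that assumption, moving from 1.5x to 2.0x shortens a 60-second clip by only 10 seconds, which is a small gain for the loss in articulation described above.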

Word Timestamps (English Only)

Enabling word timestamps returns the start and end time of every word alongside the audio. During playback, the transcript highlights the current word in sync. This is useful for creating follow-along captions, language-learning players, or embedding the audio in a page that needs text-audio synchronization. The option is grayed out for all non-English languages.
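If you build your own player around the timestamps, highlighting comes down to finding which word spans the current playback position. The sketch below shows one way to do that with a binary search. The data shape (a list of entries with `word`, `start`, and `end` in seconds) is a hypothetical layout chosen for illustration, not the tool's documented output format.

```python
from bisect import bisect_right

# Hypothetical timestamp layout; the real export may use different keys.
words = [
    {"word": "Hello", "start": 0.00, "end": 0.40},
    {"word": "there", "start": 0.45, "end": 0.80},
    {"word": "world", "start": 0.95, "end": 1.40},
]


def current_word_index(words, position):
    """Return the index of the word being spoken at `position` seconds,
    or None if the position falls in a gap or outside the audio."""
    starts = [w["start"] for w in words]
    # Last word whose start time is <= position.
    i = bisect_right(starts, position) - 1
    if i >= 0 and position < words[i]["end"]:
        return i
    return None


print(current_word_index(words, 0.50))  # inside "there"
print(current_word_index(words, 0.85))  # gap between words
```

A player would call this on every time update and move the highlight only when the returned index changes.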

Character Count and Credits

The limit is 10,000 characters per generation. Each character counts as one: a Chinese or Japanese character, a Latin letter, and a digit all count the same. Credits are charged per 1,000 characters at 10 credits each, so a full 10,000-character generation costs 100 credits. Because every character counts once, 10,000 characters is 10,000 Chinese characters or about 1,800 English words.
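The arithmetic can be checked before generating with a short estimate like the one below. It uses the 10 credits per 1,000 characters rate and the 10,000-character limit from this page. Rounding up to the next full 1,000 characters is an assumption, since the page does not say how partial blocks are billed, and Python's `len` may count some emoji or combined characters differently from the tool.

```python
import math

CREDITS_PER_BLOCK = 10   # from this page: 10 credits per 1,000 characters
BLOCK_SIZE = 1_000
MAX_CHARS = 10_000       # per-generation limit from this page


def estimate_credits(text: str) -> int:
    """Estimate the credit cost of one generation.

    Assumption: partial blocks are rounded up to a full 1,000
    characters. The actual billing rule may differ.
    """
    count = len(text)  # every character counts as one, CJK or Latin
    if count > MAX_CHARS:
        raise ValueError(f"{count} characters exceeds the {MAX_CHARS} limit")
    return math.ceil(count / BLOCK_SIZE) * CREDITS_PER_BLOCK


print(estimate_credits("a" * 10_000))   # full-length English input
print(estimate_credits("你好" * 750))    # 1,500 Chinese characters
```

Regenerating the same text with a different voice costs the same estimate again, which is why the free voice preview is worth using first.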