AI Voice Cloning

Overview

AI Voice Cloning takes a reference audio clip and transfers its vocal timbre to new text, producing synthesized speech that sounds like the original speaker. Upload a recording, enter what you want it to say, and download the resulting audio file.

How Reference Audio Affects Quality

The reference recording is the single most important variable: it directly determines how closely the output voice matches the original speaker.

  • Aim for 5–30 seconds of audio; clips shorter than 3 seconds produce unstable results
  • Single speaker, quiet environment, no echo — background noise or reverb gets carried into the output
  • Keep speaking speed and volume steady; avoid extreme pitch shifts — the model learns average characteristics across the whole clip
  • Accepted formats: MP3, WAV, M4A, OGG
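The format and length guidelines above can be checked before uploading. The following is a minimal sketch, not part of the tool itself: the function name `check_reference_clip` is hypothetical, and it assumes you have already measured the clip duration yourself (for example, in an audio editor). It cannot detect noise, echo, or multiple speakers.

```python
from pathlib import Path

# Formats listed as accepted on this page.
ACCEPTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg"}


def check_reference_clip(filename: str, duration_seconds: float) -> list[str]:
    """Return warnings for a reference clip; an empty list means no issues found.

    Hypothetical helper: duration_seconds is supplied by the caller,
    this function does not read or decode the audio file.
    """
    warnings = []
    if Path(filename).suffix.lower() not in ACCEPTED_EXTENSIONS:
        warnings.append("unsupported format")
    if duration_seconds < 3:
        warnings.append("too short: results are likely unstable")
    elif duration_seconds < 5 or duration_seconds > 30:
        warnings.append("outside the recommended 5-30 second range")
    return warnings
```

For example, `check_reference_clip("voice.wav", 12.0)` returns an empty list, while a 2-second clip is flagged as too short.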

Text Length and How to Split Long Scripts

The text field accepts up to 2000 characters. For anything longer, split the script yourself and submit each chunk as a separate generation.

Even when the same reference audio is used across multiple generations, pauses and intonation can differ slightly between segments. For long-form audio, keep each chunk under 500 characters (well below the 2000-character field limit) and join the segments afterward in an audio editor.
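Splitting a long script by hand is tedious, so here is a minimal sketch of one way to do it. The function name `split_script` and its `limit` parameter are illustrative and not part of the tool. It assumes sentences end with `.`, `!`, or `?` followed by whitespace, which will not hold for every language or for abbreviations.

```python
import re


def split_script(text: str, limit: int = 500) -> list[str]:
    """Split text into chunks shorter than `limit` characters.

    Chunks break at sentence boundaries where possible. A single sentence
    that is itself too long is cut at a fixed length as a last resort.
    """
    max_len = limit - 1  # "under 500 characters" means at most 499
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_len:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Last resort: hard-cut a sentence that exceeds the limit on its own.
        while len(sentence) > max_len:
            chunks.append(sentence[:max_len])
            sentence = sentence[max_len:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be pasted into the text field as its own generation. Breaking at sentence boundaries keeps the unavoidable differences in pauses and intonation at points where a listener already expects a pause.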

What the Reference Text Field Does

"Reference Text" is an optional written transcript of what was said in the uploaded audio clip. Providing it helps the service understand the pronunciation patterns in the reference clip, which improves voice consistency when the audio has a non-native accent or many pauses. This field is especially useful when the reference audio language differs from the output text language.

Writing Effective Style Instructions

A style instruction is a short phrase describing the desired tone and emotion — for example, "calm and professional, suitable for narration" or "energetic and upbeat, suitable for advertising."

  • Keep it brief and specific — one sentence is enough
  • Contradictory descriptions ("relaxed yet formal") produce inconsistent output
  • Style instructions affect delivery and pacing, not the timbre itself — the voice always comes from the reference audio

Supported Languages

The tool supports 10 languages, listed in the language selector. Choosing the language that matches your output text helps the service handle pronunciation and stress rules correctly. The reference audio language does not have to match the output language: cross-language voice transfer is supported, though accent characteristics will shift somewhat.