AI Audio to Audio

Overview

AI Audio to Audio reimagines an existing track in a new style, keeping the melody while changing the genre, voice, or arrangement. Two model families share the same UI: MiniMax Music Cover for full song-to-song style transformations, and ACE-Step v1.5 (Base / Turbo) for music generation that can also take reference audio for cover or remix work.

Source audio rules

MiniMax Music Cover requires source audio between 6 seconds and 6 minutes long. Source audio is optional for ACE-Step; when it is present, the model treats it as a remix seed and the output length follows the source clip rather than the duration slider. The clip's duration is read from file metadata before upload, so files that fail to decode cannot be submitted.
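
The duration rule above could be checked before upload along these lines. This is a minimal sketch, not the tool's actual code: the function name, the model identifier string, and the choice to treat both limits as inclusive are assumptions made for illustration.

```python
from typing import Optional

# Limits from the text; treating both ends as inclusive is an assumption.
MINIMAX_MIN_SECONDS = 6.0
MINIMAX_MAX_SECONDS = 6 * 60.0


def check_source_duration(model: str, duration_seconds: Optional[float]) -> Optional[str]:
    """Return an error message, or None when the request may proceed.

    duration_seconds is None when no source clip is attached.
    """
    if model != "minimax-music-cover":  # hypothetical identifier
        return None  # ACE-Step: source audio is optional, no limits stated
    if duration_seconds is None:
        return "MiniMax Music Cover requires a source clip."
    if not MINIMAX_MIN_SECONDS <= duration_seconds <= MINIMAX_MAX_SECONDS:
        return "Source clip must be between 6 seconds and 6 minutes."
    return None
```

A caller would obtain duration_seconds from whatever decodes the file metadata and refuse to submit if decoding fails.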

Lyrics, section tags, and instrumental output

Both MiniMax and ACE-Step accept a structured lyrics field with section tags. Neither provider infers structure from prose paragraphs, so the bracketed tags are required.

[Intro]
[Verse]
Wheels in circles on a painted line
Neon streaks and a borrowed shine
[Chorus]
Glide with me through the afterglow
Where the silver speakers throb real low
[Bridge]
[Outro]

For MiniMax covers that should keep the original lyrics, the conventional pattern is to write the section skeleton plus a short hint that asks the model to retain the source vocal:

[Intro]
[Verse]
Keep the original lyrics and phrasing from the source vocal.
[Chorus]
Keep the original lyrics and phrasing from the source vocal.

ACE-Step generates instrumentals when the lyrics field is empty (or contains only structural tags). Set the vocal language to "Instrumental / Auto" under Advanced for cleaner instrumental results.
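
The "empty or only structural tags" rule can be illustrated with a small check. This is a sketch under assumptions: the helper names are hypothetical, and a tag is taken to be a line consisting solely of one bracketed label such as [Verse]; the provider's real parsing may differ.

```python
import re
from typing import List, Tuple

# A structural tag is assumed to be a line holding one bracketed label only.
TAG_LINE = re.compile(r"^\[[^\[\]]+\]$")


def split_lyrics(lyrics: str) -> Tuple[List[str], List[str]]:
    """Separate section tags from sung lines, skipping blank lines."""
    tags, sung = [], []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        if TAG_LINE.match(line):
            tags.append(line[1:-1])  # drop the surrounding brackets
        else:
            sung.append(line)
    return tags, sung


def is_instrumental(lyrics: str) -> bool:
    """True when the field is empty or contains only structural tags."""
    _, sung = split_lyrics(lyrics)
    return not sung
```

For example, a field containing only [Intro] and [Outro] counts as instrumental, while any non-tag line makes it a vocal request.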

ACE-Step advanced parameters

When source audio is present

  • Strength — fraction of denoising steps that follow the source. 0 ignores the source, 1 sticks to it. Start at 0.5 for noticeable but creative changes
  • Cover conditioning — how much of the original song's structure is preserved. Higher values keep the source recognizable
  • The duration slider is hidden — output length follows the source clip

Without source audio

  • Duration sets the output length (6–300 seconds, default 60)
  • Strength and cover conditioning have no effect and are disabled in the UI
  • Steps controls fineness of detail; Base allows up to 300 (default 100), Turbo up to 20 (default 10)
  • CFG Scale governs how closely the prompt is followed. If you provide a negative prompt, it must be greater than 1; the server raises it to 1.5 when it is not
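
The limits in this section can be summarized as a normalization step. The sketch below is illustrative only: the function and key names are hypothetical, the minimum of 1 step is an assumption, and clamping out-of-range values (rather than rejecting them) is a design choice the source does not confirm. Only the CFG bump to 1.5 is described in the text as server behavior.

```python
from typing import Optional

# (maximum, default) step counts per ACE-Step variant, from the text above.
STEP_LIMITS = {"base": (300, 100), "turbo": (20, 10)}


def normalize_ace_step(variant: str, cfg_scale: float, steps: Optional[int] = None,
                       duration: float = 60.0, negative_prompt: str = "",
                       has_source: bool = False) -> dict:
    """Apply the documented ranges and defaults to a set of ACE-Step settings."""
    max_steps, default_steps = STEP_LIMITS[variant]
    if steps is None:
        steps = default_steps
    else:
        steps = max(1, min(steps, max_steps))  # lower bound of 1 is assumed

    # A negative prompt needs CFG Scale above 1; the text says 1.5 is used otherwise.
    if negative_prompt.strip() and cfg_scale <= 1.0:
        cfg_scale = 1.5

    return {
        "steps": steps,
        "cfg_scale": cfg_scale,
        # With source audio the slider is hidden and length follows the clip.
        "duration": None if has_source else max(6.0, min(duration, 300.0)),
    }
```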

Crafting the style prompt

A useful style prompt for music generation reads like a music director's note rather than a poetic caption. List the elements you want to hear:

Late-70s funk-pop cover with a bright female lead, tight disco drums, elastic bassline, crisp rhythm guitar, brass stabs, sparkling synth accents, dramatic breakdown, triumphant final chorus.
Lo-fi hip-hop, jazzy electric piano chords, mellow boom-bap drums at 88 BPM, vinyl crackle, late-night focus mood, no vocals.

For better adherence, state the BPM in the prompt and set the same value on the BPM slider. When generating with lyrics, name the vocal language explicitly; otherwise ACE-Step defaults to English.

Cost and credit holds

MiniMax Music Cover is billed at a flat rate per generation, regardless of input length. ACE-Step is billed by the duration of the generated track:

  • Without source audio, the credit hold is sized to the duration slider
  • With source audio, the hold is sized from the measured source length so a 4-minute clip reserves enough credits even when the duration field is hidden
  • Final settlement uses the cost the provider reports on each task, so you pay what was actually charged rather than the held amount
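
The hold sizing described above amounts to choosing which duration to bill against. The following is a rough sketch: the per-second rate is a hypothetical parameter (the source gives no prices), rounding up to whole credits is an assumption, and the function names are illustrative.

```python
import math
from typing import Optional


def billable_seconds(duration_slider: float, source_seconds: Optional[float] = None) -> float:
    """Pick the duration the ACE-Step hold is sized from."""
    # With source audio the slider is hidden, so the measured clip length wins.
    return source_seconds if source_seconds is not None else duration_slider


def credit_hold(duration_slider: float, credits_per_second: float,
                source_seconds: Optional[float] = None) -> int:
    """Credits to reserve up front; credits_per_second is a made-up rate."""
    seconds = billable_seconds(duration_slider, source_seconds)
    return math.ceil(seconds * credits_per_second)  # rounding up is assumed
```

With a 4-minute source clip, the hold is computed from 240 seconds even if the hidden slider still reads 60.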

Listening back and downloading

Each generated track plays back inline. The download button respects the format you selected (MP3 / WAV / FLAC / OGG), and the history panel keeps each result in its original format, so re-downloading from history will not silently change the extension. The seed value shown next to a result lets you reproduce a generation, or rerun it with the same seed while changing a single parameter.