Audio Transcription

Overview

Audio transcription converts audio and video files into text. Supported input formats are MP3, WAV, FLAC, AAC, OPUS, OGG, M4A, MP4, MPEG, MOV, and WebM. The output can be plain text, JSON, SRT subtitles, VTT subtitles, or detailed JSON with per-word timestamps. The format determines what you can do with the result afterwards, so choose it before you submit.

Choosing an output format

Subtitle formats

  • SRT — the most compatible subtitle format; works in Premiere, Final Cut, CapCut, VLC, and PotPlayer
  • VTT — best for HTML5 <video> elements on websites
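
If you already have an SRT file and later need VTT for a web page, the two formats are close enough to convert with a few lines of code. The sketch below is a minimal illustration, not part of this tool: it assumes well-formed SRT with two-digit hours, and the function name srt_to_vtt is made up for the example. It adds the WEBVTT header and switches the millisecond separator from a comma to a period.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,200".
_SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT subtitle text to WebVTT (hypothetical helper)."""
    # Normalize Windows line endings and trim surrounding blank lines.
    body = srt_text.replace("\r\n", "\n").strip()
    # VTT uses "." before the milliseconds where SRT uses ",".
    body = _SRT_TIMING.sub(r"\1.\2 --> \3.\4", body)
    # A VTT file must start with the WEBVTT header and a blank line.
    # The numeric SRT counters are kept; VTT accepts them as cue identifiers.
    return "WEBVTT\n\n" + body + "\n"


sample_srt = """1
00:00:01,000 --> 00:00:04,200
Welcome to the show.

2
00:00:04,500 --> 00:00:07,000
Today we talk about compilers.
"""

vtt = srt_to_vtt(sample_srt)
print(vtt)
```

Styling tags or positioning data in the SRT are not handled here; for those, requesting VTT directly from the tool is the safer route.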

Text and data formats

  • Text — plain text for reading or pasting into a document
  • JSON — structured segments with start/end times, suited for scripted processing
  • Detailed JSON — adds word-level timestamps and speaker annotations; required for per-word timing or speaker diarization data

Word-level timestamps can only be enabled when "Detailed JSON" is selected. Speaker labels are also most complete in Detailed JSON; in other formats the annotation may be partial.
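
The guidance above can be summarized as a small decision helper. This is only a sketch of the reasoning: the function name, the use-case keys, and the choice to always prefer Detailed JSON when speaker labels are requested are assumptions made for illustration, not behavior of the tool.

```python
def choose_output_format(use_case: str, word_timestamps: bool = False,
                         speaker_labels: bool = False) -> str:
    """Return the format label suggested by the guidance (hypothetical helper)."""
    # Word-level timestamps are only available in Detailed JSON, and
    # speaker labels are most complete there, so both take priority.
    if word_timestamps or speaker_labels:
        return "Detailed JSON"
    # Illustrative use-case keys; they are not settings in the tool.
    by_use_case = {
        "video_editor": "SRT",   # Premiere, Final Cut, CapCut, VLC, PotPlayer
        "web_video": "VTT",      # HTML5 <video> elements
        "reading": "Text",       # reading or pasting into a document
        "scripting": "JSON",     # segments with start/end times
    }
    try:
        return by_use_case[use_case]
    except KeyError:
        raise ValueError(f"unknown use case: {use_case!r}") from None


print(choose_output_format("video_editor"))
print(choose_output_format("web_video", speaker_labels=True))
```

The sketch favors complete data over convenience: if you want subtitles with speaker names and can accept partial annotation, SRT or VTT may still be the more practical choice.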

Getting speaker labels right

Enable speaker labels when a recording contains more than one voice. Set the minimum and maximum speaker count to constrain the model: for a two-person interview, set both to 2; for a panel discussion with 5–8 participants, set the minimum to 3 and the maximum to 8 or 10. A narrower range reduces mis-assignments when speakers take turns clearly.
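
If you prepare these settings in a script before submitting, a quick sanity check on the range can catch mistakes early. The class below is a hypothetical container invented for this example; the tool itself only exposes a minimum and a maximum field.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerRange:
    """Hypothetical holder for the minimum/maximum speaker count."""
    minimum: int
    maximum: int

    def __post_init__(self) -> None:
        # A recording with speaker labels has at least one speaker.
        if self.minimum < 1:
            raise ValueError("minimum must be at least 1")
        # The range must not be inverted.
        if self.maximum < self.minimum:
            raise ValueError("maximum must be >= minimum")


# Values taken from the examples in the text above.
interview = SpeakerRange(minimum=2, maximum=2)   # two-person interview
panel = SpeakerRange(minimum=3, maximum=8)       # panel with 5-8 participants

print(interview, panel)
```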

Speaker diarization works best when voices are acoustically distinct and speakers do not interrupt each other frequently. Recordings where two people sound similar or overlap constantly will have lower labeling accuracy regardless of the range setting.

What the prompt field actually does

The prompt field is not a search filter. It tells the transcription model about vocabulary it is likely to encounter, which improves spelling and recognition of uncommon terms:

  • Technical terms and acronyms: WebAssembly, gRPC, CORS
  • Proper names: Satoshi Nakamoto, Cloudflare, Anthropic
  • Brief context: This is a podcast episode on TypeScript compiler internals

The prompt has no effect on output language and does not change which segments are transcribed.
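
When you transcribe many files on the same topic, it can help to assemble the prompt from a reusable vocabulary list. The following is a minimal sketch under stated assumptions: the function name build_prompt and the "Vocabulary:" wording are invented for the example, and the tool does not require any particular prompt layout.

```python
from typing import Sequence


def build_prompt(context: str = "", terms: Sequence[str] = (),
                 names: Sequence[str] = ()) -> str:
    """Assemble a vocabulary hint for the prompt field (hypothetical helper)."""
    parts = []
    if context.strip():
        # End the context sentence with exactly one period.
        parts.append(context.strip().rstrip(".") + ".")
    # Merge terms and names, skipping blanks.
    vocabulary = [w.strip() for w in (*terms, *names) if w.strip()]
    # Drop duplicates while keeping the original order.
    vocabulary = list(dict.fromkeys(vocabulary))
    if vocabulary:
        parts.append("Vocabulary: " + ", ".join(vocabulary) + ".")
    return " ".join(parts)


prompt = build_prompt(
    context="This is a podcast episode on TypeScript compiler internals",
    terms=["WebAssembly", "gRPC", "CORS"],
    names=["Satoshi Nakamoto", "Cloudflare", "Anthropic"],
)
print(prompt)
```

Paste the resulting string into the prompt field. Keeping it short and limited to terms that actually occur in the recording is a reasonable default.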

Audio conditions that affect accuracy

Results are significantly better when:

  • Speech is clear with low background noise (a conference room beats a cafe)
  • The speaker's pace is moderate and pronunciation is distinct
  • One language is spoken throughout without code-switching

Accuracy drops noticeably with heavy accents, very fast delivery, background music over voices, multiple people speaking simultaneously, or heavily compressed source audio (for example, voice memos recorded at a low bitrate).

The translation option

Enabling translation produces an English transcript even when the source audio is in another language. This is one-directional: it converts any language to English and does not translate English audio into other languages. Translations may diverge from a professional human translation, especially with idiomatic speech or technical content, so review them before publishing.