Choosing an output format

Subtitle formats

SRT: the most compatible subtitle format; works in Premiere, Final Cut, CapCut, VLC, and PotPlayer
VTT: best for HTML5 <video> elements on websites

Text and data formats

Text: plain text for reading or pasting into a document
JSON: structured segments with start/end times, suited for scripted processing
Detailed JSON: adds word-level timestamps and speaker annotations; required for per-word timing and the best choice when you need complete speaker diarization data

Word-level timestamps can only be enabled when "Detailed JSON" is selected. Speaker labels are also most complete in Detailed JSON; in other formats the annotation may be partial.
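
To make the difference between the two subtitle formats concrete, here is a minimal sketch that renders the same segments as SRT and as VTT. The segment layout used here (a list of dicts with `start` and `end` in seconds plus `text`) is an assumption for illustration and may not match the tool's actual JSON schema; the function names are hypothetical.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Render seconds as HH:MM:SS<separator>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments):
    """SRT: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    """VTT: 'WEBVTT' header line, period before the milliseconds."""
    blocks = ["WEBVTT\n"]
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        blocks.append(f"{start} --> {end}\n{seg['text']}\n")
    return "\n".join(blocks)


# Assumed segment shape, not the tool's documented schema.
segments = [
    {"start": 0.0, "end": 2.5, "text": "Hello"},
    {"start": 2.5, "end": 4.0, "text": "World"},
]
print(to_srt(segments))
print(to_vtt(segments))
```

The two outputs differ only in the header, the cue numbering, and the millisecond separator, which is why a player that expects one format will often reject the other.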

Getting speaker labels right

Enable speaker labels when a recording contains more than one voice. Set the minimum and maximum speaker count to constrain the model. When the count is known exactly, set both values to it: for a two-person interview, set both to 2. When it is uncertain, use a range instead: for a panel discussion with 5–8 participants, set the minimum to 3 and the maximum to 8 or 10. A narrower range reduces mis-assignments, especially when speakers take turns clearly.

Speaker diarization works best when voices are acoustically distinct and speakers do not interrupt each other frequently. Recordings where two people sound similar or overlap constantly will have lower labeling accuracy regardless of the range setting.
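
One way to check a result against the range you set is to count the distinct speaker labels in the output. The sketch below assumes each segment is a dict with an optional `speaker` key and labels such as `SPEAKER_00`; both the key name and the label format are assumptions for illustration, not the tool's documented schema.

```python
def check_speaker_count(segments, min_speakers, max_speakers):
    """Return the distinct speaker labels and whether their count is in range."""
    # Segments without a speaker label are ignored.
    labels = sorted({seg["speaker"] for seg in segments if seg.get("speaker")})
    within_range = min_speakers <= len(labels) <= max_speakers
    return labels, within_range


# Hypothetical output from a two-person interview.
segments = [
    {"speaker": "SPEAKER_00", "text": "Thanks for joining."},
    {"speaker": "SPEAKER_01", "text": "Glad to be here."},
    {"speaker": "SPEAKER_00", "text": "Let's start."},
]
labels, ok = check_speaker_count(segments, min_speakers=2, max_speakers=2)
print(labels, ok)
```

If the count falls outside the expected range, or one label carries only a handful of very short segments, that is a hint that similar voices or overlapping speech confused the labeling.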

What the prompt field actually does

The prompt field is not a search filter. It tells the transcription model about vocabulary it is likely to encounter, which improves spelling and recognition of uncommon terms:

Technical terms and acronyms: WebAssembly, gRPC, CORS

Proper names: Satoshi Nakamoto, Cloudflare, Anthropic

Brief context: This is a podcast episode on TypeScript compiler internals

The prompt has no effect on output language and does not change which segments are transcribed.
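
If you assemble prompts in a script, the three kinds of input above can be combined into one short string. This is a minimal sketch: the `build_prompt` helper and the "Vocabulary:" wording are hypothetical, and the tool may impose a length limit or prefer a different phrasing.

```python
def build_prompt(context="", terms=(), names=()):
    """Join brief context, technical terms, and proper names into one prompt."""
    parts = []
    if context:
        # Normalize to exactly one trailing period.
        parts.append(context.rstrip(".") + ".")
    vocabulary = [*terms, *names]
    if vocabulary:
        parts.append("Vocabulary: " + ", ".join(vocabulary) + ".")
    return " ".join(parts)


prompt = build_prompt(
    context="This is a podcast episode on TypeScript compiler internals",
    terms=["WebAssembly", "gRPC", "CORS"],
    names=["Satoshi Nakamoto", "Cloudflare", "Anthropic"],
)
print(prompt)
```

Keeping the prompt to terms that actually occur in the recording is likely to help more than a long generic word list.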

Audio conditions that affect accuracy

Results are significantly better when:

Speech is clear with low background noise (a conference room beats a cafe)

The speaker's pace is moderate and pronunciation is distinct

One language is spoken throughout without code-switching

Accuracy drops noticeably with heavy accents, very fast delivery, background music over voices, multiple people speaking simultaneously, or heavily compressed source audio (for example, voice memos recorded at a low bitrate).

The translation option

Enabling translation produces an English transcript even when the source audio is in another language. This is one-directional: it converts any language to English; it does not translate English audio into other languages. Translation results may diverge from a professional human translation, especially with idiomatic speech or technical content, so review them before publishing.
