Transcription Guidelines

Standards for accurate voice transcription with timestamps and speaker diarization

Overview

The goal of transcription is to produce an accurate, time-aligned text representation of spoken audio with clear speaker identification. Each segment should capture a single speaker's turn with precise start and end timestamps.

Output Format

Each annotation segment follows this structure:

Format Template

HH:MM:SS,mmm --> HH:MM:SS,mmm [Speaker Name]
Transcribed text content goes here.

Component	Format	Example
Start Time	`HH:MM:SS,mmm`	`00:01:23,456`
End Time	`HH:MM:SS,mmm`	`00:01:35,789`
Speaker Label	`[Name]`	`[Martha]` or `[Speaker 1]`
Text	Verbatim transcription	The spoken words...
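
The structure above can be checked mechanically. The following is a minimal sketch, not part of any official tooling, of how a segment header might be parsed into milliseconds and a speaker label. The names `HEADER_RE`, `Segment`, `to_ms`, and `parse_segment` are hypothetical, and the pattern assumes exactly one space on each side of the arrow and before the speaker label.

```python
import re
from dataclasses import dataclass

# Assumed header shape: "HH:MM:SS,mmm --> HH:MM:SS,mmm [Speaker Name]"
HEADER_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) \[([^\]]+)\]$"
)


@dataclass
class Segment:
    start_ms: int
    end_ms: int
    speaker: str
    text: str


def to_ms(hours: str, minutes: str, seconds: str, millis: str) -> int:
    """Convert the four timestamp fields to a millisecond offset."""
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_segment(header: str, text: str) -> Segment:
    """Parse one header line plus its text line into a Segment."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"Malformed segment header: {header!r}")
    fields = match.groups()
    return Segment(
        start_ms=to_ms(*fields[0:4]),
        end_ms=to_ms(*fields[4:8]),
        speaker=fields[8],
        text=text.strip(),
    )


seg = parse_segment("00:01:23,456 --> 00:01:35,789 [Martha]", "The spoken words...")
print(seg.start_ms, seg.end_ms, seg.speaker)  # 83456 95789 Martha
```

This sketch only checks the shape of the header. It does not reject out-of-range fields such as a minutes value of 75, and it does not verify that the end time follows the start time.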

Key Requirements

1. Timestamp Accuracy

Millisecond precision: Timestamps should fall within 10-50 ms of the true speech boundary when possible
Turn boundaries: Start time should align with speech onset, end time with speech offset
No overlaps: Segments should not overlap unless speakers are actually talking over each other
No unnecessary splits: Consecutive segments from the same speaker should be merged unless a meaningful pause separates them
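
The overlap and merge rules can be sketched in code. This is an illustrative example only: segments are assumed to be `(start_ms, end_ms, speaker, text)` tuples sorted by start time, and the 1000 ms value of `max_pause_ms` is a hypothetical threshold, since these guidelines do not define how long a "meaningful pause" is. The function names are likewise assumptions.

```python
from typing import List, Tuple

# Assumed representation: (start_ms, end_ms, speaker, text)
SegmentTuple = Tuple[int, int, str, str]


def find_overlaps(segments: List[SegmentTuple]) -> List[Tuple[int, int]]:
    """Return index pairs of adjacent segments where the second
    starts before the first has ended."""
    return [
        (i, i + 1)
        for i in range(len(segments) - 1)
        if segments[i + 1][0] < segments[i][1]
    ]


def merge_same_speaker(
    segments: List[SegmentTuple], max_pause_ms: int = 1000
) -> List[SegmentTuple]:
    """Merge consecutive same-speaker segments whose gap is at most
    max_pause_ms (a hypothetical threshold, not from the guidelines)."""
    merged: List[SegmentTuple] = []
    for start, end, speaker, text in segments:
        if merged:
            prev_start, prev_end, prev_speaker, prev_text = merged[-1]
            gap = start - prev_end
            if prev_speaker == speaker and 0 <= gap <= max_pause_ms:
                merged[-1] = (prev_start, end, speaker, f"{prev_text} {text}")
                continue
        merged.append((start, end, speaker, text))
    return merged


segments = [
    (0, 1000, "A", "Hi."),
    (1200, 2000, "A", "There."),
    (1900, 2500, "B", "Yes."),
    (6000, 7000, "B", "Later."),
]
print(find_overlaps(segments))       # [(1, 2)]
print(len(merge_same_speaker(segments)))  # 3
```

Flagged overlaps still need a human decision, because an overlap is acceptable when the speakers really are talking over each other. The check also compares adjacent segments only, so it would miss a long segment that overlaps a later, non-adjacent one.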

2. Speaker Diarization

Consistent labeling: Use the same label for the same speaker throughout
Clear identification: Use names if known (e.g., [Martha]), otherwise use [Speaker 1], [Speaker 2], etc.
Turn changes: Create a new segment whenever the speaker changes
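
Label consistency is easy to get wrong in long files. As a rough sketch, the hypothetical helper `inconsistent_labels` below groups labels that differ only in letter case or spacing, which is one plausible way to spot the same speaker written two ways. It cannot detect a speaker who was given two unrelated labels, such as `[Martha]` and `[Speaker 1]`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def inconsistent_labels(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels by a case- and whitespace-insensitive key and
    return only the groups that contain more than one spelling."""
    variants = defaultdict(set)
    for label in labels:
        key = " ".join(label.casefold().split())
        variants[key].add(label)
    return {key: sorted(found) for key, found in variants.items() if len(found) > 1}


report = inconsistent_labels(["Martha", "Alice", "martha", "Speaker 1", "Speaker  1"])
for key, spellings in report.items():
    print(key, spellings)
```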

3. Transcription Quality

Verbatim: Transcribe exactly what is said, including filler words (um, uh, like)
Punctuation: Use standard punctuation for readability
Non-speech sounds: Note relevant sounds in brackets, e.g., [chuckles], [noise], [sighs]
Unclear audio: Use [inaudible] or [unclear] for unintelligible speech
Interruptions: End cut-off speech with a hyphen (-); use an ellipsis (...) when speech trails off or resumes after an interruption

Example Transcription

Below is a properly formatted example showing timestamps, speaker labels, and transcription conventions:

00:00:01,440 --> 00:00:08,680 [Martha]
Hedwig, part one. Hi, and welcome to "The Real Weird Sisters." I'm Martha.

00:00:08,960 --> 00:00:10,140 [Alice]
And I'm Alice.

00:00:10,240 --> 00:00:21,610 [Martha]
And today, we're here. The day has finally arrived. We are here for our very first character study of our queen, Hedwig.

00:00:23,540 --> 00:00:34,320 [Alice]
Hold on. Sorry. I'm just gonna pause it 'cause it does feel like my input is pretty quiet on Audacity. Like, when I spoke, it was really low-looking. So let me just see if I can get it higher.

00:00:34,800 --> 00:00:35,320 [Martha]
Okay.

00:00:44,200 --> 00:00:44,900 [Alice]
Okay.

00:00:48,360 --> 00:00:50,220 [Martha]
I can always turn it up if I need to.

00:00:50,880 --> 00:00:58,700 [Alice]
Okay. [noise] All right.

00:01:05,140 --> 00:01:19,220 [Alice]
Yes, the queen, Hedwig. We are so excited. I think this is going to be the episode that we've all been waiting for. And there's so much to talk about that we, we figured we, we really shouldn't cram it all into one episode.

00:02:33,260 --> 00:02:34,200 [Alice]
Definitely not.

00:02:45,210 --> 00:02:47,239 [Alice]
We haven't done her justice at all.

00:03:24,280 --> 00:03:24,380 [Martha]
Mm-hmm.

00:03:24,400 --> 00:03:29,780 [Alice]
... in the appropriate head space to talk about this amazing person. I mean-

00:03:29,790 --> 00:03:30,140 [Martha]
And I [chuckles]-

00:03:30,150 --> 00:03:30,740 [Alice]
... bird.

Common Conventions

Scenario	Convention	Example
Laughter	Bracketed annotation	`[chuckles]`, `[laughs]`, `[both laugh]`
Sounds	Bracketed annotation	`[sighs]`, `[noise]`, `[clears throat]`
Interrupted speech	Hyphen at end	`And I was thinking-`
Trailing off	Ellipsis	`I don't know...`
Continuing after interruption	Ellipsis at start	`... and that's why I think`
Filler words	Include as spoken	`Um, uh, like, you know`
Emphasis	Quotation marks (or italics where the format supports them)	`"very" important`
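
The conventions in this table can be detected with simple string checks. The sketch below is an assumption about how a reviewer's script might do that. `ANNOTATION_RE` and `describe_text` are hypothetical names, and the checks look only at the start and end of a segment's text.

```python
import re
from typing import Dict, List, Union

# Matches bracketed annotations such as [chuckles] or [inaudible]
ANNOTATION_RE = re.compile(r"\[([^\[\]]+)\]")


def describe_text(text: str) -> Dict[str, Union[List[str], bool]]:
    """Report which transcription conventions appear in a segment's text."""
    stripped = text.strip()
    return {
        "annotations": ANNOTATION_RE.findall(stripped),
        "interrupted": stripped.endswith("-"),   # hyphen at end
        "trails_off": stripped.endswith("..."),  # ellipsis at end
        "continues": stripped.startswith("..."), # ellipsis at start
    }


print(describe_text("And I [chuckles]-"))
print(describe_text("... bird."))
```

A segment can match more than one convention at once, for example text that resumes with an ellipsis and is then cut off with a hyphen.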

Quality Checklist

Every segment has accurate start and end timestamps
Speaker labels are consistent throughout
Text is verbatim (includes ums, uhs, repetitions)
Non-speech sounds are annotated in brackets
Segments do not overlap (unless speakers actually talk over each other)
Punctuation aids readability
Unclear portions are marked as [inaudible] or [unclear]
