Transcription Guidelines

Standards for accurate voice annotation with timestamps and speaker diarization

Overview

The goal of transcription is to produce an accurate, time-aligned text representation of spoken audio with clear speaker identification. Each segment should capture a single speaker's turn with precise start and end timestamps.

Output Format

Each annotation segment follows this structure:

Format Template

HH:MM:SS,mmm --> HH:MM:SS,mmm [Speaker Name]
Transcribed text content goes here.

Component      Format                  Example
Start Time     HH:MM:SS,mmm            00:01:23,456
End Time       HH:MM:SS,mmm            00:01:35,789
Speaker Label  [Name]                  [Martha] or [Speaker 1]
Text           Verbatim transcription  The spoken words...
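As an illustration of the header format, the following minimal Python sketch parses one header line into millisecond offsets and a speaker label. The names `HEADER_RE`, `SegmentHeader`, `to_ms`, and `parse_header` are hypothetical helpers introduced here, not part of any annotation tool, and the sketch assumes two-digit hours and a single space on each side of the arrow.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for "HH:MM:SS,mmm --> HH:MM:SS,mmm [Speaker Name]".
HEADER_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) \[([^\]]+)\]$"
)


class SegmentHeader(NamedTuple):
    start_ms: int
    end_ms: int
    speaker: str


def to_ms(hours, minutes, seconds, millis):
    """Convert the four timestamp fields to a millisecond offset."""
    total_seconds = (int(hours) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)


def parse_header(line):
    """Parse one header line; raise ValueError if it does not match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid segment header: {line!r}")
    fields = match.groups()
    return SegmentHeader(to_ms(*fields[0:4]), to_ms(*fields[4:8]), fields[8])


header = parse_header("00:01:23,456 --> 00:01:35,789 [Martha]")
print(header)
```

Working in integer milliseconds avoids floating-point rounding when timestamps are later compared or subtracted.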

Key Requirements

1. Timestamp Accuracy

2. Speaker Diarization

3. Transcription Quality

Example Transcription

Below is a properly formatted example showing timestamps, speaker labels, and transcription conventions:

00:00:01,440 --> 00:00:08,680 [Martha]
Hedwig, part one. Hi, and welcome to "The Real Weird Sisters." I'm Martha.

00:00:08,960 --> 00:00:10,140 [Alice]
And I'm Alice.

00:00:10,240 --> 00:00:21,610 [Martha]
And today, we're here. The day has finally arrived. We are here for our very first character study of our queen, Hedwig.

00:00:23,540 --> 00:00:34,320 [Alice]
Hold on. Sorry. I'm just gonna pause it 'cause it does feel like my input is pretty quiet on Audacity. Like, when I spoke, it was really low-looking. So let me just see if I can get it higher.

00:00:34,800 --> 00:00:35,320 [Martha]
Okay.

00:00:44,200 --> 00:00:44,900 [Alice]
Okay.

00:00:48,360 --> 00:00:50,220 [Martha]
I can always turn it up if I need to.

00:00:50,880 --> 00:00:58,700 [Alice]
Okay. [noise] All right.

00:01:05,140 --> 00:01:19,220 [Alice]
Yes, the queen, Hedwig. We are so excited. I think this is going to be the episode that we've all been waiting for. And there's so much to talk about that we, we figured we, we really shouldn't cram it all into one episode.

00:02:33,260 --> 00:02:34,200 [Alice]
Definitely not.

00:02:45,210 --> 00:02:47,239 [Alice]
We haven't done her justice at all.

00:03:24,280 --> 00:03:24,380 [Martha]
Mm-hmm.

00:03:24,400 --> 00:03:29,780 [Alice]
... in the appropriate head space to talk about this amazing person. I mean-

00:03:29,790 --> 00:03:30,140 [Martha]
And I [chuckles]-

00:03:30,150 --> 00:03:30,740 [Alice]
... bird.
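A transcript in this layout can be checked mechanically. The sketch below is only an illustration: it splits a transcript on blank lines, parses each header, and reports timing warnings. The function names are hypothetical, and the ordering check rests on an assumption that the guidelines do not state explicitly (the example above happens to satisfy it): that each segment starts no earlier than the previous one ends. Genuine overlapping speech may be legitimate, so the results are best read as warnings, not errors.

```python
import re

HEADER_RE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) \[([^\]]+)\]$"
)


def _ms(hours, minutes, seconds, millis):
    return ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000 + int(millis)


def parse_transcript(text):
    """Return a list of (start_ms, end_ms, speaker, text) tuples."""
    segments = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if not lines:
            continue  # empty input or stray blank block
        match = HEADER_RE.match(lines[0].strip())
        if match is None:
            raise ValueError(f"bad header: {lines[0]!r}")
        g = match.groups()
        body = " ".join(line.strip() for line in lines[1:])
        segments.append((_ms(*g[0:4]), _ms(*g[4:8]), g[8], body))
    return segments


def timing_warnings(segments):
    """Flag segments whose end is not after their start, or that
    start before the previous segment ends (assumed undesirable)."""
    warnings = []
    prev_end = None
    for index, (start, end, _speaker, _text) in enumerate(segments):
        if end <= start:
            warnings.append(f"segment {index}: end is not after start")
        if prev_end is not None and start < prev_end:
            warnings.append(f"segment {index}: starts before previous segment ends")
        prev_end = end
    return warnings


sample = """\
00:00:01,440 --> 00:00:08,680 [Martha]
Hedwig, part one. Hi, and welcome to "The Real Weird Sisters." I'm Martha.

00:00:08,960 --> 00:00:10,140 [Alice]
And I'm Alice.
"""
segments = parse_transcript(sample)
print(timing_warnings(segments))
```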

Common Conventions

Scenario                       Convention                              Example
Laughter                       Bracketed annotation                    [chuckles], [laughs], [both laugh]
Sounds                         Bracketed annotation                    [sighs], [noise], [clears throat]
Interrupted speech             Hyphen at end                           And I was thinking-
Trailing off                   Ellipsis                                I don't know...
Continuing after interruption  Ellipsis at start                       ... and that's why I think
Filler words                   Include as spoken                       Um, uh, like, you know
Emphasis                       Quotation marks or italics, by context  "very" important
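Several of these conventions are regular enough to detect automatically. The sketch below is a hypothetical helper (`describe_conventions` is not part of any tool named in this document) that reports which markers appear in a line of transcribed text. It assumes an ellipsis is written either as three periods or as the single character "…", and it does not attempt to detect filler words or emphasis.

```python
import re

# Bracketed annotations such as [chuckles] or [noise].
ANNOTATION_RE = re.compile(r"\[([^\[\]]+)\]")
ELLIPSES = ("...", "…")  # assumed spellings of an ellipsis


def describe_conventions(text):
    """Report which transcription markers appear in one line of text."""
    stripped = text.strip()
    return {
        "annotations": ANNOTATION_RE.findall(stripped),
        "interrupted": stripped.endswith("-"),        # hyphen at end
        "trailing_off": stripped.endswith(ELLIPSES),  # ellipsis at end
        "continuation": stripped.startswith(ELLIPSES),  # ellipsis at start
    }


print(describe_conventions("And I [chuckles]-"))
print(describe_conventions("... bird."))
```

A check like this can support a review pass, for example by listing every interrupted segment so a reviewer can confirm that the next turn by the same speaker begins with an ellipsis.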

Quality Checklist
