Utterance segmentation appears to be entirely independent of (the features used for) speaker labeling. Specifically, in `three sixty eight reduce speed to two one zero then descend and maintain three thousand advise seven's to one zero thousand three thousand US air six eighty six`, the speaker labeling correctly identifies that a new speaker starts after (what is transcribed as) `seven's` and before (what is transcribed as) `to`; the change is very clear because it goes from a male voice to a female voice. Even so, that entire utterance is chunked together. This leads to errors, because the 'to' should be transcribed as 'two' (and would have been, had it been an utterance-initial word).
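To make the requested behavior concrete, here is a minimal sketch of splitting a chunk wherever the speaker label changes. It assumes word-level speaker labels are available to the segmenter, which may not match the actual pipeline; the `Word` type, the `split_on_speaker_change` function, and the labels `S1`/`S2` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    speaker: str  # hypothetical per-word speaker label, e.g. "S1"


def split_on_speaker_change(words: List[Word]) -> List[List[Word]]:
    """Group consecutive words into utterances, starting a new one at each speaker change."""
    utterances: List[List[Word]] = []
    for word in words:
        if utterances and utterances[-1][-1].speaker == word.speaker:
            # Same speaker as the previous word: extend the current utterance.
            utterances[-1].append(word)
        else:
            # First word, or the speaker changed: start a new utterance.
            utterances.append([word])
    return utterances


# Fragment of the example above; the speaker labels are assumed, not taken from real output.
words = [
    Word("advise", "S1"),
    Word("seven's", "S1"),
    Word("to", "S2"),
    Word("one", "S2"),
    Word("zero", "S2"),
    Word("thousand", "S2"),
]
utterances = split_on_speaker_change(words)
first_words = [utt[0].text for utt in utterances]
```

With a split like this, `to` becomes the first word of its own utterance, which is the position in which it would reportedly have been transcribed as `two`.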
Why is it useful?
|Who would benefit from this IDEA?||As a customer transcribing multi-speaker speech, I want the transcription to be as accurate and correctly segmented as possible.|
How should it work?