Speaker-attributed transcription from the clustered backend
(silero VAD + ECAPA-TDNN + agglomerative clustering) aligned to
faster-whisper words. Click any word or the timeline to seek; the
transcript follows the audio. Speaker count is estimated, not
given. Toggle show reference to see the ground-truth turns
outlined under the predictions; gaps between filled and outlined bands
are the diarization errors the benchmark scores.
Audio is synthetic conversation built from LibriSpeech dev-clean
(read speech, no overlap), the same seeded mixtures the
benchmark
scores. Serve locally with python -m http.server -d demo if not viewing on GitHub Pages.