TY - JOUR
T1 - Speech and multilingual natural language framework for speaker change detection and diarization
AU - Anidjar, Or Haim
AU - Estève, Yannick
AU - Hajaj, Chen
AU - Dvir, Amit
AU - Lapidot, Itshak
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2023/3/1
Y1 - 2023/3/1
N2 - Speaker Change Detection (SCD) is the problem of splitting an audio-recording by its speaker-turns. Many real-world problems, such as the Speaker Diarization (SD) or automatic speech transcription, are influenced by the quality of the speaker-turns estimation. Previous works have already shown that auxiliary textual information (for mono-lingual systems) can be of great use for detection of speaker-turns and the diarization systems’ performance. In this paper, we suggest a framework for speaker-turn estimation, as well as the determination of clustered speaker identities to the SD system, and examine our approach over a multi-lingual dataset that consists of three mono-lingual datasets—in English, French, and Hebrew. As such, we propose a generic and language-independent framework for the SCD problem that is learned through textual information using state-of-the-art transformer-based techniques and speech-embedding modules. Comprehensive experimental evaluation shows that (i) our multi-lingual SCD framework is competitive enough when compared to a framework over mono-lingual datasets, and that (ii) textual information improves the solution's quality compared to the speech signal-based approach. In addition, we show that our multi-lingual SCD approach does not harm the performance of SD systems.
KW - Speaker change detection
KW - Speaker diarization
KW - Speaker embedding
KW - Speech recognition
KW - Transformers
UR - http://www.scopus.com/inward/record.url?scp=85142151662&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2022.119238
DO - 10.1016/j.eswa.2022.119238
M3 - Article
AN - SCOPUS:85142151662
SN - 0957-4174
VL - 213
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 119238
ER -