TY - JOUR
T1 - Transformer-based language-independent gender recognition in noisy audio environments
AU - Anidjar, Or Haim
AU - Yozevitch, Roi
N1 - Publisher Copyright: © The Author(s) 2025.
PY - 2025/12
Y1 - 2025/12
AB - This study proposes a language-independent method for identifying a speaker's gender from an audio clip recorded in a noisy environment. Audio clips are processed in two ways, as a Mel-spectrogram and as the emission of the Wav2Vec2 acoustic model, and the advantages and disadvantages of each method are examined. A series of experiments is presented across five languages (English, Arabic, Spanish, French, and Russian), each containing male and female audio clips. An analysis of these languages is carried out, comparing their individual characteristics against a five-language model. The goal of this study is to determine the speaker's gender from an audio clip regardless of language or complex background noise, such as that of nightclubs or stadiums. This research also addresses the critical issue of gender bias in voice recognition systems. It highlights the challenges posed by the over-representation of male voices in training datasets and the resulting impact on the accuracy and fairness of gender classification, particularly for female voices. To ensure balance and mitigate this bias, the approach maintains an equal number of audio clips for male and female voices. The experimental results indicate that the traditional spectrogram method outperformed the Wav2Vec2 transformer method. For Russian, the spectrogram method achieved an accuracy of 99%, while the Wav2Vec2 transformer method achieved only 89%. Tests in noisy and silent environments show that a model trained on both conditions achieved better accuracy. The results also indicate that a model trained on data from a wide variety of languages yielded better results. The findings offer important insights for developing more reliable, accurate, and equitable acoustic gender detection systems.
KW - Automatic speech recognition
KW - Language independent gender recognition
KW - Wav2Vec 2.0
UR - http://www.scopus.com/inward/record.url?scp=105003814316&partnerID=8YFLogxK
U2 - 10.1038/s41598-025-99011-x
DO - 10.1038/s41598-025-99011-x
M3 - Article
AN - SCOPUS:105003814316
SN - 2045-2322
VL - 15
JO - Scientific Reports
JF - Scientific Reports
IS - 1
M1 - 14421
ER -