Speech and voice conditions can alter the acoustic properties of speech,
which could impact the performance of paralinguistic models for affect for
people with atypical speech. We evaluate publicly available models for
recognizing categorical and dimensional affect from speech on a dataset of
atypical speech, comparing results to datasets of typical speech. We
investigate three dimensions of speech atypicality: intelligibility, which is
related to pronunciation; monopitch, which is related to prosody; and
harshness, which is related to voice quality. We look at (1) distributional
trends of categorical affect predictions within the dataset, (2) distributional
comparisons of categorical affect predictions to similar datasets of typical
speech, and (3) correlation strengths between text and speech predictions for
spontaneous speech for valence and arousal. We find that the output of affect
models is significantly impacted by the presence and degree of speech
atypicalities. For instance, the percentage of speech predicted as sad is
significantly higher for all types and grades of atypical speech when compared
to similar typical speech datasets. In a preliminary investigation on improving
robustness for atypical speech, we find that fine-tuning models on
pseudo-labeled atypical speech data improves performance on atypical speech
without impacting performance on typical speech. Our results emphasize the need
for broader training and evaluation datasets for speech emotion models, and for
modeling approaches that are robust to voice and speech differences.