Deep Phonetics

Communications Engineering / Heinz Nixdorf Institute

The speech signal is a rich source of information that conveys not only linguistic but also paralinguistic information such as identity, gender, emotional state or age. However, these features are hidden in complex, non-transparent variations of the speech signal. The project "Deep Phonetics" of our Communications Engineering workgroup investigates precisely these paralinguistic dimensions and addresses the following questions:

How can features such as voice, speech tempo or emotions be disentangled and selectively modified? And how do people perceive these changes?

The aims of this project were:

(i) to use artificial intelligence methods to disentangle different dimensions of the speech signal and

(ii) to better understand how these dimensions are perceived by humans.

Building on deep generative models, the workgroup first developed methods that can disentangle and manipulate the voice, the speech rate and selected dimensions of voice quality. The results showed that individual dimensions can be manipulated independently of one another, and that manipulating several dimensions simultaneously yields a greater similarity to a desired target speaker.
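The following is a minimal sketch of what such a factor-wise disentangling model could look like, assuming mel-spectrogram inputs and a PyTorch implementation. The module names, dimensions and loss are illustrative assumptions, not the project's actual architecture; in practice additional disentanglement objectives would be needed.

```python
# Sketch of a disentangling autoencoder for speech (illustrative, not the
# project's model): one small encoder per paralinguistic factor, a shared
# decoder. Swapping one factor code at synthesis time manipulates that
# factor while the others stay fixed.
import torch
import torch.nn as nn

class FactorEncoder(nn.Module):
    """Encodes one factor (e.g. voice, tempo, voice quality) into a latent vector."""
    def __init__(self, n_mels: int = 80, latent_dim: int = 16):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 64, batch_first=True)
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> (batch, latent_dim)
        _, h = self.rnn(mel)
        return self.proj(h[-1])

class Decoder(nn.Module):
    """Reconstructs the mel-spectrogram from the concatenated factor codes."""
    def __init__(self, n_mels: int = 80, latent_dim: int = 16, n_factors: int = 3):
        super().__init__()
        self.expand = nn.Linear(latent_dim * n_factors, 64)
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.out = nn.Linear(64, n_mels)

    def forward(self, codes: torch.Tensor, frames: int) -> torch.Tensor:
        # Broadcast the joint code over the time axis and decode frame-wise.
        h = self.expand(codes).unsqueeze(1).repeat(1, frames, 1)
        h, _ = self.rnn(h)
        return self.out(h)

class DisentanglingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_voice = FactorEncoder()
        self.enc_tempo = FactorEncoder()
        self.enc_quality = FactorEncoder()
        self.decoder = Decoder()

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        codes = torch.cat(
            [self.enc_voice(mel), self.enc_tempo(mel), self.enc_quality(mel)], dim=-1
        )
        return self.decoder(codes, mel.size(1))

# Usage: reconstruct a batch of random mel-spectrograms (2 utterances, 100 frames).
model = DisentanglingAutoencoder()
mel = torch.randn(2, 100, 80)
recon = model(mel)
loss = nn.functional.mse_loss(recon, mel)  # plus disentanglement terms in practice
```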

Furthermore, the workgroup refined the evaluation of speech synthesis and found that not only the common constructs "audio quality" and "intelligibility", but also paralinguistic constructs such as "speech tempo", "speaker origin" and "friendliness" are essential in the human evaluation of speech synthesis. With the increasing influence of AI agents in our everyday lives, these findings can contribute to the acceptance of AI voices.
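As a rough illustration of how such a multi-construct listening test could be scored, the sketch below averages listener ratings separately per construct. The construct names come from the text; the 1-5 rating scale, the helper function and the dummy responses are assumptions for illustration only.

```python
# Sketch: per-construct mean opinion scores from listener responses.
from statistics import mean
from typing import Dict, List

CONSTRUCTS = [
    "audio quality",
    "intelligibility",
    "speech tempo",
    "speaker origin",
    "friendliness",
]

def mean_opinion_scores(responses: List[Dict[str, int]]) -> Dict[str, float]:
    """Average the 1-5 ratings of all listeners separately for each construct."""
    scores = {}
    for construct in CONSTRUCTS:
        ratings = [r[construct] for r in responses if construct in r]
        if ratings:  # skip constructs that no listener rated
            scores[construct] = mean(ratings)
    return scores

# Example call with two dummy listener responses (values are placeholders):
print(mean_opinion_scores([
    {"audio quality": 4, "intelligibility": 5, "friendliness": 4},
    {"audio quality": 5, "intelligibility": 4, "friendliness": 5},
]))
```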