Artificial Vocal Learning guided by Phoneme Recognition and Visual
Information
- Paul Krug
- Peter Birkholz
- Branislav Gerazov
- Daniel Rudolph van Niekerk
- Anqi Xu
- Yi Xu
Abstract
This paper introduces a paradigm shift regarding vocal learning
simulations, in which the communicative function of speech acquisition
determines the learning process and intelligibility is considered the
main measure of learning success. To this end, a novel approach for
artificial early vocal learning is presented that utilizes deep neural
network-based phoneme recognition in order to calculate the speech
acquisition objective function. This function guides a learning
framework that involves the state-of-the-art articulatory speech
synthesizer VocalTractLab as the motor-to-acoustic forward model. It is
shown that in this way an extensive set of German phonemes consisting of
most German consonants and all stressed vowels can be produced
successfully. The synthetic phonemes were rated as highly intelligible
by human listeners in a listening experiment. Furthermore, it is shown
that visual speech information, such as lip and jaw movements, can be
extracted from video recordings and incorporated into the learning
framework as an additional loss component during the optimization
process. It was observed that this visual loss did not increase the
overall intelligibility of phonemes. Instead, the visual loss acted as a
regularization mechanism that facilitated the finding of more
biologically plausible solutions in the articulatory domain.

Published in 2023 in IEEE/ACM Transactions on Audio, Speech, and Language Processing, volume 31, pages 1734-1744. DOI: 10.1109/TASLP.2023.3264454
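
The abstract describes an objective function derived from phoneme recognition, with a visual loss added as a further component during optimization. The following is a minimal sketch of how such a combined objective could be formed. It is not the authors' implementation: the cross-entropy form of the recognition loss, the mean-squared-error form of the visual loss, the weighting factor `visual_weight`, and all function and parameter names are assumptions made for illustration only.

```python
import math


def phoneme_loss(posteriors, target):
    """Assumed recognition loss: negative log posterior of the target phoneme.

    `posteriors` maps phoneme labels to recognizer probabilities for one
    synthesized sound (hypothetical interface, not taken from the paper).
    """
    return -math.log(max(posteriors[target], 1e-12))


def visual_loss(synth_features, video_features):
    """Assumed visual loss: mean squared error between lip/jaw features of
    the synthesizer state and features extracted from video."""
    n = len(synth_features)
    return sum((s - v) ** 2 for s, v in zip(synth_features, video_features)) / n


def total_loss(posteriors, target, synth_features, video_features,
               visual_weight=0.5):
    """Combined objective: recognition loss plus weighted visual loss.

    `visual_weight` is a hypothetical hyperparameter.
    """
    return (phoneme_loss(posteriors, target)
            + visual_weight * visual_loss(synth_features, video_features))


if __name__ == "__main__":
    # Toy values only; they do not come from the paper.
    posteriors = {"a": 0.5, "o": 0.5}
    print(total_loss(posteriors, "a", [0.0], [2.0]))
```

In a sketch like this, the visual term does not reward intelligibility directly. It only penalizes articulatory configurations whose lip and jaw features deviate from the observed ones, which is consistent with the regularizing role reported in the abstract.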