Abstract
Non-autoregressive text-to-speech (TTS) has recently attracted considerable
attention due to its reliability and fast inference. Despite these impressive
achievements, non-autoregressive speech synthesis still faces some critical
challenges. A major issue is that non-autoregressive methods require an
external toolkit to align the speech with the transcript, which substantially
complicates the process of building the model. Besides, non-autoregressive
methods suffer from the one-to-many mapping problem, where the same transcript
may correspond to speech in numerous styles. This problem can harm the
expressiveness of the generated speech, because the model tends to produce
output with an averaged style. To address these issues, this paper proposes a
cooperative learning strategy for non-autoregressive speech synthesis.
Specifically, the proposed method employs both an autoregressive and a
non-autoregressive TTS model during training. The autoregressive model is
trained as a partner at each iteration, providing the non-autoregressive model
with essential alignment information as well as the prosody embedding of the
speech. With this knowledge, the non-autoregressive model can then be trained
without relying on external alignment tools. Meanwhile, the prosody embedding
from the autoregressive model and the pitch information extracted from the raw
audio can be utilised together to alleviate the one-to-many mapping problem.
Experimental results demonstrate that our approach produces speech comparable
to that of the widely used FastSpeech 2 model while drastically reducing the
complexity of constructing a non-autoregressive TTS model.
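
The abstract does not include code; the following is only a minimal PyTorch-style sketch of the cooperative training step as described above, under the assumption that durations are derived from the autoregressive partner's attention and that the prosody embedding and frame-level pitch are added as conditioning. All names here (ToyARPartner, ToyNARStudent, cooperative_step, etc.) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention_to_durations(attn: torch.Tensor) -> torch.Tensor:
    """Collapse the AR partner's soft alignment (B, T_mel, T_text) into
    per-token durations: how many mel frames attend most to each token."""
    hard = attn.argmax(dim=-1)                                 # (B, T_mel)
    one_hot = F.one_hot(hard, num_classes=attn.size(-1))       # (B, T_mel, T_text)
    return one_hot.sum(dim=1).float()                          # (B, T_text)


class ToyARPartner(nn.Module):
    """Stand-in for the autoregressive partner: yields a mel prediction, a soft
    text-to-mel alignment, and a prosody embedding of the reference speech."""
    def __init__(self, vocab=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.query_proj = nn.Linear(mel_dim, hidden)
        self.mel_proj = nn.Linear(hidden, mel_dim)
        self.prosody_rnn = nn.GRU(mel_dim, hidden, batch_first=True)

    def forward(self, text, mel):
        enc = self.embed(text)                                  # (B, T_text, H)
        q = self.query_proj(mel)                                # (B, T_mel, H)
        attn = torch.softmax(q @ enc.transpose(1, 2), dim=-1)   # (B, T_mel, T_text)
        ar_mel = self.mel_proj(attn @ enc)                      # (B, T_mel, mel_dim)
        _, h = self.prosody_rnn(mel)                            # summarise reference speech
        return ar_mel, attn, h[-1]                              # prosody embedding (B, H)


class ToyNARStudent(nn.Module):
    """Stand-in for the non-autoregressive model: expands text embeddings by the
    durations from the AR partner and is conditioned on prosody and pitch."""
    def __init__(self, vocab=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.dur_head = nn.Linear(hidden, 1)
        self.prosody_proj = nn.Linear(hidden, hidden)
        self.pitch_proj = nn.Linear(1, hidden)
        self.decoder = nn.Linear(hidden, mel_dim)

    def forward(self, text, durations, prosody, pitch):
        enc = self.embed(text)                                  # (B, T_text, H)
        dur_pred = self.dur_head(enc).squeeze(-1)               # (B, T_text)
        # Length regulation: repeat each token embedding by its duration.
        expanded = torch.stack([e.repeat_interleave(d.long(), dim=0)
                                for e, d in zip(enc, durations)])   # (B, T_mel, H)
        cond = expanded + self.prosody_proj(prosody).unsqueeze(1)   # add prosody
        cond = cond + self.pitch_proj(pitch.unsqueeze(-1))          # add frame-level pitch
        return self.decoder(cond), dur_pred


def cooperative_step(ar, nar, optim, text, mel, pitch):
    """One joint training iteration: the AR partner supplies alignment and
    prosody, so the NAR model needs no external forced aligner."""
    ar_mel, attn, prosody = ar(text, mel)
    durations = attention_to_durations(attn.detach())           # alignment knowledge
    nar_mel, dur_pred = nar(text, durations, prosody.detach(), pitch)
    loss = (F.l1_loss(ar_mel, mel)                               # AR reconstruction
            + F.l1_loss(nar_mel, mel)                            # NAR reconstruction
            + F.mse_loss(dur_pred, torch.log1p(durations)))      # duration predictor
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


ar, nar = ToyARPartner(), ToyNARStudent()
optim = torch.optim.Adam(list(ar.parameters()) + list(nar.parameters()), lr=1e-3)
text = torch.randint(0, 64, (2, 12))         # toy phoneme ids
mel = torch.randn(2, 40, 80)                 # toy mel-spectrogram target
pitch = torch.randn(2, 40)                   # toy frame-level pitch contour
print(cooperative_step(ar, nar, optim, text, mel, pitch))
```

In this sketch the durations are detached from the autoregressive graph, so the partner is supervised only by its own reconstruction loss; whether the actual method shares gradients between the two models is a design detail not specified in the abstract.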