Abstract
Non-autoregressive text-to-speech (TTS) has recently attracted considerable
attention due to its reliability and fast inference. Despite these impressive
achievements, non-autoregressive speech synthesis still faces some critical
challenges. A major issue is that non-autoregressive methods require an
external toolkit to align the speech with the transcript, which substantially
complicates the process of building the model. Besides, non-autoregressive
methods suffer from the one-to-many mapping problem, where the same transcript
may correspond to speech in numerous styles. This problem can harm the
expressiveness of the generated speech, because the model tends to produce
output with an averaged style. To address these issues, this paper proposes a
cooperative learning strategy for non-autoregressive speech synthesis.
Specifically, the proposed method employs both an autoregressive and a
non-autoregressive TTS model during training. The autoregressive model is
trained as a partner at each iteration, providing the non-autoregressive model
with essential alignment information as well as the prosody embedding of the
speech. With this knowledge, the non-autoregressive model can then be trained
without relying on external alignment tools. Meanwhile, the prosody embedding
from the autoregressive model and the pitch information extracted from the raw
audio can be utilised together to alleviate the one-to-many mapping problem.
Experimental results demonstrate that our approach produces speech comparable
to that of the widely used FastSpeech 2 model while drastically reducing the
complexity of constructing a non-autoregressive TTS model.
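
The abstract does not include code; the following is only a minimal PyTorch-style sketch of the cooperative training step as described above, under the assumption that durations are derived from the autoregressive partner's attention and that the prosody embedding and frame-level pitch are added as conditioning. All names here (ToyARPartner, ToyNARStudent, cooperative_step, etc.) are hypothetical illustrations, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attention_to_durations(attn: torch.Tensor) -> torch.Tensor:
    """Collapse the AR partner's soft alignment (B, T_mel, T_text) into
    per-token durations: how many mel frames attend most to each token."""
    hard = attn.argmax(dim=-1)                                 # (B, T_mel)
    one_hot = F.one_hot(hard, num_classes=attn.size(-1))       # (B, T_mel, T_text)
    return one_hot.sum(dim=1).float()                          # (B, T_text)


class ToyARPartner(nn.Module):
    """Stand-in for the autoregressive partner: yields a mel prediction, a soft
    text-to-mel alignment, and a prosody embedding of the reference speech."""
    def __init__(self, vocab=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.query_proj = nn.Linear(mel_dim, hidden)
        self.mel_proj = nn.Linear(hidden, mel_dim)
        self.prosody_rnn = nn.GRU(mel_dim, hidden, batch_first=True)

    def forward(self, text, mel):
        enc = self.embed(text)                                  # (B, T_text, H)
        q = self.query_proj(mel)                                # (B, T_mel, H)
        attn = torch.softmax(q @ enc.transpose(1, 2), dim=-1)   # (B, T_mel, T_text)
        ar_mel = self.mel_proj(attn @ enc)                      # (B, T_mel, mel_dim)
        _, h = self.prosody_rnn(mel)                            # summarise reference speech
        return ar_mel, attn, h[-1]                              # prosody embedding (B, H)


class ToyNARStudent(nn.Module):
    """Stand-in for the non-autoregressive model: expands text embeddings by the
    durations from the AR partner and is conditioned on prosody and pitch."""
    def __init__(self, vocab=64, mel_dim=80, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.dur_head = nn.Linear(hidden, 1)
        self.prosody_proj = nn.Linear(hidden, hidden)
        self.pitch_proj = nn.Linear(1, hidden)
        self.decoder = nn.Linear(hidden, mel_dim)

    def forward(self, text, durations, prosody, pitch):
        enc = self.embed(text)                                  # (B, T_text, H)
        dur_pred = self.dur_head(enc).squeeze(-1)               # (B, T_text)
        # Length regulation: repeat each token embedding by its duration.
        expanded = torch.stack([e.repeat_interleave(d.long(), dim=0)
                                for e, d in zip(enc, durations)])   # (B, T_mel, H)
        cond = expanded + self.prosody_proj(prosody).unsqueeze(1)   # add prosody
        cond = cond + self.pitch_proj(pitch.unsqueeze(-1))          # add frame-level pitch
        return self.decoder(cond), dur_pred


def cooperative_step(ar, nar, optim, text, mel, pitch):
    """One joint training iteration: the AR partner supplies alignment and
    prosody, so the NAR model needs no external forced aligner."""
    ar_mel, attn, prosody = ar(text, mel)
    durations = attention_to_durations(attn.detach())           # alignment knowledge
    nar_mel, dur_pred = nar(text, durations, prosody.detach(), pitch)
    loss = (F.l1_loss(ar_mel, mel)                               # AR reconstruction
            + F.l1_loss(nar_mel, mel)                            # NAR reconstruction
            + F.mse_loss(dur_pred, torch.log1p(durations)))      # duration predictor
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


ar, nar = ToyARPartner(), ToyNARStudent()
optim = torch.optim.Adam(list(ar.parameters()) + list(nar.parameters()), lr=1e-3)
text = torch.randint(0, 64, (2, 12))         # toy phoneme ids
mel = torch.randn(2, 40, 80)                 # toy mel-spectrogram target
pitch = torch.randn(2, 40)                   # toy frame-level pitch contour
print(cooperative_step(ar, nar, optim, text, mel, pitch))
```

In this sketch the durations are detached from the autoregressive graph, so the partner is supervised only by its own reconstruction loss; whether the actual method shares gradients between the two models is a design detail not specified in the abstract.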