USTC-iFLYTEK (科大讯飞) |
Tasks: special-mark processing, polyphone classification, break prediction, focus prediction. Methodology: Bidirectional Encoder Representations from Transformers (BERT)-based multi-task models |
LSTM-RNN models with an autoregressive model structure |
A statistical parametric speech synthesis (SPSS) system with GAN-based multi-task acoustic modeling. Fundamental frequency (F0), 41-dimensional mel-cepstra (MCEP), and band aperiodicity (BAP) were adopted as the acoustic features |
WaveNet. The acoustic feature used was the joint feature vector of mel-cepstrum, F0, and the voiced/unvoiced (U/V) decision. Multi-speaker dataset for augmentation |
Text side: manual annotations: Pinyin (with tone), PW, PP, and focus position. Speech side: frame-level acoustic features |
DeepSound (深声科技) |
Tasks: text normalization, qingsheng (neutral tone), tone sandhi, and erhua: rule-based; G2P: Bi-LSTM; prosody prediction (PW, PPH, IPH): Bi-LSTM. A Bi-LSTM-based recurrent neural network (RNN) is used in the G2P module for polyphone and prosody prediction. |
/ |
VQ-VAE + an embedding + prenet operation + GAN-based postfiltering (robust on the unclean dataset) |
robust multi-speaker neural vocoder conditioned on the mel spectrograms |
manual and automatic tagging operations: phoneme, tone, prosody, and pause duration |
Tencent (腾讯) |
Festival front-end to predict phoneme, tone, and other linguistic features + sentence embeddings generated by a pre-trained BERT model. |
/ |
A multi-speaker model is trained first. |
WaveNet; a multi-speaker model is trained first. |
linguistic features (the HTS full-context labels) and sentence embedding; mel spectrograms + channel embedding |
Lingban (灵伴) |
text normalization, word segmentation, part-of-speech tagging, phonetic disambiguation; word segmentation of the sentence, parts of speech (POS) of the word sequence, and prosodic hierarchy |
/ |
DNN-LSTM |
WaveNet; ground-truth mel-spectrograms plus F0 |
spectral envelope, fundamental frequency (F0), contextual labels (phone-related and word-related features) |
Horizon (Nanjing team) |
The corresponding texts were manually embedded into 476-dimensional vectors using our own text analyzing system. The embedded vectors consisted of one-hot encoded phonemes, tones, parts of speech, prosodic boundaries, and position information. Prosodic boundaries: phoneme boundaries, syllable boundaries, phrase boundaries, secondary phrase boundaries |
|
DCTTS[14] and Deep Voice 3[13] |
WaveRNN |
|
|
|
|
|
|
|