海阔天空蓝

A speech-synthesis algorithm engineer who has tried n kinds of sports


BC Challenge 2019: Technical Analysis of the Top 5 Teams

Components compared for each team: Frontend, Duration Modelling, Spectrogram Modelling, Vocoder, Features

USTC-iflytek (科大讯飞)
Frontend: Tasks: special-mark processing, polyphone classification, break prediction, focus prediction. Methodology: multi-task models based on Bidirectional Encoder Representations from Transformers (BERT).
Duration modelling: LSTM-RNN models; autoregressive model structure.
Spectrogram modelling: A statistical parametric speech synthesis (SPSS) system with GAN-based multi-task acoustic modelling. Fundamental frequency (F0), 41-dimensional mel-cepstra (MCEP), and band aperiodicity (BAP) were adopted as the acoustic features.
Vocoder: WaveNet. The acoustic feature used was the joint feature vector of mel-cepstrum, F0, and the voiced/unvoiced (u/v) decision.
Features: Multi-speaker dataset for augmentation. Text side: manual annotations of Pinyin (with tone), PW, PP, and focus position. Speech side: frame-level acoustic features.

DeepSound (深声科技)
Frontend: Text normalization, qingsheng (neutral tone), tone sandhi, and erhua: rule-based. G2P: Bi-LSTM. Prosody prediction (PW, PPH, IPH): Bi-LSTM. A BiLSTM-based recurrent network (RNN) is used in the G2P module for polyphone and prosody prediction.
Duration modelling: /
Spectrogram modelling: VQVAE, plus an embedding + prenet operation, plus GAN-based postfiltering (robust on the unclean dataset).
Vocoder: Robust multi-speaker neural vocoder conditioned on the mel spectrograms.
Features: Manual and automatic tagging operations: phoneme, tone, prosody, and pause duration.

Tencent (腾讯)
Frontend: Festival front-end to predict phoneme, tone, and other linguistic features, plus BERT sentence embeddings generated by a pre-trained BERT model.
Duration modelling: /
Spectrogram modelling: A multi-speaker model is trained first.
Vocoder: WaveNet; a multi-speaker model is trained first.
Features: Linguistic features (the HTS full-context label) and sentence embedding; mel spectrograms plus channel embedding.

Lingban (灵伴)
Frontend: Text normalization, word segmentation, part-of-speech tagging, phonetic disambiguation. Outputs: word segmentation of the sentence, parts of speech (POS) of the word sequence, and prosodic hierarchy.
Duration modelling: /
Spectrogram modelling: DNN-LSTM.
Vocoder: WaveNet, conditioned on ground-truth mel spectrograms plus F0.
Features: Spectral envelope, fundamental frequency (F0), contextual labels (phone-related and word-related features).

Horizon (Nanjing team)
Frontend: The corresponding texts were manually embedded into 476-dimensional vectors using the team's own text-analysis system. The embedded vectors consisted of one-hot encoded phonemes, tones, part-of-speech, prosodic boundaries, and position information. Prosody boundaries: phoneme boundaries, syllable boundaries, phrase boundaries, secondary phrase boundaries.
Spectrogram modelling: DCTTS[14] and Deep Voice 3[13].
Vocoder: WaveRNN.
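
The Horizon entry describes its input vectors as a concatenation of one-hot encoded phonemes, tones, part-of-speech tags, prosodic boundaries, and position information. A minimal sketch of that kind of encoding might look like the following. The symbol inventories, the extra "none" boundary class, the relative-position scaling, and the function names are all assumptions made for illustration; the actual 476-dimensional layout is not given in the source, and this toy version produces only 21 dimensions.

```python
# Hypothetical inventories: the real system's symbol sets and the exact
# 476-dimensional layout are not described in the source.
PHONEMES = ["sil", "a", "i", "n", "h", "ao"]
TONES = [1, 2, 3, 4, 5]  # 5 stands in for the neutral tone (assumption)
POS_TAGS = ["n", "v", "adj", "other"]
BOUNDARIES = ["none", "phoneme", "syllable", "phrase", "secondary_phrase"]


def one_hot(value, inventory):
    """Return a list with 1.0 at the index of `value` and 0.0 elsewhere."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(value)] = 1.0
    return vec


def encode_unit(phoneme, tone, pos, boundary, position, length):
    """Concatenate one-hot categorical features with a relative position."""
    # Position of the unit in its sequence, scaled to [0, 1].
    rel_pos = position / max(length - 1, 1)
    return (
        one_hot(phoneme, PHONEMES)
        + one_hot(tone, TONES)
        + one_hot(pos, POS_TAGS)
        + one_hot(boundary, BOUNDARIES)
        + [rel_pos]
    )


# Total dimensionality of this toy encoding.
FEATURE_DIM = len(PHONEMES) + len(TONES) + len(POS_TAGS) + len(BOUNDARIES) + 1

# Encode the second of two units: phoneme "ao", tone 3, adjective,
# followed by a syllable boundary.
vec = encode_unit("ao", 3, "adj", "syllable", position=1, length=2)
```

Each categorical field contributes exactly one active dimension, so the vector stays sparse and its length is fixed by the inventory sizes. Growing the inventories to realistic sizes is what would push the dimensionality into the hundreds.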