Blizzard Challenge 2019: Technical Analysis of the Top 5 Teams

Team	Frontend	Duration Modelling	Spectrogram Modelling	Vocoder	Features
USTC-iflytek (科大讯飞)	Tasks: special-mark processing, polyphone classification, break prediction, focus prediction. Methodology: multi-task models based on Bidirectional Encoder Representations from Transformers (BERT)	LSTM-RNN models with an autoregressive structure	Statistical parametric speech synthesis (SPSS) system with GAN-based multi-task acoustic modeling. Acoustic features: fundamental frequency (F0), 41-dimensional mel-cepstra (MCEP), and band aperiodicity (BAP)	WaveNet. The acoustic feature is the joint vector of mel-cepstrum, F0 and the u/v (voiced/unvoiced) decision. A multi-speaker dataset is used for augmentation	Text side: manual annotations: Pinyin (with tone), PW (prosodic word), PP (prosodic phrase), and focus position. Speech side: frame-level acoustic features:
DeepSound (深声科技)	Text normalization, qingsheng (neutral tone), tone sandhi and erhua: rule-based. G2P: Bi-LSTM. Prosody prediction (PW, PPH, IPH): Bi-LSTM. A Bi-LSTM-based recurrent neural network (RNN) is used in the G2P module for polyphone and prosody prediction.	/	VQ-VAE + an embedding and prenet operation + GAN-based postfiltering (robust to the unclean dataset)	Robust multi-speaker neural vocoder conditioned on mel spectrograms	Manual and automatic tagging: phoneme, tone, prosody and pause duration
Tencent (腾讯)	Festival front end predicts phonemes, tones and other linguistic features; sentence embeddings are generated by a pre-trained BERT model.	/	A multi-speaker model is trained first.	WaveNet. A multi-speaker model is trained first.	Linguistic features (HTS full-context labels) and sentence embeddings; mel spectrograms + channel embedding
Lingban (灵伴)	Text normalization, word segmentation, part-of-speech (POS) tagging, phonetic disambiguation; word segmentation of the sentence, POS tags of the word sequence, and prosodic hierarchy	/	DNN-LSTM	WaveNet; ground-truth mel spectrograms plus F0	Spectral envelope, fundamental frequency (F0), contextual labels (phone-related and word-related features)
Horizon (Nanjing team)	The corresponding texts were manually embedded into 476-dimensional vectors using the team's own text analysis system. The embedded vectors consist of one-hot encoded phonemes, tones, parts of speech, prosodic boundaries and position information. Prosodic boundaries: phoneme boundaries, syllable boundaries, phrase boundaries, secondary phrase boundaries		DCTTS[14] and Deep Voice 3[13]	WaveRNN
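
The Horizon row describes an input vector built by concatenating one-hot encodings of phonemes, tones, part-of-speech tags and prosodic boundaries with position information. A minimal sketch of that kind of encoding follows. The inventories, the function names (`one_hot`, `encode_unit`) and the single scalar position feature are all hypothetical and chosen only for illustration; the table does not give the actual inventories, and this toy vector has 20 dimensions, not the 476 reported by the team.

```python
# Hypothetical inventories (assumptions for illustration only).
PHONEMES = ["sil", "n", "i", "h", "ao"]
TONES = [0, 1, 2, 3, 4, 5]
POS_TAGS = ["noun", "verb", "adj", "other"]
BOUNDARIES = ["phoneme", "syllable", "phrase", "secondary_phrase"]


def one_hot(value, inventory):
    """Return a one-hot list marking the position of value in inventory."""
    if value not in inventory:
        raise ValueError(f"{value!r} is not in the inventory")
    return [1.0 if item == value else 0.0 for item in inventory]


def encode_unit(phoneme, tone, pos, boundary, position):
    """Concatenate the one-hot fields and a relative position in [0, 1]."""
    return (
        one_hot(phoneme, PHONEMES)
        + one_hot(tone, TONES)
        + one_hot(pos, POS_TAGS)
        + one_hot(boundary, BOUNDARIES)
        + [float(position)]
    )


vec = encode_unit("n", 3, "noun", "syllable", 0.25)
print(len(vec))  # 5 + 6 + 4 + 4 + 1 = 20
```

Each categorical field contributes exactly one active entry, so the vector length is the sum of the inventory sizes plus the number of position features.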