End-to-end TTS & VC: A Paper Summary

Posted on 2021-07-20, updated on 2021-08-19, filed under Work

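In recent years, most TTS research has focused on multi-stage modeling and training: TTS is split into a text-processing front end, an acoustic model, and a vocoder, while VC is split into Audio2Mel and Mel2Mel stages. In my view, end-to-end models are where the TTS / VC field should focus. The main difficulty of the end-to-end task lies in merging the vocoder into the model, that is, in downsampling, feature extraction, and modeling over high-resolution speech samples. Several researchers have recently published papers on end-to-end TTS and end-to-end VC. This post collects and summarizes them, in order to push forward what end-to-end approaches can achieve in TTS and VC.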
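To make the resolution gap concrete, here is a minimal arithmetic sketch in Python. The sampling rate, hop length, and upsampling factors are illustrative assumptions of typical values, not figures taken from any of the papers listed here. It shows how many waveform samples a model has to produce for each frame-level feature, which is the gap an end-to-end model must bridge without a separate vocoder.

```python
import math


def frames_for(num_samples: int, hop_length: int) -> int:
    """Number of frame-level feature vectors needed to cover num_samples samples."""
    return math.ceil(num_samples / hop_length)


# Illustrative values only (assumptions, not from the summarized papers).
sample_rate = 22050              # waveform samples per second
hop_length = 256                 # waveform samples covered by one feature frame
upsample_factors = (8, 8, 2, 2)  # hypothetical per-layer upsampling in a decoder

# The product of the per-layer factors is the total frame-to-sample expansion.
total_upsampling = math.prod(upsample_factors)

seconds = 2.0
num_samples = int(sample_rate * seconds)
num_frames = frames_for(num_samples, hop_length)

print(f"{seconds} s of audio: {num_samples} samples, {num_frames} frames")
print(f"each frame expands to {total_upsampling} samples")
```

With these assumed values, the decoder's total upsampling equals the hop length, so every frame-level feature has to be expanded into a few hundred waveform samples.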

End-to-end TTS & VC

TTS

END-TO-END ADVERSARIAL TEXT-TO-SPEECH -- EATS ~ ICLR 2021 ~ DeepMind ~ repo

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech -- VITS ~ ICML 2021 ~ Kakao ~ repo

WAVE-TACOTRON: SPECTROGRAM-FREE END-TO-END TEXT-TO-SPEECH SYNTHESIS -- Wave-Tacotron ~ ICASSP 2021 ~ Google

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis -- WaveGrad 2 ~ Interspeech 2021 ~ Google (Heiga Zen) ~ repo

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture -- EFTS-Wav ~ ICML 2021 ~ PingAn (Chenfeng Miao) ~ repo

FASTSPEECH 2: FAST AND HIGH-QUALITY END-TO-END TEXT TO SPEECH -- FastSpeech 2 ~ ICLR 2021 ~ Zhejiang U & Microsoft ~ repo

Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech -- Reinforce-Aligner ~ Interspeech 2021 ~ Korea University

Model	Institution	Venue	Method	Principle	Field
EATS	DeepMind	ICLR 2021	Aligner + GAN-TTS decoder	Adversarial network	GAN
VITS	Kakao	ICML 2021	Glow-TTS + HiFi-GAN	GAN + Flow + VAE	GAN, Flow
Wave-Tacotron	Google	ICASSP 2021	Tacotron + Flow	Seq2seq + Flow	Seq2seq, Flow
WaveGrad 2	Google (Heiga Zen)	Interspeech 2021	Tacotron encoder + WaveGrad decoder	Diffusion	New
EFTS-Wav	PingAn (Chenfeng Miao)	ICML 2021	IMV aligner + MelGAN	IMV	Non-AR
FastSpeech 2	Zhejiang U & Microsoft	ICLR 2021	Duration predictor + Waveform decoder	CNN wav decoder	Non-AR
Reinforce-Aligner	Korea U	Interspeech 2021	RL aligner + MRF	RL	RL

VC

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation -- Parrotron ~ Interspeech 2019 ~ Google ~ repo

Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels -- F0-VC ~ WIP ~ Sorbonne

NVC-Net: End-to-End Adversarial Voice Conversion -- NVC-Net ~ UR ~ Sony

FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION -- FRAGMENTVC ~ ICASSP 2021 ~ Audio2Mel ~ NTU (Hung-yi Lee) ~ repo

Vocoder-free End-to-End Voice Conversion with Transformer Network -- Transformer-VC ~ WIP ~ Raw_spectrum - to - raw_spectrum ~ KNU

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations -- S2VC ~ Interspeech 2021 ~ self-supervised ~ NTU (Hung-yi Lee) ~ repo

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion -- Blow ~ NeurIPS 2019 ~ flow-based end2end VC ~ Telefónica Research ~ repo

METRIC

SVSNet: An End-to-end Speaker Voice Similarity Assessment Model -- SVSNet ~ Similarity

Utilizing Self-supervised Representations for MOS Prediction -- MOS ~ NTU (Hung-yi Lee) ~ repo

MOSNet: Deep Learning-based Objective Assessment for Voice Conversion -- MOSNet ~ IISAS ~ repo

Deep Learning Based Assessment of Synthetic Speech Naturalness -- MOS ~ QUTTUB ~ repo

MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network -- MOS ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo

Others

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search -- LightSpeech ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo

Inspirations from ASR

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models -- End2end ASR ~ Interspeech 2021 ~ Johns Hopkins & Yahoo Japan & CMU

A Streaming End-to-End Framework For Spoken Language Understanding -- StreamSLU ~ IJCAI 2021 ~ University of Waterloo & Huawei Noah’s Ark Lab & Tsinghua University

Tips:

UR: Under Review
WIP: Work in Progress
KAIST: Korea Advanced Institute of Science and Technology
KNU: Kyungpook National University
IISAS: Institute of Information Science, Academia Sinica, Taipei, Taiwan
QUTTUB: Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany