End-to-end TTS & VC
TTS
END-TO-END ADVERSARIAL TEXT-TO-SPEECH -- EATS ~ ICLR 2021 ~ DeepMind ~ repo
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech -- VITS ~ ICML 2021 ~ Kakao ~ repo
WAVE-TACOTRON: SPECTROGRAM-FREE END-TO-END TEXT-TO-SPEECH SYNTHESIS -- Wave-Tacotron ~ ICASSP 2021 ~ Google
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis -- WaveGrad 2 ~ Interspeech 2021 ~ Google (Heiga Zen) ~ repo
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture -- EFTS-Wav ~ ICML 2021 ~ PingAn (Chenfeng Miao) ~ repo
FASTSPEECH 2: FAST AND HIGH-QUALITY END-TO-END TEXT TO SPEECH -- FastSpeech 2 ~ ICLR 2021 ~ Zhejiang U & Microsoft ~ repo
Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech -- Reinforce-Aligner ~ Interspeech 2021 ~ Korea University
Model | Institution | Venue | Method | Principle | Field |
---|---|---|---|---|---|
EATS | DeepMind | ICLR 2021 | Aligner + GAN-TTS decoder | Adversarial network | GAN |
VITS | Kakao | ICML 2021 | Glow-TTS + HiFi-GAN | GAN + Flow + VAE | GAN, Flow |
Wave-Tacotron | Google | ICASSP 2021 | Tacotron + Flow | Seq2seq + Flow | Seq2seq, Flow |
WaveGrad 2 | Google (Heiga Zen) | Interspeech 2021 | Tacotron encoder + WaveGrad decoder | Diffusion | New |
EFTS-Wav | PingAn (Chenfeng Miao) | ICML 2021 | IMV aligner + MelGAN | IMV | Non-AR |
FastSpeech 2 | Zhejiang U & Microsoft | ICLR 2021 | Duration predictor + Waveform decoder | CNN wav decoder | Non-AR |
Reinforce-Aligner | Korea U | Interspeech 2021 | RL aligner + MRF | RL | RL |
VC
Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation -- Parrotron ~ Interspeech 2019 ~ Google ~ repo
Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels -- F0-VC ~ WIP ~ Sorbonne
NVC-Net: End-to-End Adversarial Voice Conversion -- NVC-Net ~ UR ~ Sony
FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION -- FRAGMENTVC ~ ICASSP 2021 ~ Audio2Mel ~ NTU (Hung-yi Lee) ~ repo
Vocoder-free End-to-End Voice Conversion with Transformer Network -- Transformer-VC ~ WIP ~ Raw_spectrum - to - raw_spectrum ~ KNU
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations -- S2VC ~ Interspeech 2021 ~ self-supervised ~ NTU (Hung-yi Lee) ~ repo
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion -- Blow ~ NeurIPS 2019 ~ flow-based end-to-end VC ~ Telefónica Research ~ repo
METRIC
SVSNet: An End-to-end Speaker Voice Similarity Assessment Model -- SVSNet ~ Similarity
Utilizing Self-supervised Representations for MOS Prediction -- MOS ~ NTU (Hung-yi Lee) ~ repo
MOSNet: Deep Learning-based Objective Assessment for Voice Conversion -- MOSNet ~ IISAS ~ repo
Deep Learning Based Assessment of Synthetic Speech Naturalness -- MOS ~ QUTTUB ~ repo
MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network -- MOS ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo
Others
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search -- LightSpeech ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo
Inspirations from ASR
Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models -- End2end ASR ~ Interspeech 2021 ~ Johns Hopkins University & Yahoo Japan & CMU
A Streaming End-to-End Framework For Spoken Language Understanding -- StreamSLU ~ IJCAI 2021 ~ University of Waterloo & Huawei Noah’s Ark Lab & Tsinghua University
Tips:
UR: Under Review
WIP: Work in Progress
KAIST: Korea Advanced Institute of Science and Technology
KNU: Kyungpook National University, South Korea
IISAS: Institute of Information Science, Academia Sinica, Taipei, Taiwan
QUTTUB: Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany