海阔天空蓝

A speech synthesis algorithm engineer who has played n kinds of sports

End-to-end TTS & VC: Paper Summary

End-to-end TTS & VC

TTS

END-TO-END ADVERSARIAL TEXT-TO-SPEECH -- EATS ~ ICLR 2021 ~ DeepMind ~ repo

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech -- VITS ~ ICML 2021 ~ Kakao ~ repo

WAVE-TACOTRON: SPECTROGRAM-FREE END-TO-END TEXT-TO-SPEECH SYNTHESIS -- Wave-Tacotron ~ ICASSP 2021 ~ Google

WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis -- WaveGrad 2 ~ Interspeech 2021 ~ Google (Heiga Zen) ~ repo

EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture -- EFTS-Wav ~ ICML 2021 ~ PingAn (Chenfeng Miao) ~ repo

FASTSPEECH 2: FAST AND HIGH-QUALITY END-TO-END TEXT TO SPEECH -- FastSpeech 2 ~ ICLR 2021 ~ Zhejiang U & Microsoft ~ repo

Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech -- Reinforce-Aligner ~ Interspeech 2021 ~ Korea University

| Model | Institution | Published conference | Method | Principal | Field |
| --- | --- | --- | --- | --- | --- |
| EATS | DeepMind | ICLR 2021 | Aligner + GAN-TTS decoder | Adversarial network | GAN |
| VITS | Kakao | ICML 2021 | Glow-TTS + HiFi-GAN | GAN + Flow + VAE | GAN, Flow |
| Wave-Tacotron | Google | ICASSP 2021 | Tacotron + Flow | Seq2seq + Flow | Seq2seq, Flow |
| WaveGrad 2 | Google (Heiga Zen) | Interspeech 2021 | Tacotron encoder + WaveGrad decoder | Diffusion | New |
| EFTS-Wav | PingAn (Chenfeng Miao) | ICML 2021 | IMV aligner + MelGAN | IMV | Non-AR |
| FastSpeech 2 | Zhejiang U & Microsoft | ICLR 2021 | Duration predictor + Waveform decoder | CNN wav decoder | Non-AR |
| Reinforce-Aligner | Korea U | Interspeech 2021 | RL aligner + MRF | RL | RL |

VC

Parrotron: An End-to-End Speech-to-Speech Conversion Model and its Applications to Hearing-Impaired Speech and Speech Separation -- Parrotron ~ Interspeech 2019 ~ Google ~ repo

Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels -- F0-VC ~ WIP ~ Sorbonne

NVC-Net: End-to-End Adversarial Voice Conversion -- NVC-Net ~ UR ~ Sony

FRAGMENTVC: ANY-TO-ANY VOICE CONVERSION BY END-TO-END EXTRACTING AND FUSING FINE-GRAINED VOICE FRAGMENTS WITH ATTENTION -- FRAGMENTVC ~ ICASSP 2021 ~ Audio2Mel ~ NTU (Hung-yi Lee) ~ repo

Vocoder-free End-to-End Voice Conversion with Transformer Network -- Transformer-VC ~ WIP ~ Raw_spectrum - to - raw_spectrum ~ KNU

S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations -- S2VC ~ Interspeech 2021 ~ self-supervised ~ NTU (Hung-yi Lee) ~ repo

Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion -- Blow ~ NeurIPS 2019 ~ flow-based end-to-end VC ~ Telefónica Research ~ repo

METRIC

SVSNet: An End-to-end Speaker Voice Similarity Assessment Model -- SVSNet ~ Similarity

Utilizing Self-supervised Representations for MOS Prediction -- MOS ~ NTU (Hung-yi Lee) ~ repo

MOSNet: Deep Learning-based Objective Assessment for Voice Conversion -- MOSNet ~ IISAS ~ repo

Deep Learning Based Assessment of Synthetic Speech Naturalness -- MOS ~ QUTTUB ~ repo

MBNet: MOS Prediction for Synthesized Speech with Mean-Bias Network -- MOS ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo

Others

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search -- LightSpeech ~ ICASSP 2021 ~ USTC & Microsoft (Xu Tan) ~ repo

Inspirations from ASR

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models -- End2end ASR ~ Interspeech 2021 ~ Johns Hopkins & Yahoo Japan & CMU

A Streaming End-to-End Framework For Spoken Language Understanding -- StreamSLU ~ IJCAI 2021 ~ University of Waterloo & Huawei Noah’s Ark Lab & Tsinghua University

Tips:

  • UR: Under Review

  • WIP: Work in Progress

  • KAIST: Korea Advanced Institute of Science and Technology

  • KNU: Kyungpook National University

  • IISAS: Institute of Information Science, Academia Sinica, Taipei, Taiwan

  • QUTTUB: Quality and Usability Lab, Technische Universität Berlin, Berlin, Germany