海阔天空蓝

A speech-synthesis algorithm engineer who has dabbled in n different sports



All Papers on Zero/One/Few-shot TTS

I have been working on Few-shot TTS for a while and wanted a systematic view of how the field has developed, so here I collect all the papers I know of on Zero/One/Few-shot TTS and voice cloning.

Motivation: collect as little audio from the target speaker as possible, clone the speaker's timbre, and build a TTS engine; Chinese support is a key requirement.

GitHub List:

  1. [x] Real-time Chinese voice cloning: MockingBird Code
  2. [x] Neural Voice Cloning with a Few Samples (Sercan Ö. Arık, Baidu, 2018 Dec.) Code1 | Code2 | Paper
  3. [x] Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (Ye Jia, Google, 2019 Jan.) Code | Paper
  4. [x] One model to speak them all (Mutian He, HKUST & Microsoft, 2021 Jul.) Code | Paper
  5. [x] Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation (Dongchan Min, KAIST, 2021 Jun, ICML) Code | Paper
  6. [x] SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model (Edresson Casanova, University of São Paulo, 2021 Jun, Interspeech 2021)
  7. [x] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone (Edresson Casanova, University of São Paulo, 2022 Feb) Code | Paper
  8. [x] Zero-shot VITS Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus Code | Paper
  9. [x] AdaSpeech Code | Paper || AdaSpeech2 Code | Paper
  10. Deepfakes for Video Conferencing Using Generative Adversarial Networks (GANs) and Multilingual Voice Cloning Code
  11. [x] Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning (Rui Li, CloudMinds Inc, ICASSP 2022) Code | Paper
  12. [x] TorToiSe (jbetker, April 2022) Code | Intro | No training methodology
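Several of the transfer-learning entries above (notably #3, the SV2TTS line behind MockingBird) share one pipeline: a speaker encoder turns a few seconds of reference audio into a fixed-size embedding, and the synthesizer is conditioned on that embedding frame by frame. A minimal numpy sketch of that conditioning idea — both components here are toy stand-ins, not the papers' actual models:

```python
import numpy as np

def speaker_encoder(frames: np.ndarray) -> np.ndarray:
    """Toy stand-in for a speaker encoder (e.g. a GE2E d-vector model):
    average the frame features and L2-normalize, giving a fixed-size
    speaker embedding independent of utterance length."""
    emb = frames.mean(axis=0)
    return emb / np.linalg.norm(emb)

def condition_decoder(text_ids: np.ndarray, speaker_emb: np.ndarray) -> np.ndarray:
    """Toy stand-in for the synthesizer front half: the speaker embedding
    is concatenated onto every text-encoder frame, which is how
    SV2TTS-style models inject the target voice."""
    text_feats = np.eye(10)[text_ids]                   # fake text encoder (one-hot)
    cond = np.repeat(speaker_emb[None, :], len(text_ids), axis=0)
    return np.concatenate([text_feats, cond], axis=1)   # decoder input

# A few seconds of "reference audio" as random 40-dim frames.
rng = np.random.default_rng(0)
ref_frames = rng.standard_normal((200, 40))
emb = speaker_encoder(ref_frames)                       # one embedding per speaker
dec_in = condition_decoder(np.array([1, 4, 2]), emb)    # 3 frames, 10 + 40 dims
print(emb.shape, dec_in.shape)                          # (40,) (3, 50)
```

The point of the design is that only the cheap speaker encoder ever sees the target voice, so cloning a new speaker needs no gradient updates — which is exactly what makes these systems "zero-shot".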

Other Large Audio Models:
https://github.com/liusongxiang/Large-Audio-Models

Prompt-based Audio Synthesis

  • [x] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (2023), Kai Shen et al. [PDF]
  • [x] FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (2023), Ruiqing Xue et al. [PDF]
  • [x] VALL-E X: Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (2023), Ziqiang Zhang et al. [PDF]
  • Noise2Music: Text-conditioned Music Generation with Diffusion Models (2023), Qingqing Huang et al. [PDF]
  • [x] Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (2023), Eugene Kharitonov et al. [PDF]
  • MusicLM: Generating Music From Text (2023), Andrea Agostinelli et al. [PDF]
  • [x] InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023), Dongchao Yang et al. [PDF]
  • [x] AudioLDM: Text-to-Audio Generation with Latent Diffusion Models (2023), Haohe Liu et al. [PDF]
  • Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion (2023), Flavio Schneider et al. [PDF]
  • [x] Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models (2023), Rongjie Huang et al. [PDF]
  • [x] VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023), Chengyi Wang et al. [PDF]
  • [x] Diffsound: Discrete Diffusion Model for Text-to-sound Generation (2022), Dongchao Yang et al. [PDF]
  • [x] VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature (2022), Chenpeng Du. [PDF]
  • [x] DiscreTalk: Text-to-Speech as a Machine Translation Problem (2020), Tomoki Hayashi. [PDF]
  • [x] DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders (2022), Yanqing Liu [PDF]
  • [x] PromptTTS: Controllable Text-to-Speech with Text Descriptions (2022), Zhifang Guo [PDF]

Audio Language Models

  • [x] AudioLM: a Language Modeling Approach to Audio Generation (2022), Zalán Borsos et al. [PDF]

Audio SSL and UL models

  • [x] vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations (2019), Alexei Baevski et al. [PDF]
  • MuLan: A Joint Embedding of Music Audio and Natural Language (2022) Qingqing Huang et al. [PDF]
  • [x] W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training (2021) [PDF]
  • [x] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (2021), Wei-Ning Hsu et al. [PDF]
  • [x] wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations (2020), Alexei Baevski et al. [PDF]
  • [x] Data2vec: A general framework for self-supervised learning in speech, vision and language (2022), Alexei Baevski et al. [PDF]
  • [x] SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (2022), Junyi Ao. [PDF]
  • [x] CLAP: Learning Audio Concepts From Natural Language Supervision (2022), Benjamin Elizalde [PDF]
  • [x] AudioGen: Textually Guided Audio Generation (2023), Felix Kreuk. [PDF]
  • [x] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing (2022), Sanyuan Chen [PDF]
  • [x] Wav2vec: Unsupervised Pre-training for Speech Recognition (2019), Steffen Schneider [PDF]
  • [x] A Brief Overview of Unsupervised Neural Speech Representation Learning (2022), Lasse Borgholt [PDF]
  • [x] ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers (2022), Kaizhi Qian [PDF]
  • [x] IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion (2022), Wendong Gan [PDF]
  • [x] SoftVC: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion (2022), Benjamin van Niekerk [PDF]

Neural codec models

  • [x] High Fidelity Neural Audio Compression (2022), Alexandre Défossez [PDF]
  • [x] SoundStream: An End-to-End Neural Audio Codec (2021), Neil Zeghidour [PDF]
  • [x] HiFi-Codec: Group-residual Vector Quantization for High Fidelity Audio Codec (2023), Dongchao Yang [PDF]
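All three codecs above (SoundStream, Encodec, HiFi-Codec) are built on residual vector quantization: each quantizer stage encodes the residual left over by the previous stages, so a few small codebooks compose into a fine approximation, and the chosen indices become the "audio tokens" that models like VALL-E and AudioLM predict. A minimal numpy sketch of one encode/decode step, using random codebooks rather than trained ones:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes what the previous stages missed.
    Returns one code index per codebook plus the final residual."""
    residual = x.copy()
    codes = []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]             # pass the residual down
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruction is just the sum of the selected codewords."""
    return sum(cb[i] for i, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
dim, n_stages = 8, 4
codebooks = [rng.standard_normal((256, dim)) for _ in range(n_stages)]
x = rng.standard_normal(dim)

codes, residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
# Reconstruction error after all stages is exactly the final residual.
assert np.allclose(x - x_hat, residual)
print(codes)  # 4 indices in [0, 256), one per quantizer stage
```

With trained codebooks each stage shrinks the residual, so bitrate scales gracefully: dropping the last quantizers gives a coarser but still decodable signal, which is why these codecs can serve several bitrates from one model.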

System-level

  • [x] AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head (2023), Rongjie Huang [PDF]

No-Code List:

  1. Sample Efficient Adaptive Text-to-Speech (Yutian Chen, DeepMind & Google, 2019 Jan, ICLR 2019)

  2. Attentron: Few-shot Text-to-Speech Exploiting Attention-based Variable Length Embedding (Seungwoo Choi, Hyperconnect, 2020 Aug., Interspeech 2020)

  3. BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization (https://arxiv.org/abs/2002.01953) (Henry B. Moss, Amazon, 2020 Feb., ICASSP 2020)

  4. Expressive Neural Voice Cloning (Paarth Neekhara, UC San Diego, 2021 Jan, PMLR 2021)

  5. CUHK-EE Voice Cloning System for the ICASSP 2021 M2VoC Challenge (Daxin Tan, CUHK, 2021 Jul)

  6. Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech (Sung-Feng Huang, National Taiwan University, 2021 Nov, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

  7. Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning (Yu Zhang, Google, 2019 Jul.)

  8. The Multi-Speaker Multi-Style Voice Cloning Challenge 2021 (Qicong Xie, ASLP, 2021 Apr., ICASSP 2021)

  9. Dian: Duration Informed Auto-Regressive Network for Voice Cloning (Wei Song, JD, 2021 May, ICASSP 2021)

  10. Self-supervised learning for robust voice cloning (Konstantinos Klapsas, Innoetics, 2022 Apr, submitted to Interspeech 2022)

  11. Improve few-shot voice cloning using multi-modal learning (Haitong Zhang, NetEase Games, 2022 Mar, ICASSP 2022)

  12. Cloning one’s voice using very limited data in the wild (Dongyang Dai, SAMI, ByteDance, 2021 Oct)

  13. V2C: Visual Voice Cloning (Qi Chen, University of Adelaide (Australia) & South China University of Technology, 2021 Nov.)

Surveys:

  1. Voice Cloning (Saiesh Prabhu Verlekar, Shree Rayeshwar Institute of Engineering and Information Technology, Goa, 2022 Feb, IRJETS 2022)