



Interspeech 2021

Tsinghua University Shenzhen Graduate School - Prof. Zhiyong Wu (吴致勇) Team


Towards Multi-Scale Style Control for Expressive Speech Synthesis

VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis


Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion


Voting for the Right Answer: Adversarial Defense for Speaker Verification

Northwestern Polytechnical University - Prof. Lei Xie (谢磊) Team


Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis

Controllable Context-Aware Conversational Speech Synthesis

Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS


Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion

Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder


DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement

AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain

WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition

F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement

Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification

Microsoft - Xu Tan (谭旭) Team


Adaptive Text to Speech for Spontaneous Style



Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching

Google - Heiga Zen Team


WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling

PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS



Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation

National University of Singapore - Haizhou Li (李海洲) Team



Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training

Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation

Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability


Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding

Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement

Phonetically Motivated Self-Supervised Speech Representation Learning

Diagnosis of COVID-19 Using Auditory Acoustic Cues

Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers

Neural Speaker Extraction with Speaker-Speech Cross-Attention Network

GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification


Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification

Peking University Shenzhen Graduate School - Yuexian Zou (邹月娴) Team

Spoken Dialogue Systems

Self-Supervised Dialogue Learning for Spoken Conversational Question Answering

Semantic Transportation Prototypical Network for Few-Shot Intent Detection

SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification

Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification

Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering

Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting

USTC (University of Science and Technology of China) - Zhenhua Ling (凌震华) Team


UnitNet-Based Hybrid Speech Synthesis

A Neural-Network-Based Approach to Identifying Speakers in Novels


Adversarial Voice Conversion Against Neural Spoofing Detectors

National Taiwan University - Hung-yi Lee (李宏毅) Team


S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations


SUPERB: Speech Processing Universal PERformance Benchmark

Towards Lifelong Learning of End-to-End ASR

Auto-KWS 2021 Challenge: Task, Datasets, and Baselines

Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training

Voting for the Right Answer: Adversarial Defense for Speaker Verification


Utilizing Self-Supervised Representations for MOS Prediction

UoE (University of Edinburgh) - Simon King Team


Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech

Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis


ADEPT: A Dataset for Evaluating Prosody Transfer