Accentron: Foreign Accent Conversion to Arbitrary Non-Native Speakers Using Zero-Shot Learning

Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna

Department of Computer Science and Engineering, Texas A&M University, USA

Accent Conversion Audio Samples

Dataset: CMU ARCTIC [1] and L2-ARCTIC [2]


Section 5.1.1: Comparison under the standard foreign accent conversion condition (both L1 and L2 speakers were seen during training)

We use BDL as the L1 speaker for all L2 speakers.

Systems:

L2 speaker Text L1 reference speech L2 reference speech Baseline1 Baseline2 Proposed
NJS I'll tell you, the librarian said with a brightening face.
TXHC But she had become an automaton.
YKWK At the best, they were necessary accessories.
ZHAA You were making them talk shop, Ruth charged him.

Section 5.1.2: Reverse foreign accent conversion under standard condition

Reverse foreign accent conversion: synthesize a voice that has the L1 speaker's voice quality but the L2 speaker's accent.

We use BDL as the L1 speaker for all L2 speakers.

Systems:

L2 speaker Text L1 reference speech L2 reference speech Proposed
NJS I'll tell you, the librarian said with a brightening face.
TXHC But she had become an automaton.
YKWK At the best, they were necessary accessories.
ZHAA You were making them talk shop, Ruth charged him.

Section 5.2.1: Comparison under different zero-shot foreign accent conversion conditions (L1 and/or L2 speakers were seen during training)

Systems:

Training configurations:

To guarantee that the test speakers were unseen during training, we trained four models using different training sets.

For unseen L1/L2 speakers, we used the 50 utterances from the test set to generate the accent/speaker embedding.
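The training-set construction described above can be sketched as follows. This is only an illustration, not the authors' code: it assumes the first letter of a condition name (SS, US, SU, UU) refers to the L1 speaker and the second to the L2 speaker, with "S" meaning seen and "U" meaning unseen. The function name `training_speakers` and the placeholder speakers `L1_HELDOUT`, `L2_A`, and `L2_B` are hypothetical; only BDL and the four L2 test speakers come from this page.

```python
from typing import Dict, Iterable, List, Tuple

CONDITIONS = ("SS", "US", "SU", "UU")  # first letter: L1 speaker, second letter: L2 speaker


def training_speakers(
    condition: str,
    l1_pool: Iterable[str],
    l2_pool: Iterable[str],
    held_out_l1: Iterable[str],
    held_out_l2: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return the (L1, L2) speakers allowed in training for one condition.

    A "U" in the condition name removes the corresponding held-out
    speakers from the training pool; an "S" leaves the pool untouched.
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    l1_train = set(l1_pool)
    l2_train = set(l2_pool)
    if condition[0] == "U":
        l1_train -= set(held_out_l1)
    if condition[1] == "U":
        l2_train -= set(held_out_l2)
    return sorted(l1_train), sorted(l2_train)


# Illustrative pools only: L1_HELDOUT, L2_A, and L2_B are placeholder names.
L1_POOL = ["BDL", "L1_HELDOUT"]
L2_POOL = ["NJS", "TXHC", "YKWK", "ZHAA", "L2_A", "L2_B"]
HELD_OUT_L1 = ["L1_HELDOUT"]
HELD_OUT_L2 = ["NJS", "TXHC", "YKWK", "ZHAA"]

# One training set per condition, i.e. four models in total.
splits: Dict[str, Tuple[List[str], List[str]]] = {
    c: training_speakers(c, L1_POOL, L2_POOL, HELD_OUT_L1, HELD_OUT_L2)
    for c in CONDITIONS
}
```

Under these assumptions, each condition yields its own training set, which is one way to arrive at four separately trained models.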

L2 speaker Text seen L1 reference (for condition SS/SU) unseen L1 reference speech (for condition US/UU) L2 reference speech Condition SS Condition US Condition SU Condition UU
NJS I'll tell you, the librarian said with a brightening face.
TXHC But she had become an automaton.
YKWK At the best, they were necessary accessories.
ZHAA You were making them talk shop, Ruth charged him.

Section 5.2.2: Influence of the number of available L2 utterances under zero-shot foreign accent conversion conditions

Systems and configurations:

We used Condition UU as the baseline in this experiment; it uses 50 test utterances to produce the speaker embedding during inference. We then reduced the number of utterances from 50 to 1 (50, 20, 10, 5, 1) and evaluated system performance at each setting.

As a reference, we also compared our proposed system with a finetuned system that used the same numbers of utterances.
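A minimal sketch of how a speaker embedding might be produced from a reduced number of utterances is shown below. The page does not say how per-utterance embeddings are combined, so averaging followed by length normalization is an assumption here, and the helper `mean_embedding` is hypothetical. Per-utterance embeddings are represented as plain lists of floats.

```python
import math
from typing import List, Sequence

UTTERANCE_COUNTS = (50, 20, 10, 5, 1)  # settings evaluated in this section


def mean_embedding(utterance_embeddings: Sequence[Sequence[float]], n: int) -> List[float]:
    """Average the first n per-utterance embeddings and scale to unit length.

    Averaging is an assumed pooling strategy, not one confirmed by the source.
    """
    if not 1 <= n <= len(utterance_embeddings):
        raise ValueError("n must be between 1 and the number of utterances")
    subset = utterance_embeddings[:n]
    dim = len(subset[0])
    mean = [sum(e[i] for e in subset) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero vector unchanged to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean


# Toy data: 50 two-dimensional embeddings standing in for real ones.
toy = [[1.0 + 0.01 * k, 2.0 - 0.01 * k] for k in range(50)]
embeddings_by_count = {n: mean_embedding(toy, n) for n in UTTERANCE_COUNTS}
```

With fewer utterances the average rests on less evidence, which is the effect this experiment probes by stepping the count down from 50 to 1.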

L2 speaker Text L1 reference speech L2 reference speech Number of utterances (Proposed and Finetune for each): 50 20 10 5 1
NJS I'll tell you, the librarian said with a brightening face.
TXHC But she had become an automaton.
YKWK At the best, they were necessary accessories.
ZHAA You were making them talk shop, Ruth charged him.

References

[1] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.

[2] G. Zhao, S. Sonsaat, A. O. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, et al., "L2-ARCTIC: A Non-Native English Speech Corpus," Perception Sensing Instrumentation Lab, 2018.

[3] G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams," Proc. Interspeech 2019, pp. 2843-2847, 2019.

[4] S. Liu, D. Wang, Y. Cao, et al., "End-to-end accent conversion without using native utterances," Proc. ICASSP 2020, pp. 6289-6293, 2020.