Accent Conversion Audio Samples

Dataset: CMU ARCTIC [1] and L2-ARCTIC [2]

Section 5.1.1: Comparison under the standard foreign accent conversion (FAC) condition (L1 and L2 speakers were seen during training)

We use BDL as the L1 speaker for all L2 speakers.

Systems:

Baseline1 [3]: System from Zhao et al.
Baseline2 [4]: System from Liu et al.
Proposed: Proposed approach.

L2 speaker	Text	L1 reference speech	L2 reference speech	Baseline1	Baseline2	Proposed
NJS	I'll tell you, the librarian said with a brightening face.
TXHC	But she had become an automaton.
YKWK	At the best, they were necessary accessories.
ZHAA	You were making them talk shop, Ruth charged him.

Section 5.1.2: Reverse foreign accent conversion under standard condition

Reverse foreign accent conversion: Synthesize speech that has the L1 speaker's voice quality but the L2 speaker's accent.

We use BDL as the L1 speaker for all L2 speakers.

Systems:

Proposed: Proposed approach.

L2 speaker	Text	L1 reference speech	L2 reference speech	Proposed
NJS	I'll tell you, the librarian said with a brightening face.
TXHC	But she had become an automaton.
YKWK	At the best, they were necessary accessories.
ZHAA	You were making them talk shop, Ruth charged him.

Section 5.2.1: Comparison under different zero-shot foreign accent conversion conditions (L1 and/or L2 speakers were seen during training)

Systems:

Condition SS: The L1 and L2 speakers were both seen during training.
Condition US: The L1 speaker was unseen during training, and the L2 speaker was seen during training.
Condition SU: The L1 speaker was seen during training, and the L2 speaker was unseen during training.
Condition UU: The L1 and L2 speakers were both unseen during training.
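
The two-letter condition names above follow a simple pattern: the first letter describes the L1 speaker and the second the L2 speaker, with S for seen and U for unseen. A minimal sketch of that naming rule is shown below; the function name `condition_label` is hypothetical and is not taken from the authors' code.

```python
def condition_label(l1_seen: bool, l2_seen: bool) -> str:
    """Return the two-letter condition name used on this page.

    Hypothetical helper for illustration only.
    First letter: L1 speaker, second letter: L2 speaker.
    "S" means seen during training, "U" means unseen.
    """
    return ("S" if l1_seen else "U") + ("S" if l2_seen else "U")


if __name__ == "__main__":
    # Enumerate all four combinations of seen/unseen speakers.
    for l1_seen in (True, False):
        for l2_seen in (True, False):
            print(l1_seen, l2_seen, condition_label(l1_seen, l2_seen))
```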

Training configurations:

To guarantee that the unseen test speakers were never encountered during training, we trained four models on different training sets.

In Condition SS, we used the same model as in the standard FAC condition.
In Condition US, we excluded the training set of CLB and used CLB as the testing L1 speaker.
In Condition SU, we excluded the training sets of the four testing L2 speakers.
In Condition UU, we excluded the training sets of CLB and the four testing L2 speakers, and again used CLB as the testing L1 speaker.
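
The exclusion rules listed above can be summarized as a small lookup. The sketch below is illustrative only: the helper name `excluded_speakers` is hypothetical, and it assumes the four testing L2 speakers are the ones shown in the sample tables on this page (NJS, TXHC, YKWK, ZHAA).

```python
# Assumption: the unseen testing L1 speaker is CLB (as stated above) and the
# four testing L2 speakers are those listed in the sample tables on this page.
UNSEEN_TEST_L1 = "CLB"
TEST_L2_SPEAKERS = ("NJS", "TXHC", "YKWK", "ZHAA")


def excluded_speakers(condition: str) -> set:
    """Return the speakers whose training data is left out for a condition.

    Hypothetical helper. `condition` is one of "SS", "US", "SU", "UU";
    the first letter refers to the L1 speaker, the second to the L2 speaker.
    """
    if condition not in ("SS", "US", "SU", "UU"):
        raise ValueError(f"unknown condition: {condition!r}")
    excluded = set()
    if condition[0] == "U":  # L1 speaker must be unseen
        excluded.add(UNSEEN_TEST_L1)
    if condition[1] == "U":  # L2 speakers must be unseen
        excluded.update(TEST_L2_SPEAKERS)
    return excluded


if __name__ == "__main__":
    for name in ("SS", "US", "SU", "UU"):
        print(name, sorted(excluded_speakers(name)))
```

Under these assumptions, Condition SS excludes no speakers, which matches reusing the model from the standard condition.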

For unseen L1/L2 speakers, we used the 50 utterances from the test set to generate the accent/speaker embedding.

L2 speaker	Text	seen L1 reference (for condition SS/SU)	unseen L1 reference speech (for condition US/UU)	L2 reference speech	Condition SS	Condition US	Condition SU	Condition UU
NJS	I'll tell you, the librarian said with a brightening face.
TXHC	But she had become an automaton.
YKWK	At the best, they were necessary accessories.
ZHAA	You were making them talk shop, Ruth charged him.

Section 5.2.2: Influence of the number of available L2 utterances under zero-shot foreign accent conversion conditions

Systems and configurations:

We used Condition UU as the baseline in this experiment; it uses 50 test utterances to produce the speaker embedding during inference. We then reduced the number of utterances from 50 to 1 (50, 20, 10, 5, 1) and evaluated the system's performance at each setting.

As a reference, we also compared our proposed system with a finetuned system given different numbers of utterances.
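
To make the utterance-count sweep concrete, the sketch below builds one embedding per setting from the first N reference utterances. This page does not state how per-utterance embeddings are combined, so the simple averaging used here is an assumption, and the names `UTTERANCE_COUNTS` and `mean_embedding` are hypothetical.

```python
from statistics import fmean

# Utterance counts evaluated in this section.
UTTERANCE_COUNTS = (50, 20, 10, 5, 1)


def mean_embedding(per_utterance_embeddings, n):
    """Average the first n per-utterance embeddings, dimension by dimension.

    Assumption: embeddings are combined by simple averaging; the actual
    system may combine them differently.
    """
    if not 1 <= n <= len(per_utterance_embeddings):
        raise ValueError("n must be between 1 and the number of utterances")
    subset = per_utterance_embeddings[:n]
    return [fmean(dim) for dim in zip(*subset)]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings standing in for 50 real utterances.
    toy = [[float(i), 2.0 * i] for i in range(50)]
    for n in UTTERANCE_COUNTS:
        print(n, mean_embedding(toy, n))
```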

L2 speaker	Text	50 utterances	20 utterances	10 utterances	5 utterances	1 utterance
		Proposed / Finetune	Proposed / Finetune	Proposed / Finetune	Proposed / Finetune	Proposed / Finetune
NJS	I'll tell you, the librarian said with a brightening face.
TXHC	But she had become an automaton.
YKWK	At the best, they were necessary accessories.
ZHAA	You were making them talk shop, Ruth charged him.

References

[1] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.

[2] G. Zhao, S. Sonsaat, A. O. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, et al., "L2-ARCTIC: A Non-Native English Speech Corpus," Perception Sensing Instrumentation Lab, 2018.

[3] G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams," Proc. Interspeech 2019, pp. 2843-2847, 2019.

[4] S. Liu, D. Wang, Y. Cao, et al., "End-to-end accent conversion without using native utterances," Proc. ICASSP 2020, pp. 6289-6293, 2020.

Accentron: Foreign Accent Conversion to Arbitrary Non-Native Speakers Using Zero-Shot Learning

Shaojin Ding, Guanlong Zhao, Ricardo Gutierrez-Osuna

Department of Computer Science and Engineering, Texas A&M University, USA
