Section 5.1.1: Comparison under standard foreign accent conversion condition (L1 and L2 speakers were seen during training)
We use BDL as the L1 speaker for all L2 speakers.
Systems:
Baseline1 [3]: System from Zhao et al.
Baseline2 [4]: System from Liu et al.
Proposed: Proposed approach.
Columns: L2 speaker | Text | L1 reference speech | L2 reference speech | Baseline1 | Baseline2 | Proposed
NJS | I'll tell you, the librarian said with a brightening face.
TXHC | But she had become an automaton.
YKWK | At the best, they were necessary accessories.
ZHAA | You were making them talk shop, Ruth charged him.
Section 5.1.2: Reverse foreign accent conversion under standard condition
Reverse foreign accent conversion: Synthesize a voice that has the L1 speaker's voice quality but the L2 speaker's accent.
We use BDL as the L1 speaker for all L2 speakers.
Systems:
Proposed: Proposed approach.
Columns: L2 speaker | Text | L1 reference speech | L2 reference speech | Proposed
NJS | I'll tell you, the librarian said with a brightening face.
TXHC | But she had become an automaton.
YKWK | At the best, they were necessary accessories.
ZHAA | You were making them talk shop, Ruth charged him.
Section 5.2.1: Comparison under different zero-shot foreign accent conversion conditions (L1 and/or L2 speakers were seen during training)
Systems:
Condition SS: The L1 and L2 speakers were both seen during training.
Condition US: The L1 speaker was unseen during training, and the L2 speaker was seen during training.
Condition SU: The L1 speaker was seen during training, and the L2 speaker was unseen during training.
Condition UU: The L1 and L2 speakers were both unseen during training.
Training configurations:
To guarantee that the test speakers were unseen during training, we trained four models on different training sets.
In Condition SS, we used the same model as in the standard FAC condition.
In Condition US, we excluded CLB from the training set and used CLB as the test L1 speaker.
In Condition SU, we excluded the four test L2 speakers from the training set.
In Condition UU, we excluded both CLB and the four test L2 speakers from the training set, and we again used CLB as the test L1 speaker.
For unseen L1/L2 speakers, we used the 50 utterances from the test set to generate the accent/speaker embedding.
Columns: L2 speaker | Text | Seen L1 reference speech (for Conditions SS/SU) | Unseen L1 reference speech (for Conditions US/UU) | L2 reference speech | Condition SS | Condition US | Condition SU | Condition UU
NJS | I'll tell you, the librarian said with a brightening face.
TXHC | But she had become an automaton.
YKWK | At the best, they were necessary accessories.
ZHAA | You were making them talk shop, Ruth charged him.
Section 5.2.2: Influence of the number of available L2 utterances under zero-shot foreign accent conversion conditions
Systems and configurations:
We used Condition UU as the baseline in this experiment; it used 50 test utterances to produce the speaker embedding during inference. We reduced the number of utterances from 50 to 1 (50, 20, 10, 5, 1) and evaluated the system performance at each setting.
As a reference, we also compared the proposed system with a fine-tuned system at each number of utterances.
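One way to picture the reduced-data settings is to compute the speaker embedding from only the first n utterances. The sketch below assumes that the speaker embedding is the element-wise mean of per-utterance embeddings and that the subset is simply the first n utterances; neither detail is stated on this page, and the function name `mean_embedding` is hypothetical.

```python
from typing import List, Sequence

# Numbers of available L2 utterances evaluated in this section.
UTTERANCE_COUNTS = (50, 20, 10, 5, 1)


def mean_embedding(utterance_embeddings: Sequence[Sequence[float]], n: int) -> List[float]:
    """Average the first n per-utterance embeddings (assumed pooling rule)."""
    if not 1 <= n <= len(utterance_embeddings):
        raise ValueError("n must be between 1 and the number of utterances")
    subset = utterance_embeddings[:n]
    dim = len(subset[0])
    # Element-wise mean over the selected utterances.
    return [sum(vec[i] for vec in subset) / n for i in range(dim)]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings standing in for real per-utterance vectors.
    toy = [[float(i), 1.0] for i in range(50)]
    for n in UTTERANCE_COUNTS:
        print(n, mean_embedding(toy, n))
```

With fewer utterances the average rests on less evidence, which is the effect this experiment probes by stepping from 50 utterances down to 1.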
Columns: L2 speaker | Text | L1 reference | L2 reference speech | 50 utterances: Proposed, Finetune | 20 utterances: Proposed, Finetune | 10 utterances: Proposed, Finetune | 5 utterances: Proposed, Finetune | 1 utterance: Proposed, Finetune
NJS | I'll tell you, the librarian said with a brightening face.
TXHC | But she had become an automaton.
YKWK | At the best, they were necessary accessories.
ZHAA | You were making them talk shop, Ruth charged him.
References
[1] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[2] G. Zhao, S. Sonsaat, A. O. Silpachai, I. Lucic, E. Chukharev-Khudilaynen, J. Levis, et al., "L2-ARCTIC: A Non-Native English Speech Corpus," Perception Sensing Instrumentation Lab, 2018.
[3] G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams," Proc. Interspeech 2019, pp. 2843-2847, 2019.
[4] S. Liu, D. Wang, Y. Cao, et al., "End-to-end accent conversion without using native utterances," Proc. ICASSP 2020, pp. 6289-6293, 2020.