CSSR (Proposed): The method proposed in this paper.
System 1 [2]: The method we proposed in [2], which constructs structured dictionaries from phoneme labels during training and, at runtime, jointly optimizes the standard cost function and the Phoneme-Selective Objective Function (eq. (6)).
System 2 [3]: The method we proposed in [3], which learns the structured dictionary in the joint source-target space without supervision (i.e., without phoneme labels) during training and selects the most likely sub-dictionary [13] using the standard cost function (eq. (2)) at runtime.
Baseline [4]: A GMM-based VC method that models the joint distribution of source and target speech frames.
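The runtime sub-dictionary selection described for System 2 can be illustrated with a minimal sketch. It is not the implementation from [3]. It assumes the "standard cost function" is a squared reconstruction error plus an L1 penalty on nonnegative activations, solved with multiplicative updates, and that features are nonnegative (e.g., magnitude spectra). The names `sparse_code`, `cost`, `convert_frame`, and the weight `lam` are hypothetical.

```python
import numpy as np

def sparse_code(x, A, lam=0.1, n_iter=200, eps=1e-12):
    """Nonnegative sparse coding of frame x over dictionary A.

    Minimizes 0.5 * ||x - A h||^2 + lam * sum(h) subject to h >= 0 using
    multiplicative updates (assumes x and A are nonnegative).
    """
    h = np.full(A.shape[1], 1.0 / A.shape[1])
    Atx = A.T @ x
    AtA = A.T @ A
    for _ in range(n_iter):
        h *= Atx / (AtA @ h + lam + eps)
    return h

def cost(x, A, h, lam=0.1):
    """Assumed cost: reconstruction error plus L1 sparsity penalty."""
    return 0.5 * np.sum((x - A @ h) ** 2) + lam * np.sum(h)

def convert_frame(x, src_dicts, tgt_dicts, lam=0.1):
    """Pick the sub-dictionary with the lowest cost and convert the frame.

    src_dicts[k] and tgt_dicts[k] are paired: column j of each holds the
    source and target exemplar of the same atom.
    """
    best = None
    for k, (A, B) in enumerate(zip(src_dicts, tgt_dicts)):
        h = sparse_code(x, A, lam)
        c = cost(x, A, h, lam)
        if best is None or c < best[0]:
            # Reuse the source activations with the target sub-dictionary.
            best = (c, k, B @ h)
    return best[1], best[2]
```

A toy call with two sub-dictionaries of two atoms each:

```python
I = np.eye(4)
src = [I[:, :2], I[:, 2:]]
tgt = [2.0 * I[:, :2], 2.0 * I[:, 2:]]
k, y = convert_frame(np.array([0.0, 0.0, 0.7, 0.3]), src, tgt)
```

Here the frame lies in the span of the second source sub-dictionary, so that sub-dictionary yields the lower cost and is the one used for conversion.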
Female → Female
Source | Target | CSSR (Proposed) | System 1 | System 2 | Baseline
Sample 1
Sample 2
Sample 3
Female → Male
Source | Target | CSSR (Proposed) | System 1 | System 2 | Baseline
Sample 1
Sample 2
Sample 3
Male → Male
Source | Target | CSSR (Proposed) | System 1 | System 2 | Baseline
Sample 1
Sample 2
Sample 3
Male → Female
Source | Target | CSSR (Proposed) | System 1 | System 2 | Baseline
Sample 1
Sample 2
Sample 3
References
[1] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Fifth ISCA Workshop on Speech Synthesis, 2004.
[2] S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving Sparse Representations in Exemplar-Based Voice Conversion with a Phoneme-Selective Objective Function," Proc. Interspeech 2018, pp. 476-480, 2018.
[3] S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Learning Structured Dictionaries for Exemplar-based Voice Conversion," Proc. Interspeech 2018, pp. 481-485, 2018.
[4] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 2222-2235, 2007.