Learning Structured Sparse Representations for Voice Conversion

Shaojin Ding, Guanlong Zhao, Christopher Liberatore, Ricardo Gutierrez-Osuna

Department of Computer Science and Engineering, Texas A&M University, USA

Voice Conversion Audio Samples

Dataset: CMU ARCTIC [1]


Systems compared:
- CSSR (Proposed)
- System 1
- System 2
- Baseline

Female → Female

Source Target CSSR (Proposed) System 1 System 2 Baseline
Sample 1
Sample 2
Sample 3

Female → Male

Source Target CSSR (Proposed) System 1 System 2 Baseline
Sample 1
Sample 2
Sample 3

Male → Male

Source Target CSSR (Proposed) System 1 System 2 Baseline
Sample 1
Sample 2
Sample 3

Male → Female

Source Target CSSR (Proposed) System 1 System 2 Baseline
Sample 1
Sample 2
Sample 3

References

[1] J. Kominek and A. W. Black, "The CMU Arctic speech databases," in Proc. 5th ISCA Workshop on Speech Synthesis, 2004.

[2] S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a phoneme-selective objective function," in Proc. Interspeech, 2018, pp. 476-480.

[3] S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured dictionaries for exemplar-based voice conversion," in Proc. Interspeech, 2018, pp. 481-485.

[4] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222-2235, 2007.