Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion

Shaojin Ding, Ricardo Gutierrez-Osuna

Department of Computer Science and Engineering, Texas A&M University, USA

Voice Conversion Audio Samples

Dataset: CSTR VCTK dataset [1]

Note: the parallel source–target pairs below are provided only to ease comparison. Training neither required nor used a parallel corpus.


Systems compared: GLE (proposed), VQ-VAE [2], and PPG-GMM [3].

Male → Male

Source Target GLE (Proposed) VQ-VAE PPG-GMM
Sample 1
Sample 2

Male → Female

Source Target GLE (Proposed) VQ-VAE PPG-GMM
Sample 1
Sample 2

Female → Female

Source Target GLE (Proposed) VQ-VAE PPG-GMM
Sample 1
Sample 2

Female → Male

Source Target GLE (Proposed) VQ-VAE PPG-GMM
Sample 1
Sample 2

References

[1] Veaux, Christophe, Junichi Yamagishi, and Kirsten MacDonald. "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit." (2017).

[2] van den Oord, Aaron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural Discrete Representation Learning." Advances in Neural Information Processing Systems. 2017.

[3] Zhao, Guanlong, et al. "Accent Conversion Using Phonetic Posteriorgrams." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.