Note: the parallel samples shown here are provided only for ease of comparison; training did not require or use a parallel corpus.
Systems:
GLE (Proposed): Group Latent Embedding within a Vector Quantized Variational Autoencoder (VQ-VAE) for voice conversion (the group-wise lookup is sketched in the code after this list).
VQ-VAE [2]: A conventional Vector Quantized Variational Autoencoder for voice conversion.
PPG-GMM [3]: Uses phonetic posteriorgrams to pair source and target frames, then applies joint Gaussian Mixture Models for voice conversion.
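For readers unfamiliar with the quantization step, here is a minimal NumPy sketch of the two-stage nearest-neighbour lookup that a group-structured VQ-VAE codebook implies: the encoder output is first matched to the nearest group, then to the nearest atom within that group. All names, shapes, and the choice of group centers are illustrative assumptions, not the implementation used in our experiments.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's settings).
rng = np.random.default_rng(0)
n_groups, atoms_per_group, dim = 8, 16, 64

# Codebook: n_groups groups, each holding atoms_per_group embedding atoms.
codebook = rng.normal(size=(n_groups, atoms_per_group, dim))
# One summary vector per group (here: the mean of its atoms; an assumption).
group_centers = codebook.mean(axis=1)                      # (n_groups, dim)

def quantize(z):
    """Map one encoder output frame z of shape (dim,) to its quantized embedding."""
    g = np.argmin(((group_centers - z) ** 2).sum(axis=1))  # nearest group
    a = np.argmin(((codebook[g] - z) ** 2).sum(axis=1))    # nearest atom in that group
    return codebook[g, a]

z_e = rng.normal(size=dim)   # stand-in for one encoder frame
z_q = quantize(z_e)          # discrete latent passed on to the decoder
print(z_q.shape)             # (64,)
```

In a plain VQ-VAE [2], the same step collapses to a single argmin over one flat codebook.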
Male → Male

|          | Source  | Target  | GLE (Proposed) | VQ-VAE  | PPG-GMM |
| Sample 1 | [audio] | [audio] | [audio]        | [audio] | [audio] |
| Sample 2 | [audio] | [audio] | [audio]        | [audio] | [audio] |
Male → Female

|          | Source  | Target  | GLE (Proposed) | VQ-VAE  | PPG-GMM |
| Sample 1 | [audio] | [audio] | [audio]        | [audio] | [audio] |
| Sample 2 | [audio] | [audio] | [audio]        | [audio] | [audio] |
Female → Female

|          | Source  | Target  | GLE (Proposed) | VQ-VAE  | PPG-GMM |
| Sample 1 | [audio] | [audio] | [audio]        | [audio] | [audio] |
| Sample 2 | [audio] | [audio] | [audio]        | [audio] | [audio] |
Female → Male

|          | Source  | Target  | GLE (Proposed) | VQ-VAE  | PPG-GMM |
| Sample 1 | [audio] | [audio] | [audio]        | [audio] | [audio] |
| Sample 2 | [audio] | [audio] | [audio]        | [audio] | [audio] |
References
[1] Veaux, Christophe, Junichi Yamagishi, and Kirsten MacDonald. "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit." 2017.
[2] van den Oord, Aaron, Oriol Vinyals, and Koray Kavukcuoglu. "Neural Discrete Representation Learning." Advances in Neural Information Processing Systems. 2017.
[3] Zhao, Guanlong, et al. "Accent Conversion Using Phonetic Posteriorgrams." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.