NATIVE ACCENT SENSITIVE VOICE CLONING USING PAIRWISE RANKING BASED DECODER MODELS

Chetan Madan, Harshita Diddee, Deepika Kumar, Shilpa Gupta, Shivani Jindal, Mansi Lal, and Chiranjeev
