K. Lee and M. Cremer (USA)
singing voice detection, supervised learning, automatic labeling, MIDI, dynamic time warping
We present a novel approach to labeling a large amount of training data for vocal/non-vocal discrimination in musical audio with minimal human labor. To this end, we use MIDI files in which the vocal line is encoded on a separate channel, and synthesize them to create audio files. We then align the synthesized audio with the real recordings using the dynamic time warping (DTW) algorithm. The note onset/offset information encoded in the vocal lines of the MIDI files provides precise vocal/non-vocal boundaries, and from the minimum-cost alignment path we obtain the corresponding boundaries in the actual recordings. This nearly labor-free labeling process allows us to acquire a large training data set, and experiments using hidden Markov models as the classifier show promising results on an independent test set. We also show that the automatically generated labels are of usable quality: overall performance increases as more training data is added.
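The alignment and boundary-transfer steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `dtw_path` and `map_frames` are hypothetical, and the Euclidean local cost, the basic three-way step pattern, and the frame-level feature matrices are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def dtw_path(synth, real):
    """Return a minimum-cost DTW path as a list of (synth_frame, real_frame).

    synth, real: arrays of shape (n_frames, n_features).
    Assumption: Euclidean distance between frames as the local cost.
    """
    n, m = len(synth), len(real)
    # Local cost: pairwise distance between every synth and real frame.
    cost = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=2)

    # Accumulated cost with a padding row/column of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev

    # Backtrack from the end to the start along the cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda s: acc[s])
        path.append((i - 1, j - 1))
    return path[::-1]


def map_frames(path, synth_frames):
    """Map frame indices in the synthesized audio to recording frames.

    Each synth frame is mapped to the earliest recording frame aligned to it.
    """
    first_match = {}
    for i, j in path:
        first_match.setdefault(i, j)
    return [first_match[i] for i in synth_frames]


# Toy example: a one-dimensional "feature" that is 1 while the voice is active.
synth = np.array([0, 0, 1, 1, 0], dtype=float)[:, None]
real = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=float)[:, None]

path = dtw_path(synth, real)
# Vocal onset at synth frame 2 and offset at synth frame 4 (from the MIDI notes).
onset, offset = map_frames(path, [2, 4])
```

In practice the frame indices passed to `map_frames` would come from the note onset/offset times of the vocal channel, converted to frame numbers at the feature hop size. The quadratic loop is kept for readability and would be slow on full-length recordings.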