Machine learning algorithms excel at generating realistic photos, videos, and even voices. Last year, researchers at AI startup Dessa created a convincing fake audio file of popular American podcaster Joe Rogan’s voice. In an Instagram post, Rogan responded to the highly realistic spoof: “At this point I’ve long ago left enough content out there that they could basically have me saying anything they want…” Although few-shot training may be changing this, Rogan was not wrong about the large voice library he has generated. Generally speaking, in ML the more training data the better, and this is also the case in voice synthesis.
Although current machine learning techniques enable researchers to synthesize singing voices at a similarly high quality, existing singing-voice datasets typically feature only a single singer. In an effort to enrich resources for multispeaker singing-voice synthesis, a team of researchers from the University of Tokyo has developed a Japanese multispeaker singing-voice corpus. The project design and experimental analyses are presented in the paper JVS-MuSiC: Japanese Multispeaker Singing-Voice Corpus.
JVS-MuSiC includes singing voices from different singers and explores the synthesis of unique traits of singing voices. A total of 100 Japanese singers (49 males and 51 females) were recorded performing the popular Japanese children’s song Katatsumuri. Each singer also contributed another, different song. The keys and tempos of the 100 Katatsumuri versions were not consistent, as the singers were not asked to sing along with an example recording or to use a melody or tempo guide. By digitally adjusting the voices, the researchers were able to assign all the versions to a set of key and tempo groupings.
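As a rough illustration of what such an adjustment involves (not the researchers' actual procedure, which is not detailed here), the sketch below computes the pitch shift in semitones and the time-stretch factor needed to move one recording to a target key and tempo. The function names and the example singer's values are hypothetical.

```python
import math


def semitone_shift(source_hz: float, target_hz: float) -> float:
    """Semitones needed to move a tonic from source_hz to target_hz
    (12-tone equal temperament; positive means shift up)."""
    return 12.0 * math.log2(target_hz / source_hz)


def stretch_factor(source_bpm: float, target_bpm: float) -> float:
    """Duration multiplier for the recording: above 1 lengthens it
    (slower tempo), below 1 shortens it (faster tempo)."""
    return source_bpm / target_bpm


# Hypothetical singer: tonic at A3 (220 Hz), sung at 100 BPM.
# Hypothetical target group: tonic at C4 (about 261.63 Hz), 120 BPM.
shift = semitone_shift(220.0, 261.6256)
factor = stretch_factor(100.0, 120.0)
print(f"shift by {shift:+.2f} semitones, scale durations by {factor:.3f}")
```

In practice these two numbers would be handed to a pitch-shifting and time-stretching tool; which tool the researchers used is not stated here.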
The researchers analyzed the correlation between singing voice and other synthesis-related factors, confirming, for example, a moderate positive correlation between singing-voice similarity and compatibility for singing in unison: “a pair of singers with similar voices produces a united unison voice, which is often considered to sound beautiful.”
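To make the idea of such a correlation concrete, here is a minimal sketch of a Pearson correlation coefficient computed over pairs of singers. The scores below are made up purely for illustration and are not taken from the paper; the variable names and the choice of Pearson's coefficient are likewise assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five hypothetical singer pairs (illustration only):
# how similar the two voices sound, and how well they blend in unison.
similarity = [0.2, 0.4, 0.5, 0.7, 0.9]
unison_score = [2.8, 2.5, 3.4, 3.1, 3.9]

r = pearson(similarity, unison_score)
print(f"Pearson r = {r:.2f}")
```

A value of r near 1 would mean similar-sounding pairs almost always blend well, while a value near 0 would mean similarity says little about unison quality.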
The paper JVS-MuSiC: Japanese Multispeaker Singing-Voice Corpus is on arXiv, and the dataset can be downloaded here.
Journalist: Fangyu Cai | Editor: Michael Sarazen