Singing is an extraordinary human skill. It requires the ability to form words, then the ability to vocalize them at a certain pitch and finally the ability to synchronize this with the notes. For many, it comes naturally: humans seem to be hardwired for singing.
Not so machines. Teaching a computer to sing — to turn a musical score into vocalized song — has turned out to be hugely frustrating.
First, these devices must master the ability to turn text into speech, which is itself an ongoing challenge in computer science. They must then match the words to the notes at the level of syllables and even at the level of phonemes. Finally, these phonemes, syllables and words need to be vocalized at the correct pitch and for the right duration.
That turns out to be hard. Various groups around the world have attempted it, sometimes with impressive results. But in each case, the final output requires significant tweaking to achieve any level of realism.
Singing Bot
That could be about to change thanks to the work of Peiling Lu and colleagues at the Microsoft Software Technology Center Asia. The team has been working on a way to give the company’s chatbot, Xiaoice (pronounced Shao-ice), the ability to sing. The new singing bot is called XiaoiceSing, and the results are impressive.
First, some background. The task in “singing voice synthesis” is to turn a musical score into a voiced song that is indistinguishable from a human effort. Lu and colleagues point out that a score consists of the song lyrics, along with the song notes and note duration. For a professional human singer, it is straightforward to turn this written information into a song.
But for a computer, the task begins by translating the score into machine-readable form. XiaoiceSing does this by dividing the words into phonemes and then allocating a pitch and duration to each. This can be expressed in the form of a vector that a computer can “read.”
But this translation process is tricky. Every word is made of syllables, and these in turn are formed from phonemes. For example, the word “sing” is a single syllable made up of three phonemes.
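To make that concrete, here is a minimal Python sketch of how one sung word might be encoded. It is an illustration rather than XiaoiceSing's actual input format: the phoneme symbols, the ID table, the MIDI pitch numbering and the `Note` and `encode_note` names are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical phoneme inventory; a real system would cover a whole language.
PHONEME_IDS = {"s": 0, "ih": 1, "ng": 2}


@dataclass
class Note:
    lyric: str        # the word or syllable as written in the score
    phonemes: list    # its phonemes, in sung order
    midi_pitch: int   # assumed pitch encoding: MIDI number, 60 = middle C
    beats: float      # note length as written in the score


def encode_note(note, tempo_bpm):
    """Return one (phoneme_id, pitch, seconds) row per phoneme in the note."""
    seconds = note.beats * 60.0 / tempo_bpm
    # Every phoneme inherits the pitch and the full length of its note.
    return [(PHONEME_IDS[p], note.midi_pitch, seconds) for p in note.phonemes]


sing = Note(lyric="sing", phonemes=["s", "ih", "ng"], midi_pitch=60, beats=2.0)
rows = encode_note(sing, tempo_bpm=120)
print(rows)  # [(0, 60, 1.0), (1, 60, 1.0), (2, 60, 1.0)]
```

Each row repeats the note's pitch and its full length. How that length should be shared among the three phonemes is a separate question.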
The score could suggest the entire word be sung for several beats. But the problem for XiaoiceSing is to divide those beats between the phonemes. Should it place equal emphasis on each phoneme, or more on the middle or final phonemes?
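One simple way to explore that question is to share a note's length among its phonemes according to a set of weights. The Python sketch below does this. The weights are hand-picked for illustration, the frame count is hypothetical, and none of this is claimed to be how XiaoiceSing itself decides.

```python
def split_duration(total_frames, weights):
    """Share total_frames among phonemes in proportion to weights.

    The returned frame counts always add up to total_frames exactly.
    """
    total_weight = sum(weights)
    exact = [total_frames * w / total_weight for w in weights]
    frames = [int(x) for x in exact]  # round each share down first
    # Hand the leftover frames to the phonemes that lost most in rounding.
    leftover = total_frames - sum(frames)
    by_remainder = sorted(
        range(len(weights)), key=lambda i: exact[i] - frames[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        frames[i] += 1
    return frames


# "sing" held for 100 frames: equal emphasis versus most time on the vowel.
equal = split_duration(100, [1, 1, 1])
vowel_heavy = split_duration(100, [1, 6, 3])
print(equal, vowel_heavy)
```

With the second set of weights, the vowel takes 60 of the 100 frames and the two consonants share the rest. Choosing those proportions well for every phoneme in every context is the hard part.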
Just as important are the pauses between notes when nothing is sung. The human ear is hugely sensitive to this pattern, which plays an important role in the rhythm of singing. That makes small differences generated by a machine all too obvious.
Then there is the problem of hitting the right note. When a human sings, the sound is made up of lots of frequencies. The combination of frequencies differs as the note and its quality change, for example when singing different phonemes.
Fundamental Frequency
In general, the actual note is the lowest frequency sound — the fundamental frequency. This tends to be the loudest and the one the human ear most easily picks out. But the quality of the sound is determined by the other frequencies that form a kind of envelope around the fundamental frequency.
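The idea can be illustrated with a short Python sketch that builds two tones on the same fundamental but with different mixes of higher frequencies. Everything specific here is an assumption for illustration: the 8,000 Hz sample rate, the A4 = 440 Hz tuning, the restriction to exact multiples of the fundamental and the two amplitude lists.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def midi_to_hz(midi_note):
    """Equal-temperament pitch, with A4 (MIDI note 69) tuned to 440 Hz."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def synthesize(f0, amplitudes, n_samples=800):
    """Sum sine waves at f0, 2*f0, 3*f0, ... weighted by amplitudes."""
    return [
        sum(
            a * math.sin(2 * math.pi * f0 * (k + 1) * t / SAMPLE_RATE)
            for k, a in enumerate(amplitudes)
        )
        for t in range(n_samples)
    ]


def strength_at(signal, freq):
    """Amplitude of one frequency component (a single-bin Fourier transform)."""
    re = sum(x * math.cos(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
             for t, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)


f0 = midi_to_hz(69)                        # 440.0 Hz
mellow = synthesize(f0, [1.0, 0.5, 0.25])  # same note, one mix of overtones
bright = synthesize(f0, [1.0, 0.1, 0.6])   # same note, a different mix

for name, tone in (("mellow", mellow), ("bright", bright)):
    print(name, [round(strength_at(tone, f0 * k), 3) for k in (1, 2, 3)])
```

Both tones have their strongest component at 440 Hz, so both are heard as the same note. Only the balance of the higher frequencies differs, and that balance is what gives each tone its quality.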
The task of producing the correct envelope for a given phoneme and the correct pitch is far from easy. And any mistake gives the impression of singing out of tune.
Lu and colleagues tackle all this using a variety of machine learning techniques and applying them to proven technologies. For example, XiaoiceSing uses a text-to-speech system called FastSpeech, a technology that many of the team developed at Microsoft.
The output from FastSpeech must then be decoded and vocalized, or “vocoded.” And for this, XiaoiceSing uses a speech synthesis vocoder called WORLD, which must be trained to produce a humanlike sound.
All this training is done with a dataset of 2,297 Mandarin pop songs recorded by a female professional singer and then divided into 10-second sections. The machine essentially learns by associating the spectral features of the human song with the machine-readable score, and repeating this with over 10,000 samples from the dataset.
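Cutting a recording into sections of this kind can be sketched in a few lines of Python. This is a simplified illustration that cuts at fixed times. The function name and the 25-second example are assumptions, and a real pipeline might well cut at pauses or phrase boundaries instead.

```python
def segment_bounds(total_seconds, max_seconds=10.0):
    """Return (start, end) times that cover a recording in order,
    with no section longer than max_seconds."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


print(segment_bounds(25.0))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```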
Then, given a new score the machine has never seen, it can produce a humanlike output.
The results are impressive. Here is a short song sung by a human. And here is the same song produced by XiaoiceSing from the score.
Not bad! And, for comparison purposes, the team also output the same song using more conventional machine-learning techniques.
Judge for yourself. But in their own tests, which involved asking listeners which machine-sung version they preferred, XiaoiceSing repeatedly came out on top. The team has more examples here.
That sets up an interesting future for singing. Songs sung entirely by computer-generated characters are already a feature of certain pop scenes. But they are far from perfect. XiaoiceSing isn’t perfect either, but it is an interesting step forward. A potential pop star in the making?
Ref: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System arxiv.org/abs/2006.06261