Google and DeepMind are teaching AI to sound like people
U.S. – Google Inc. (NASDAQ: GOOGL) is working with its subsidiary DeepMind to enhance its AI personal assistant. DeepMind is teaching the Assistant to sound human in a step-by-step process.
The subsidiary calls the software WaveNet. It is a new model for converting text to speech. Google says it can mimic any voice and that it sounds more natural than any other text-to-speech technology available today.
WaveNet generates sounds by sampling real human speech and creating audio waveforms similar to it. In Google’s test, both Mandarin Chinese and English listeners found the program more realistic than others, although they could still tell it was a machine.
Microsoft’s assistant, Cortana, and Apple’s Siri reply to questions with actual recordings of human voices, cut into pieces, rearranged and recombined. Apple and Microsoft have actors record massive amounts of dialogue to feed their assistants’ databases.
How does WaveNet work?
DeepMind offered an in-depth look at how the system works. To synthesize audio signals, the researchers start with convolutional neural networks. The network generates raw audio waveforms directly instead of piecing together recorded voices, which allows for more natural-sounding speech.
The project had a rough start. Raw audio typically contains 16,000 or more samples per second, which is a lot for a computer to process. It all came together when the engineers built a neural network that learns from actual human waveforms.
In essence, the researchers taught the machine to talk like people by training it on recorded human waveforms.
“Building up samples one step at a time like this is computationally expensive,” said a DeepMind spokesperson, “but we have found it essential for generating complex realistic-sounding audio.”
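To get a feel for why building audio one sample at a time is expensive, here is a minimal Python sketch. It is not DeepMind's model: the functions `causal_dilated_conv`, `generate` and `steps_needed`, the weights and the dilation values are all hypothetical, chosen only to illustrate the idea that each new sample is computed from the samples that came before it. The real WaveNet is a trained network that is far larger.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article


def causal_dilated_conv(signal, weights, dilation):
    """Output at time t mixes signal[t], signal[t - dilation], ... and never a future sample."""
    out = []
    for t in range(len(signal)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start count as silence
                total += w * signal[idx]
        out.append(total)
    return out


def generate(n_samples, weights, dilations, first_sample=0.5):
    """Toy generator (n_samples >= 1): every new sample is computed from all earlier ones."""
    audio = [first_sample]
    while len(audio) < n_samples:
        hidden = audio
        # Stack a few layers; made-up weights stand in for a trained network.
        for d in dilations:
            hidden = [math.tanh(v) for v in causal_dilated_conv(hidden, weights, d)]
        audio.append(hidden[-1])  # the prediction for the next time step
    return audio


def steps_needed(seconds):
    """Number of sequential generation steps for a clip of the given length."""
    return int(seconds * SAMPLE_RATE)


clip = generate(20, weights=[0.6, 0.3], dilations=[1, 2, 4])
print(len(clip), "toy samples generated")
print(steps_needed(1), "sequential steps for one second of audio")
```

Because every step has to wait for the one before it, the loop cannot simply be split across many processors, which is one way to read the spokesperson's remark about cost.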
However, AI has a long way to go before it talks exactly like people. The software might learn to imitate a language, but it knows nothing about the content. If someone feeds it senseless noise, it will produce a mysterious clip complete with pauses and breathing sounds.
Google’s system still relies on real voice input
Only this time, instead of chopping up different recordings to assemble a phrase, the software learns from them and creates new sounds of its own in a variety of voices.
Google ran blind tests in which around 500 listeners rated how realistic WaveNet sounds on a scale from one to five, with 1 being not realistic and 5 being close to natural.
WaveNet ended up with 4.21 points in U.S. English. Interestingly, even real human speech was not rated perfectly natural and got only 4.55 points.
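A score like the ones above is presumably an average of individual listener ratings. The short Python sketch below shows that arithmetic. The function name `mean_opinion_score` and the five sample ratings are invented for illustration and are not data from the test; only the 4.21 and 4.55 figures come from the article.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of 1-to-5 listener ratings, rounded to two decimals."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return round(mean(ratings), 2)


# Hypothetical ratings from five listeners, not real test data.
print(mean_opinion_score([4, 5, 4, 4, 4]))

# Gap between human speech and WaveNet, using the scores reported in the article.
print(round(4.55 - 4.21, 2))
```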
Source: The Verge