It’s only a matter of time before you can no longer tell whether you are speaking with a human or a robot. Or perhaps that time is now. Google’s DeepMind team, the same group that created AlphaGo, which defeated one of the world’s best Go players, recently announced a new AI called WaveNet.
To build it, the WaveNet group fed a neural network raw audio waveforms recorded from real human speakers. Most current text-to-speech (TTS) systems use an approach called concatenative TTS, in which audio is generated by recombining fragments of recorded speech. WaveNet, on the other hand, models the raw waveform directly, which means working with around 16,000 samples per second.
WaveNet is a neural network trained on real waveforms. When “speaking,” it uses the statistics it has learned to predict each new audio sample from the ones that came before, building the sound piece by piece. In a recent post, DeepMind’s researchers wrote:
Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
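To get a feel for what “one step at a time” means, here is a minimal Python sketch of sample-by-sample generation. It is not DeepMind’s code: the function `toy_next_sample_distribution` is a made-up stand-in for the trained network, and the 256 amplitude levels are an assumption chosen for illustration. Only the loop structure, where each new sample depends on the ones already produced, reflects the idea described above.

```python
import math
import random

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article
LEVELS = 256          # assumption: each sample is one of 256 amplitude levels


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for the trained network.

    Returns one probability per amplitude level. This toy rule simply
    favors levels close to the previous sample so the waveform changes
    smoothly; a real model would learn its probabilities from recordings.
    """
    previous = history[-1] if history else LEVELS // 2
    weights = [math.exp(-abs(level - previous) / 4.0) for level in range(LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Build audio one sample at a time, feeding each choice back in."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(samples)
        # Draw the next sample at random, weighted by the probabilities.
        next_sample = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        samples.append(next_sample)
    return samples


# 10 milliseconds of audio already takes 160 sequential steps.
audio = generate(SAMPLE_RATE // 100)
```

Because every sample has to wait for the one before it, a single second of speech at this rate means 16,000 trips through the loop, which gives a sense of why the researchers call the approach computationally expensive.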
In blind tests with human listeners, DeepMind states that WaveNet closed the gap between existing TTS systems and real human speech by around 50%.