“Beyond capturing several hours of high-quality audio that can be sliced and diced to create voice responses, developers face the challenge of getting the prosody – the patterns of stress and intonation in spoken language – just right,” Ghoshal reports. “That’s where machine learning comes in. With enough training data, it can help a text-to-speech system understand how to select segments of audio that pair well together to create natural-sounding responses.”
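For the curious, here is a rough sketch of what "selecting segments of audio that pair well together" can look like in code. It is a toy illustration, not Apple's implementation: the `Unit` class, the pitch-based `target_cost` and `join_cost` functions, and the `join_weight` parameter are all hypothetical stand-ins for the scores a trained model would supply. The search itself is a standard dynamic-programming pass that picks one recorded segment per slot so that the total cost is lowest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded audio segment (hypothetical, reduced to two pitch values)."""
    name: str
    start_pitch: float  # pitch at the start of the segment, in Hz
    end_pitch: float    # pitch at the end of the segment, in Hz


def target_cost(unit: Unit, wanted_pitch: float) -> float:
    # How far the segment's average pitch is from the prosody we want.
    # Assumption: a real system would use a learned model here.
    return abs((unit.start_pitch + unit.end_pitch) / 2 - wanted_pitch)


def join_cost(left: Unit, right: Unit) -> float:
    # How audible the seam would be if `right` were played after `left`.
    return abs(left.end_pitch - right.start_pitch)


def select_units(candidates: list[list[Unit]], targets: list[float],
                 join_weight: float = 1.0) -> list[Unit]:
    """Pick one unit per slot, minimizing target cost plus weighted join cost."""
    if not candidates or len(candidates) != len(targets):
        raise ValueError("need one target per non-empty candidate slot")
    if any(not slot for slot in candidates):
        raise ValueError("every slot needs at least one candidate")

    # costs[j] = cheapest total cost of any path ending in candidate j.
    costs = [target_cost(u, targets[0]) for u in candidates[0]]
    back: list[list[int]] = []  # back[i-1][j] = best predecessor of j in slot i

    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for unit in candidates[i]:
            best = min(range(len(prev)),
                       key=lambda j: costs[j] + join_weight * join_cost(prev[j], unit))
            new_costs.append(costs[best]
                             + join_weight * join_cost(prev[best], unit)
                             + target_cost(unit, targets[i]))
            pointers.append(best)
        costs = new_costs
        back.append(pointers)

    # Walk the back-pointers from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]


slots = [
    [Unit("A", 100, 100), Unit("B", 110, 130)],
    [Unit("C", 130, 130), Unit("D", 100, 100)],
]
print([u.name for u in select_units(slots, targets=[100, 130])])  # ['B', 'C']
```

Note that "A" is the best match for the first slot on its own, but "B" wins because it joins onto "C" without a jump in pitch. That trade-off between matching the desired prosody and hiding the seams is the part the quoted passage credits machine learning with getting right.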
“The results speak for themselves (ba dum tiss): Siri’s navigation instructions, responses to trivia questions and ‘request completed’ notifications sound a lot less robotic than they did two years ago,” Ghoshal reports. “You can hear them for yourself at the end of this paper from Apple.”
Read more in the full article here.
MacDailyNews Take: Listen to the iOS 9 and iOS 10 samples, then compare them with the new iOS 11 samples. The new Siri voice is a vast improvement!
Deep learning for Siri’s Voice: On-device deep mixture density networks for hybrid unit selection synthesis – August 23, 2017