The way Alexa responds and sounds is critical for natural interaction and key to ensuring a delightful experience for customers. Alexa's speech output is produced by text-to-speech (TTS) technology that converts sequences of words into natural-sounding, intelligible audio responses. Since the first Amazon Echo and Alexa launched in November 2014, our TTS technology has enabled Alexa to select and string together short speech snippets (known as diphones) to form a word, phrase, or sentence delivered as a voice response to customers. We have continued to optimize our machine learning (ML) algorithms to determine which diphones to pick and how to string them together to form the most natural response.
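To make the idea of picking and stringing together diphones more concrete, here is a minimal Python sketch. It is an illustration only and not Alexa's actual implementation: the `Unit` layout, the `to_diphones` and `select_units` helpers, and the use of a simple pitch mismatch as the cost of joining two snippets are all assumptions made for this example. A production system would rely on recorded audio and learned costs.

```python
from typing import Dict, List, Tuple

# Hypothetical unit record: (unit id, pitch at start in Hz, pitch at end in Hz).
Unit = Tuple[str, float, float]


def to_diphones(phonemes: List[str]) -> List[str]:
    """Turn a phoneme sequence into overlapping diphone names, padded with silence."""
    padded = ["sil"] + phonemes + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def select_units(diphones: List[str], inventory: Dict[str, List[Unit]]) -> List[Unit]:
    """Pick one recorded unit per diphone so that the summed join cost is minimal.

    The join cost here is an assumed stand-in: the absolute pitch difference
    between the end of one unit and the start of the next. The search is a
    standard dynamic program (Viterbi) over the candidate units.
    """
    # Each entry holds (cost so far, best path ending in a given candidate).
    best = [(0.0, [unit]) for unit in inventory[diphones[0]]]
    for name in diphones[1:]:
        next_best = []
        for unit in inventory[name]:
            cost, path = min(
                ((c + abs(p[-1][2] - unit[1]), p) for c, p in best),
                key=lambda item: item[0],
            )
            next_best.append((cost, path + [unit]))
        best = next_best
    return min(best, key=lambda item: item[0])[1]


# Toy inventory with made-up pitch values for two phonemes.
inventory = {
    "sil-h": [("a", 100.0, 120.0), ("b", 100.0, 200.0)],
    "h-ay": [("c", 125.0, 130.0), ("d", 190.0, 100.0)],
    "ay-sil": [("e", 128.0, 90.0)],
}
chosen = select_units(to_diphones(["h", "ay"]), inventory)
print([unit[0] for unit in chosen])
```

In this toy setup the search prefers the snippets whose pitch lines up most smoothly at each join, which loosely mirrors the goal of stringing diphones together to form the most natural response.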
Recently, we made interactions with Alexa even more natural through the development of a new Neural TTS technology (NTTS). NTTS delivers a more natural-sounding voice and, depending on the context of your request, allows Alexa to adapt her speaking style as well. Just as humans vary their way of speaking based on the situation, our new TTS technology enables Alexa to deliver the day's news in a different speaking style from the one she uses when, for example, providing information from Wikipedia.
Deep neural networks power Alexa's more natural voice
To achieve Alexa's more natural-sounding and higher-quality voice, Amazon scientists took a completely new approach to speech synthesis called direct waveform modeling, which applies deep learning to produce the speech signal. Compared to previous TTS technologies, NTTS-produced speech has better intonation, emphasizes the right words in a sentence, and offers improved segmental quality.
Today we are taking the first step in adapting Alexa's speaking style based on the context of your request with the introduction of a newscaster speaking style.
For customers in the US, when you ask, "Alexa, what's the latest," Alexa will change her speaking style to be similar to how a professional newscaster delivers the news. Alexa's newscast voice knows which words or phrases should be emphasized for a more realistic delivery of the news.
Below is an audio sample of the previous technology, followed by one of the new newscaster voice.
You can also experience NTTS with the latest information from Wikipedia. For example, you can ask "Alexa, Wikipedia Nick Jonas" or any other subject or topic available on Wikipedia to hear Alexa answer your question in her neutral speaking style.
"The ability to teach Alexa to adapt her speaking style based on the context of the customer's request opens the possibility to deliver new and delightful experiences that were previously unthinkable," says Andrew Breen, Sr. Manager with the TTS Research team at Amazon. "We're thrilled that our customers will get to listen to news and Wikipedia information from Alexa in this new way."