Wake Alexa up and start a chat with her. Simply say, “Alexa, let’s talk,” or “Alexa, let’s chat.”
In less time than it takes the average American hummingbird to beat its wings three times, Alexa woke up, heard your initial request, made sense of it, and formulated a response.
Here is part of what you may hear back: “Hey, Happy Friday, it’s getting chilly up in the clouds these days.”
That spoken response from Alexa is about three seconds long, and in the course of that time here’s a peek at what happened to bring it to you.
- In the clouds where Alexa "lives," 270 billion floating-point operations (FLOPS—or how computers handle math) were performed to generate the words, intonations, pauses, and emphases that make Alexa’s response more natural-sounding.
- In order for Alexa to respond to you in HD audio through the speaker, about 72,000 snippets of sound were strung together to form the phrase.
- You looked at your phone, computer, or smartwatch and confirmed that it was indeed Friday, and that fall weather had arrived.
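To put those figures in perspective, here is a quick back-of-envelope sketch in Python. It uses only the numbers quoted above (three seconds, 270 billion operations, 72,000 snippets); the variable names are made up for illustration, and reading a "snippet" as a single audio sample is an assumption, not something the Alexa team has stated.

```python
# Back-of-envelope check of the figures quoted in the list above.
RESPONSE_SECONDS = 3       # "about three seconds long"
TOTAL_OPERATIONS = 270e9   # 270 billion floating-point operations
AUDIO_SNIPPETS = 72_000    # snippets of sound strung together


def per_second(total: float, seconds: float) -> float:
    """Average amount of work per second of spoken audio."""
    return total / seconds


operations_per_second = per_second(TOTAL_OPERATIONS, RESPONSE_SECONDS)
snippets_per_second = per_second(AUDIO_SNIPPETS, RESPONSE_SECONDS)

print(f"{operations_per_second / 1e9:.0f} billion operations per second of speech")
print(f"{snippets_per_second:,.0f} snippets per second of speech")
```

That works out to roughly 90 billion operations and 24,000 snippets for every second Alexa speaks. If each snippet is one audio sample (again, an assumption), the second figure would correspond to a 24 kHz sampling rate.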
That all seems like a lot of work for one sentence, and frankly, it is. Now imagine tens of millions of Alexa customers making similar requests, billions of times every week. That is a massive amount of work. For the Alexa service, that work boils down to a great deal of computation using sophisticated machine learning models. And at the scale at which Alexa operates, those models and that computation have costs, both in time on processing units and in electricity to run and cool those units.
These costs were exactly what the Alexa team was running into about a year ago. The sheer computational load of making Alexa speak more naturally, and less like a machine stringing together words, was expensive.
Why human-sounding speech is tricky for machines
When you ask Alexa a question, the answer comes quickly. Her response seems akin to how humans respond when someone asks a question, but it isn’t (unless you write down both the question and your response, and then read it back). After the device detects the wake word, Alexa records your request (“play a song,” “turn on the lights,” “order a pizza”) and sends it to the cloud. Automatic speech recognition and natural language understanding models make sense of the request, then formulate an appropriate response.
But here's the tricky part: the words in that response from Alexa are generated as a string of text. As anyone who has learned another language knows, just reading the right words back in the right sequence isn’t really speech. It’s communication, but not a natural conversation.
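The flow described above (audio in, text out) can be pictured with a toy sketch. Everything in it is hypothetical: the function names, the `Intent` class, and the rules inside each stage are stand-ins invented for illustration and bear no relation to Alexa's actual models or APIs. The only thing it is meant to show is the order of the stages and the fact that the result is still plain text.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    """Hypothetical structured meaning extracted from a request."""
    name: str
    slots: dict = field(default_factory=dict)


def recognize_speech(audio: bytes) -> str:
    """Stand-in for automatic speech recognition (audio to transcript)."""
    return audio.decode("utf-8")  # toy: pretend the audio is already text


def understand(transcript: str) -> Intent:
    """Stand-in for natural language understanding (transcript to intent)."""
    if transcript.lower().startswith("play "):
        return Intent("PlayMusic", {"query": transcript[5:]})
    return Intent("Unknown")


def formulate_response(intent: Intent) -> str:
    """Produce the reply as a string of text, not yet as speech."""
    if intent.name == "PlayMusic":
        return f"Playing {intent.slots['query']}."
    return "Sorry, I didn't catch that."


def handle_request(audio: bytes) -> str:
    """Run the stages in the order the article describes."""
    transcript = recognize_speech(audio)
    intent = understand(transcript)
    return formulate_response(intent)


response_text = handle_request(b"play a song")
print(response_text)
```

Note what the sketch stops short of: `response_text` is only words on a page. Turning that string into audio with convincing intonation is a separate and far heavier step.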
Making Alexa’s speech sound more natural in any language, which is the Text-to-Speech part, is all about improving intonation. That means stressing the right words, at the right cadence, and in the right places. We do it without thinking, but try analyzing your own intonation and you can see how hard a problem it is to model precisely, and why it is the key to natural speech. As Alexa's Text-to-Speech team discovered, it’s also expensive.
Talk to me
The fix for the intonation-obsessed Text-to-Speech team came in the form of a new microprocessor, or chip, called AWS Inferentia, announced at the 2019 re:Invent conference. Designed by a team at Amazon Web Services (AWS), Inferentia is the first chip Amazon engineered specifically for running machine learning models. Almost a year after launch, the Inferentia chip is now the computational engine for the so-called “inference” part of the Alexa service, which is the vast majority of what Alexa does in fielding and responding to people’s questions and statements. Running on Amazon EC2 Inf1 instances (virtual servers, or compute engines, that are accessed in the cloud), this custom chip has lowered the cost of running inference on Alexa by almost one-third.
The cost savings provide two benefits:
- Alexa can use less energy to complete these computationally heavy tasks, saving resources.
- Alexa can deploy more sophisticated machine learning models, so conversations with Alexa get better and better.
Inferentia and Inf1 instances aren’t restricted to Text-to-Speech tasks, either. Any application that deals with large amounts of image, video, speech, or text data, and runs substantial machine learning on that data, can benefit from this chip innovation. AWS customers like Snap, Autodesk, and Anthem are already using Inferentia and Inf1 instances to run video and image analytics, language and text processing (like translation, search, and extracting sentiment), and to improve recommendation engines. The upshot from this boost in machine learning muscle ranges from more song and movie recommendations that actually hit the mark for people, to building the kinds of complex models that could enable true autonomous driving.
Alexa customers will experience the Inferentia engine in better and more accurate voice assistant responses. And that tricky problem—natural intonation—will improve as Alexa’s text-to-speech models keep advancing. Your chats with Alexa will become increasingly natural, which is the whole point of the best technologies. It’s that three-punch combination of better, faster, lower cost.