2020 has been an extremely challenging year. As I reflect on the recent months, it is deeply gratifying to see that Alexa is helping our customers in these difficult times by making it easier for them to connect with friends and family, stay informed, and find more reasons to smile. What makes me extremely proud is how our teams have found ways to invent on behalf of our customers.
Today, I am excited to show you the AI advancements bringing us closer to our long-term vision of making interactions with Alexa as simple as speaking to another person.
Humans are adaptive by nature. When we misunderstand someone during a conversation, we are able to quickly pick up on nuances in how they respond or ask clarifying questions. Alexa already uses similar self-learning to automatically correct her mistakes by learning from customer feedback. These feedback signals include vocal frustration such as "Alexa, that’s wrong," or an interruption such as "Alexa, stop." Once Alexa determines a particular action was unsatisfactory, she automatically corrects herself.
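To make the idea concrete, here is a minimal, purely illustrative sketch of how implicit negative feedback could be detected and queued for correction. The phrase list, function names, and correction queue are all assumptions for illustration; they are not Alexa's actual APIs.

```python
# Hypothetical sketch: treat follow-up utterances such as "Alexa, stop" or
# "that's wrong" as implicit negative feedback and queue the original
# interpretation so a learning job can correct it later.

NEGATIVE_FEEDBACK_PHRASES = {"that's wrong", "stop", "no, not that"}

def is_negative_feedback(utterance: str) -> bool:
    """Return True if the follow-up utterance signals dissatisfaction."""
    normalized = utterance.lower().strip()
    if normalized.startswith("alexa,"):
        normalized = normalized[len("alexa,"):].strip()
    return normalized in NEGATIVE_FEEDBACK_PHRASES

def record_feedback(previous_interpretation: dict, follow_up: str, correction_queue: list) -> None:
    """Queue an unsatisfactory interpretation for later automatic correction."""
    if is_negative_feedback(follow_up):
        correction_queue.append({
            "request": previous_interpretation["request"],
            "action_taken": previous_interpretation["action"],
            "label": "unsatisfactory",
        })

# Example: the user interrupted a response that was misinterpreted.
queue = []
record_feedback(
    {"request": "play hello by adele", "action": "played 'Hello' by Lionel Richie"},
    "Alexa, stop",
    queue,
)
print(queue)
```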
Today, we’re taking this self-learning a step further by giving you the ability to directly teach Alexa. This new capability helps Alexa get smarter by asking questions to fill gaps in her understanding—just like we do as humans.
To do this, Alexa uses machine learning to determine whether your request can be a trigger for a teachable moment. If so, she will ask you for information that helps her learn. As an example, when I'm reading the latest best seller, Alexa won't automatically know that my preferred setting is 40% brightness when I ask, "Alexa, set the light to Rohit's reading mode." With interactive teaching, Alexa learns these definitions and their associated actions instantly, and they are stored only for your account for future use.
We will start by making this teaching capability available for smart home devices in the coming months, and expand it to other areas in the future. This is an exciting step forward not just for Alexa, but for all AI services that rely on end users to teach them.
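Here is a small, hypothetical sketch of what such a teach-and-store flow could look like, assuming a per-account concept store. The function names and data shapes are illustrative only, not the production system.

```python
# Hypothetical sketch of interactive teaching: when a request contains a
# concept Alexa cannot resolve (e.g. "Rohit's reading mode"), ask a
# clarifying question, then store the learned definition per account.

personal_concepts = {}  # account_id -> {concept: action}

def resolve_or_teach(account_id: str, concept: str, known_concepts=personal_concepts):
    """Return a stored action for the concept, or a clarifying question if it is unknown."""
    learned = known_concepts.get(account_id, {})
    if concept in learned:
        return {"action": learned[concept]}
    # Teachable moment: the concept is unknown, so ask the user to define it.
    return {"clarify": f"What do you mean by '{concept}'?"}

def teach(account_id: str, concept: str, action: dict, known_concepts=personal_concepts):
    """Store the user's definition so it applies immediately, scoped to this account."""
    known_concepts.setdefault(account_id, {})[concept] = action

# Example: "Alexa, set the light to Rohit's reading mode"
print(resolve_or_teach("acct-1", "rohit's reading mode"))   # asks for a definition
teach("acct-1", "rohit's reading mode", {"device": "light", "brightness": 40})
print(resolve_or_teach("acct-1", "rohit's reading mode"))   # now resolves to 40% brightness
```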
Natural interactions with Alexa are not just about how well she understands your requests, but also how she responds to those requests. That's why last year we introduced Neural Text-to-Speech (NTTS), a technology that achieves a more natural-sounding, higher-quality voice by applying deep learning to generate the speech signal.
When we speak to a friend or family member, we pick up on verbal and non-verbal cues and adapt our responses accordingly. This is a challenge for an AI, but thanks to advances in our NTTS synthesis technology, Alexa will now adapt her responses to the context of the conversation by adjusting her tone, stressing certain words, and adding pauses and even breaths. We call this speaking style adaptation.
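The adaptation itself happens inside the NTTS model, but the kinds of adjustments it makes can be illustrated with standard SSML markup of the sort voice developers already write. The context-to-style mapping below is a hypothetical sketch, not how Alexa actually decides.

```python
# Illustrative only: map a simple conversational context to SSML prosody,
# emphasis, and pauses. The real speaking style adaptation is learned inside
# the NTTS model rather than rule-based like this.

def style_response(text: str, stressed_word: str, context: str) -> str:
    """Wrap a response in SSML that varies tone and pacing with the context."""
    if context == "excited":          # e.g. announcing good news
        rate, pitch, pause = "fast", "+10%", "200ms"
    elif context == "empathetic":     # e.g. delivering a disappointing answer
        rate, pitch, pause = "slow", "-10%", "500ms"
    else:                             # neutral default
        rate, pitch, pause = "medium", "+0%", "300ms"

    stressed = text.replace(
        stressed_word, f'<emphasis level="moderate">{stressed_word}</emphasis>'
    )
    return (
        f'<speak><prosody rate="{rate}" pitch="{pitch}">'
        f'{stressed}<break time="{pause}"/></prosody></speak>'
    )

print(style_response("Your team won the game tonight", "won", "excited"))
```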
I believe Alexa’s more natural and expressive responses will offer new and delightful experiences when we start rolling this out later this year.
The ability to interact with ambient devices by just saying "Alexa" is delightful. As proud as I am of being part of the team that made this piece of science fiction a reality six years ago, we recognize that these interactions are still far from human-to-human conversation. As humans, our conversations are free flowing: we interrupt each other, speak at the same time, and don't always use each other's names. All of this is very natural to us, but incredibly hard for an AI like Alexa.
Today, I am excited to introduce natural turn taking, a major step forward in conversational AI that allows you to speak to Alexa without using a wake word during the course of a conversation. This advancement allows you to interact at your own pace, even when multiple people are talking. You can say, "Alexa, join my conversation," and Alexa will join in to help you and a friend decide what pizza to order, or get a movie recommendation for a night at home with your family.
We need to solve several challenges to offer this experience. For example: Are people speaking to each other? Should Alexa join the conversation? And if she does, who should she respond to?
To address these challenges, we had to go beyond understanding just the words in a request and move to multi-sensory AI. Alexa will use a combination of acoustic (ambient sounds versus human speech), linguistic (questions directed to Alexa versus another human), and visual (looking at the device versus at each other) cues to determine that a particular interaction is directed at her. She will then use the context of the conversation to decide how to respond or which specific action to take.
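As a rough illustration of that cue fusion, assuming per-cue scores and hand-picked weights (both hypothetical), a device-directedness decision could look something like this sketch.

```python
# Hypothetical sketch: combine acoustic, linguistic, and visual cue scores
# into a single decision about whether an utterance is meant for Alexa.
# The weights and threshold are illustrative, not the production model.

from dataclasses import dataclass

@dataclass
class Cues:
    acoustic: float    # likelihood the audio is human speech aimed at the device
    linguistic: float  # likelihood the words form a request to Alexa, not another person
    visual: float      # likelihood the speaker is looking at the device

def is_device_directed(cues: Cues, threshold: float = 0.6) -> bool:
    """Fuse the three cue scores with fixed weights and compare to a threshold."""
    score = 0.4 * cues.acoustic + 0.4 * cues.linguistic + 0.2 * cues.visual
    return score >= threshold

# Two friends debating pizza toppings: a side remark vs. a question for Alexa.
print(is_device_directed(Cues(acoustic=0.9, linguistic=0.2, visual=0.1)))  # False
print(is_device_directed(Cues(acoustic=0.9, linguistic=0.8, visual=0.7)))  # True
```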
We look forward to bringing you this transformative capability next year.
I’m excited to introduce these advancements that demonstrate how we continue to make strides in conversational AI and advance the state-of-the-art in machine learning.
You can find additional details on the science behind these features in the following articles from the Amazon Science blog: