The challenge is that it is easier to improve a device's speech than to enhance its capabilities, which means Amazon, Apple and Google want to improve how their products sound - but not too much too soon. JL
Liz Stinson reports in Wired:
Advanced language tags could do for computer-generated speech what punctuation and emoji did for text: increase its informational bandwidth. Intonation could lead to more efficient phrasing and less ambiguity. It could also give Alexa an emotional advantage over digital assistants from Apple and Google. (But) the wider the gap between how an assistant sounds and what it can do, the greater the distance between its abilities and what users expect from it.
Ask Alexa about the weather, and it’ll tell you it’s sunny and 75 in a pleasant monotone. Prompt it to tell you a joke, and it’ll offer a pun in its signature staccato. Suggest that it sing a song, and it’ll belt out an auto-tuned country ballad. Amazon’s virtual assistant boasts a number of clever, humanlike abilities—but, as its voice betrays, Alexa is still just a robot.
To help rid Alexa of its cyborgian lilt, Amazon recently upgraded its Speech Synthesis Markup Language (SSML) tags, which developers use to code more natural verbal patterns into Alexa’s skills, or apps. The new tags allow Alexa to do things like whisper, pause, bleep out expletives, and vary the speed, volume, emphasis, and pitch of its speech. This means Alexa and other digital assistants might soon sound less robotic and more human. But striking a balance between these two extremes remains a significant challenge for voice interaction designers, and raises important questions about what people really want from a virtual assistant.
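To make that concrete, here is a minimal sketch, not Amazon's actual skill code: it hand-builds an Alexa-style SSML response using the kinds of tags the article describes (whispering, pauses, bleeped expletives, prosody, and emphasis). The tag names are Alexa's documented SSML vocabulary; the sentences and the surrounding Python are invented for illustration.

```python
# Illustrative sketch only: an SSML string using the expressive tags
# described above, wrapped in the shape of an Alexa skill response.

ssml = (
    "<speak>"
    "Your package has shipped. "
    '<amazon:effect name="whispered">Keep it a secret.</amazon:effect> '
    '<break time="500ms"/>'
    'The courier wrote <say-as interpret-as="expletive">darn</say-as>, '
    '<prosody rate="slow" volume="loud" pitch="+10%">it is running late,</prosody> '
    'but it should still arrive <emphasis level="strong">tomorrow</emphasis>.'
    "</speak>"
)

# Response structure follows Alexa's outputSpeech format; everything else
# here is hypothetical.
response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {"type": "SSML", "ssml": ssml},
        "shouldEndSession": True,
    },
}

print(response["response"]["outputSpeech"]["ssml"])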
Talk This Way
Let’s dispense with the promising stuff first. Advanced language tags could do for computer-generated speech what punctuation and emoji did for text communications: increase its informational bandwidth. Simple markup language allows voice assistants to distinguish 1996 from 1,996, or a panda that eats shoots and leaves from one that eats, shoots, and leaves. Advanced tags allow them to convey much more. You know how you interpret the text message “sounds great” differently than “sounds great ;)”? The ability to intonate will make digital assistants capable of similarly nuanced expression.
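As a small, assumed illustration of that distinction, the snippet below shows the basic markup the paragraph describes; the say-as and break tags are standard SSML, while the sentences and variable names are invented for the example.

```python
# Illustrative only: basic SSML markup that steers how the same characters
# are read aloud.

# "1996" read as a year versus as a plain number.
year_vs_number = (
    "<speak>"
    'The film came out in <say-as interpret-as="date" format="y">1996</say-as>, '
    'and the hall seats <say-as interpret-as="cardinal">1996</say-as> people.'
    "</speak>"
)

# Explicit pauses stand in for the commas in the panda joke.
panda = (
    "<speak>"
    "A panda eats,"
    '<break strength="medium"/> shoots,'
    '<break strength="medium"/> and leaves.'
    "</speak>"
)

print(year_vs_number)
print(panda)
```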
A more nuanced assistant is arguably more helpful. “The musical elements of speech help you set expectations for what’s coming,” says Laura Wagner, a psycholinguist at Ohio State University. Intonation could lead to more efficient phrasing and less ambiguity. It could also give Alexa an emotional advantage over digital assistants from Apple and Google. “We’re going to love it more if it sounds human,” Wagner says. Evidence suggests that people feel more connected with objects capable of “contingent interaction,” the responsive back-and-forth of talking with another person. “The more human Alexa sounds, the more I’m going to want to trust her and use her,” Wagner says.
That, of course, explains why Amazon wants to make Alexa sound as human as possible.
Mind the (Expectation) Gap
But Amazon risks making Alexa sound too human, too soon. In February, the company unveiled “speechcons”—dozens of interjections like argh; cheerio; d’oh; and bazinga (no, really, bazinga) that Alexa enunciates more expressively than other words. Amazon wants to add a layer of personality to its virtual assistant, but quirks like that could make Alexa less useful.
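For reference, developers reach these speechcons through the same markup; the snippet below is an assumed illustration (the interjection value is Amazon's documented mechanism, the surrounding sentence is invented).

```python
# Illustrative only: a speechcon rendered via the interjection say-as value.
speechcon = (
    "<speak>"
    "I checked every calendar I could find. "
    '<say-as interpret-as="interjection">bazinga</say-as>! '
    "You are free on Friday."
    "</speak>"
)

print(speechcon)
```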
“If Alexa starts saying things like hmm and well, you’re going to say things like that back to her,” says Alan Black, a computer scientist at Carnegie Mellon who helped pioneer the use of speech synthesis markup tags in the 1990s. Humans tend to mimic conversational styles; make a digital assistant too casual, and people will reciprocate. “The cost of that is the assistant might not recognize what the user’s saying,” Black says. A voice assistant’s personality improving at the expense of its function is a tradeoff that user interface designers will increasingly wrestle with. “Do we want a personality to talk to or do we want a utility to give us information? I think in a lot of cases we want a utility to give us information,” says John Jones, who designs chatbots at the San Francisco design consultancy Fjord. Just because Alexa can drop colloquialisms and pop culture references doesn’t mean it should. Sometimes you simply want efficiency. A digital assistant should meet a direct command with a short reply, or perhaps silence—not booyah! (Another speechcon Amazon added.)
Personality and utility aren’t mutually exclusive, though. You’ve probably heard the design maxim form should follow function. Alexa has no physical form to speak of, but its purpose should inform its persona. But the comprehension skills of digital assistants remain too rudimentary to bridge these two ideals. “If the speech is very humanlike, it might lead users to think that all of the other aspects of the technology are very good as well,” says Michael McTear, coauthor of The Conversational Interface. The wider the gap between how an assistant sounds and what it can do, the greater the distance between its abilities and what users expect from it.
Tell Me What You Want
This raises an important question: What do people want from a virtual assistant? After all, the concerns of interaction designers should reflect those of users—but you wonder who benefits most from the changes they make. Amazon’s efforts to make Alexa sound as human as possible suggest that users expect their artificially intelligent sidekicks to do more than turn on their lights or provide a weather forecast. They want these devices to understand them. Connect with them. Maybe even—don’t laugh—date them.
But it would be naïve to ignore the motives of the companies building these products. Amazon wants to sell you things (after all, its design guidelines identify Alexa owners not as “users” but “customers”), and a more emotive assistant could be leveraged to that end. Amazon already tries to harvest sentiment from the voices of Alexa users; it stands to reason that an AI more capable of expressing emotions would also be more capable of analyzing—and manipulating—your own.
Creepy, yes, but also promising. Amazon might use Alexa’s expressiveness to sell you stuff, but social robots could use the same technology to deliver, say, better care to the elderly. As companies continue developing assistants that sound less mechanical, the line between utility and companionship will continue blurring. Will it reach the point where Alexa acts like an emotionally intelligent friend? Perhaps. Amazon remains some ways away from creating a virtual assistant that can anticipate your needs and desires; until then, it still faces plenty of unanswered questions whose answers will help shape how these assistants fit into your life.