
How to design for voice UI

The rapidly growing field of voice UI requires us to design for interfaces that are much more auditory than they are visual.

Illustration by Vered Bloch.

Profile picture of Michael Craig

12.28.2020

8 min read

"Hey Siri, do I need an umbrella today?"


"Alexa, remind me to call mom tomorrow at 9 AM."


"Hey Google, what's the fastest way to get to the airport from here?"


Voice user interfaces (VUIs) have vastly improved the ways in which humans interact with computers. While voice UI has been around for some time, people have only recently begun to see how much a voice user interface can do.


Often, when we think of “user interface,” we think visually. We’ve grown familiar with screens we can swipe and buttons we can press. Voice UI, however, is not primarily visual. Instead, it allows us to interact with a machine using our voice.


Well-known companies like Apple, Amazon, and Google have made everyday tasks easier with simple voice commands such as the ones mentioned earlier. Other companies have started incorporating these existing virtual assistants in their own products or are creating their own cutting-edge voice UI.


When it comes to designing interfaces for web and mobile, there are many factors to keep in mind, such as who you’re designing for and how validating your design improves the user experience. Similarly, those designing voice user interfaces have to consider various factors, and work out how to overcome obstacles, in order to design a voice UI that feels like second nature.



Designing for humans


When designing a voice UI, it’s important to think about who will be interacting with your system.


How do they think?


How do they communicate on a day-to-day basis?


It's very possible that you have to design your interface with multiple audiences in mind.


For example, what if we were building a system that allowed people to book flights using speech? We would want to consider the steps required to accomplish this task on any platform. Once we understand the process, we can then apply it to a voice interface. Let’s book a flight from Atlanta, GA, to New York City.


That would involve the following steps:


  1. Choose dates to fly.

  2. Search for flights within the specified date range.

  3. Choose either one-way or round-trip.

  4. Choose the departing flight based on price and/or flight time.

  5. Choose the returning flight based on price and/or flight time.

  6. Choose flight or fare upgrades.

  7. Select trip protection.

  8. Confirm and pay.



A personal voice assistant in a home setting




Use natural language


Natural language is the ordinary speech that we use every day in conversation. It doesn’t involve any planning or premeditation. It comes to us naturally, and adopting it in voice interfaces allows for a more intuitive experience.


Since mastering natural language requires advanced computational linguistics and semantics, there are still many not-so-great examples out there. With many voicemail assistants, after you listen to a new voice message for the first time, you’re prompted:


“To hear the message again, say ‘Repeat.’ To reply to the message, say ‘Reply.’ To delete the message, say ‘Delete.’”


This interaction doesn’t use intuitive speech and can be confusing. Because the system is teaching commands on the fly, you have to stop and think about what you’re doing, even if you already know which action you want to perform.


In our previous flight-booking example, it’s evident that not all actions can be performed quickly, but we can make it easy for people to interact with and complete the process. How might the steps above translate to a natural human-to-computer interaction?


User: Book me a flight from Atlanta to New York City on August 4th.


System: Are you flying one-way or round trip?


User: Round trip.


System: Ok, I’ve found three great flights for you, the cheapest being $121 with a two-hour flight time. Would you like to book it?


User: Yes.


Notice that there are no special “voice commands” to start the interaction. It may be tempting to teach users certain verbiage, but, in reality, they will likely forget those, leading to frustration. Think about those terrible automated systems we can’t wait to bypass when calling a customer service line. Using natural language, both in the user initiating interactions and in the computer responses, leads to a truly instinctive experience.
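
One way to think about the exchange above is as "slot filling": the system extracts whatever details the person has already given and only asks about what is still missing. The following is a minimal sketch of that idea, not a real assistant API. The `BookingRequest` class, the `PROMPTS` wording, and the `next_prompt` function are all hypothetical names made up for illustration, and the speech recognition step itself is assumed to have already happened.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical follow-up questions, one per detail the booking needs.
PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where are you flying to?",
    "date": "What day would you like to leave?",
    "trip_type": "Are you flying one-way or round trip?",
}


@dataclass
class BookingRequest:
    """Details gathered so far; None means the person has not said it yet."""
    origin: Optional[str] = None
    destination: Optional[str] = None
    date: Optional[str] = None
    trip_type: Optional[str] = None


def next_prompt(request: BookingRequest) -> Optional[str]:
    """Return the question for the first missing detail, or None if complete."""
    for slot, prompt in PROMPTS.items():
        if getattr(request, slot) is None:
            return prompt
    return None


# "Book me a flight from Atlanta to New York City on August 4th."
request = BookingRequest(origin="Atlanta", destination="New York City", date="August 4th")
print(next_prompt(request))  # Are you flying one-way or round trip?
```

Because the person is never asked for something they already said, the conversation stays close to how two people would talk.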





Keep responses short and simple



Keeping with the flight example, we see that the system’s responses aren’t long. It’s usually best to keep phrases short and simple, so people aren’t overwhelmed with too much information at once. In reality, booking a flight may produce up to 50 flight options.


Trying to present them all would mean a never-ending voice interaction. Presenting only the top options, however, gives users just enough information to keep going. Try to limit every interaction to a maximum of two or three points, if possible.
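
As a rough illustration of trimming a long result list down to a short spoken response, here is a small sketch. The `Flight` type, the `summarize_flights` function, and the sample prices (other than the $121 fare from the example dialogue) are assumptions for demonstration only, and sorting by price is just one possible definition of "top options".

```python
from typing import NamedTuple, Sequence


class Flight(NamedTuple):
    price: int           # fare in dollars
    duration_hours: int  # flight time in whole hours


def summarize_flights(flights: Sequence[Flight], limit: int = 3) -> str:
    """Build a short spoken response from the cheapest few flights."""
    if not flights:
        return "I couldn't find any flights."
    # Keep only the cheapest `limit` results instead of reading out everything.
    top = sorted(flights, key=lambda flight: flight.price)[:limit]
    cheapest = top[0]
    noun = "flight" if len(top) == 1 else "flights"
    return (
        f"I've found {len(top)} great {noun} for you, the cheapest being "
        f"${cheapest.price} with a {cheapest.duration_hours}-hour flight time. "
        "Would you like to book it?"
    )


results = [Flight(300, 3), Flight(121, 2), Flight(150, 2), Flight(500, 5)]
print(summarize_flights(results))
```

The response mentions a count, a price, and a duration, which stays within the two-or-three-point limit no matter how many flights the search returns.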



Be helpful, even when you can’t help


Sometimes, accomplishing a specific task may not happen according to a person’s expectation. For instance, it’s possible there were no flights available for the selected dates.


Instead of abruptly ending the interaction by saying, “Sorry, there are no flights available,” try to adjust your response. This way, the system can search for nearby dates and present helpful suggestions to the user:


“There are no flights available for August 4th, but I did find flights for August 3rd and August 5th. Would you like to book one of these?”
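
A fallback like this could be sketched as follows. This is only an illustration under stated assumptions: `suggest_nearby_dates`, `no_flights_response`, and the `has_flights` callback are hypothetical names, the callback stands in for whatever flight search a real system would use, and the year 2021 is an arbitrary choice for the sample dates.

```python
from datetime import date, timedelta
from typing import Callable, List


def suggest_nearby_dates(
    requested: date,
    has_flights: Callable[[date], bool],
    window_days: int = 1,
) -> List[date]:
    """Return dates within `window_days` of the request that have flights."""
    offsets = [d for d in range(-window_days, window_days + 1) if d != 0]
    candidates = [requested + timedelta(days=d) for d in offsets]
    return [day for day in candidates if has_flights(day)]


def spoken_date(day: date) -> str:
    """Format a date the way it might be read aloud, e.g. 'August 4'."""
    return f"{day:%B} {day.day}"


def no_flights_response(requested: date, alternatives: List[date]) -> str:
    """Offer alternatives instead of ending the conversation abruptly."""
    if not alternatives:
        return (
            f"There are no flights available around {spoken_date(requested)}. "
            "Would you like to try a different date?"
        )
    # With the default one-day window there are at most two alternatives.
    options = " and ".join(spoken_date(day) for day in alternatives)
    return (
        f"There are no flights available for {spoken_date(requested)}, "
        f"but I did find flights for {options}. Would you like to book one of these?"
    )


# Pretend search results: flights exist only on August 3 and August 5.
available = {date(2021, 8, 3), date(2021, 8, 5)}
wanted = date(2021, 8, 4)
nearby = suggest_nearby_dates(wanted, lambda day: day in available)
print(no_flights_response(wanted, nearby))
```

Even the empty case ends with a question, so the person always has a next step.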



Consider technology constraints


Despite our best intentions and many rounds of research and user testing, voice interface design is still limited by technological constraints.


Will the computer system be able to recognize different accents, words, phrases, or even slang? Fortunately, advances in machine learning and natural language processing are steadily reducing challenges like these.


Recently, Microsoft, Amazon, and Intel invested in building more efficient processors specifically for voice-based applications. This technology would mean breakthroughs in performance for always-on voice devices.





Keep security and privacy in mind


While most voice interactions happen within the privacy of one’s home or vehicle, they will gradually take place in more and more public places like grocery stores, airports, and so forth.


In our flight-booking example above, what if the person were trying to complete the payment while around other people? How would the system even detect such situations? Or what if the computer system in question dealt with protected health information (PHI)?


These are all important scenarios to consider. Here are some best practices for ensuring users’ security and privacy:



A woman talking to her mobile phone


Conceal payment information


In our example of booking a flight, payment information could be masked, so that people would be comfortable completing their tasks using the voice interface. In visual interfaces, it’s quite common for credit cards and bank info to show only the last 4 digits, like so: **** **** **** 4576.


In voice UI, however, we could go a step further by nicknaming payment methods. This could be taken care of during the initial setup of your product. When it’s time to complete a payment, the interaction might proceed like so:


System: Just to confirm, you’re booking a two-hour flight from Atlanta, GA, to New York City for $121. Would you like to continue with your payment?


User: Yes.


System: Ok, should I use your default payment, “Mike’s Chase card”?


User: Yes.


System: Ok, you’re all set! Your itinerary and boarding info will be sent to your email.


Notice that payment information is never revealed, which means this interaction could take place even in a non-private setting.
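
To make the contrast between the two approaches concrete, here is a small sketch. The `PaymentMethod` class and both functions are hypothetical, and the card number is made up (only its last four digits match the masked example above). A real product would rely on its payment provider for storing card details rather than keeping them in plain text like this.

```python
from dataclasses import dataclass


@dataclass
class PaymentMethod:
    nickname: str     # chosen by the person during setup
    card_number: str  # never read aloud by the voice interface


def mask_card_number(card_number: str) -> str:
    """Visual-interface style masking: show only the last four digits."""
    digits = [ch for ch in card_number if ch.isdigit()]
    hidden = max(len(digits) - 4, 0)
    masked = "*" * hidden + "".join(digits[hidden:])
    # Regroup into blocks of four, e.g. "**** **** **** 4576".
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))


def payment_prompt(method: PaymentMethod) -> str:
    """Voice-interface confirmation that refers to the card only by nickname."""
    return f'Ok, should I use your default payment, "{method.nickname}"?'


card = PaymentMethod(nickname="Mike's Chase card", card_number="1234 5678 9012 4576")
print(mask_card_number(card.card_number))  # **** **** **** 4576
print(payment_prompt(card))
```

The spoken prompt contains no digits at all, which is what makes it safe to say out loud in a shared space.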



A time for everything


Depending on the setting, sometimes it just may not be appropriate for a person to use a voice UI in public. For instance, imagine a person who wants to schedule an appointment with their doctor. Even just mentioning the type of doctor or facility could expose private health matters. With this in mind, it’s important to think about how your voice UI will handle these situations. Can the interface determine when the setting is private? Can the voice UI pick up low voices or whispers?


Many of these factors will depend on the available technology, as mentioned earlier. While voice UIs certainly have their place, it may still be best to use visual elements to handle certain situations which require greater sensitivity and discretion.





When to reinforce voice


It should be said that voice interfaces are by no means a replacement for visual interfaces. Even as technology advances, we will continue to find that the two can complement each other rather than compete, leading to a better product. Depending on your system, there are times when it would be good to reinforce auditory content with visual elements.


Of course, this requires a good balance. How might we visually represent audible speech or content? How can we make seamless voice interfaces without relying too much on visual cues? When is it appropriate to reinforce audio? Many of the answers to these questions depend largely on the product you are building.



Managing files using voice commands, design by Gleb Kuznetsov



When an interaction is taking place (visual feedback)


Amazon Alexa and Apple’s Siri interfaces do a good job of giving visual feedback with subtle cues to let the person know that an auditory interaction is taking place.


Most of the time, the visual elements required are very minor. A little, however, goes a long way. The simple flashing of light or a pulsating icon lets the person know that:


  1. The device is working.

  2. The voice interface is responding to, or will soon respond to, what is said.

Visual feedback is especially important with voice chat apps or voice-enabled TV interfaces, where speech-to-text is rendered on screen. This also allows people to see if what they said registered properly.



Apple’s Siri interface accompanies an auditory interaction with visual feedback.


When voice is an accessory


Voice interfaces are being integrated into many systems. Sometimes the voice UI is an accessory that simply provides an alternative way to use a product. A good example of this is Siri on macOS or Cortana on Windows. Both operating systems are primarily graphical user interfaces, not meant to be used solely with voice. Their virtual assistants, however, do add unique ways for people to accomplish tasks.


When we ask Siri or Cortana, “What’s on my schedule today?”, we’re usually presented with what’s on our calendar. In this example, voice enhances the visual interface by allowing people to accomplish tasks more quickly. Instead of going through the process of opening the calendar to check their schedule, they can simply ask the voice assistant to do so for them.


A similar approach can be seen in the integration of voice into mobile apps. Here, the primary focus may be visual, but voice can enhance your customer’s experience. Repetitive or mundane tasks could easily be handled with voice interactions, allowing your customers to accomplish what they really need to quickly and easily.



Differentiating system vs. human


Technology has advanced to the point where we can almost have full-on conversations with computers. Chatbots within apps or virtual assistants can show these conversations in real time. It’s therefore good practice to visually differentiate audible speech coming from a computer system, versus human speech.


This can be done in the same manner as messaging apps. Using a different color, bold text, or other design features can ensure that people don’t get lost in the conversation.



Scheduling a meeting using voice commands, design by Denislav Jeliazkov



Designing voice UIs for those most in need


Perhaps those who could benefit from voice UIs the most are individuals with impairments, especially visual impairments. Voice UIs can greatly enhance the accessibility of our systems. While we’re certainly making strides, we still have some way to go before voice UIs deliver on that promise.


Recently, I was helping a friend of mine set up a mobile phone, and my attempts to make it usable for him were rather agonizing. The TalkBack feature, for example, while somewhat helpful, still requires that a person use visual elements and know where to tap on the screen in order to use it.


For my friend, this wasn’t possible, as he is blind. In situations like these, visual cues can’t be relied on. How can we make voice interfaces seamless and operational even for those with disabilities?


Since most phones have gyroscopes and accelerometers, it would be nice to have a visual impairment mode that detects when the phone is picked up and then prompts the person with something like the following:


“What would you like to do?” or “How can I help?”


From here, the person could then accomplish what they intended, whether that’s making a call, listening to their email, etc.


When designing for accessibility, audible cues become especially important. Using techniques similar to those that we’re already familiar with from different virtual assistants, we can design seamless interfaces that work well even without any visual representation.


The other best practices of voice UI design, like using natural language and keeping responses short and simple, are also a great way to ensure that those with visual impairments can interact with the systems we create.



Conclusion


Voice design has progressed considerably since the days of robotic-sounding screen readers and automated phone attendants. Soon, we can expect interacting with voice interfaces to feel as natural as interacting with visual ones.


While neither can replace the other, visual and audio design can complement one another and help all customers, even those with disabilities, to accomplish tasks instinctively. Being prepared to face and overcome new challenges will enable the design community to unlock the full potential of voice user interface.

