Voice Systems

This draft is out of date. An update that addresses new research and new technologies is being worked on at Conversational Voice Systems.

Introduction

Voice systems are systems that a user interacts with by listening to spoken prompts from an automated system. The user responds by either pressing keys on a telephone keypad or by speaking (or both). Voice systems are widespread in telephone self-service applications for customer support.

Many crucial systems depend on this technology, including emergency notification, healthcare appointment reminders, and prescription refilling. Full accessibility therefore needs to be supported.

Voice systems are often implemented with the W3C VoiceXML standard and supporting standards from the Voice Browser Working Group.

See [[voicexml20]] and [[voicexml21]]

However, issues of cognitive accessibility for voice systems apply regardless of whether a voice system is implemented using the W3C voice standards or a proprietary technology. It is impossible for a user to tell what technologies are used in the underlying voice platform, but the usability principles are the same whatever the underlying technology is.

An example use case may be as follows:

The user may be asked "For sports press 1, For weather press 2, For Stargazer astrophysics press 3." The system then waits for a response.
The VoiceXML 2.0 specification discusses accessibility for people who are hard of hearing, and cites the WCAG and WAI specifications as relevant (see [[VoiceXML2.0#accessibility]]). Beyond that, it identifies no examples or concerns for cognitive accessibility.

Challenges for People with Cognitive Disabilities

Voice technology can be very problematic for people with cognitive disabilities, due to its heavy demands on memory and on the ability to understand and produce speech in real time.

Effect of memory impairments on users' ability to understand and respond to prompts

A good working memory is essential for using menu-based systems that present several choices to the user and ask them to select one, whether by speaking or through a key press. The user needs to hold multiple pieces of transitory information in mind, such as the number that is being presented as an option, whilst processing the terms that follow.

A good short-term memory (several seconds) is essential so that the user can remember the number or the term.

Without these functions the user is likely to select the wrong number.

Executive function

Users need to be able to decide when to act on a menu choice. While a menu is being presented, should they wait to hear more options or should they select a choice that seems correct before hearing all the options?

Limitations of executive function may also cause problems when the system response is too slow. The user may not know whether their input has registered with the system, and consequently may press the key or speak again.

Effect of impaired reasoning

The user may need to compare similar options such as "billing", "accounts", and "sales" and decide which service is best suited to solve the issue at hand. Without strong reasoning skills the user is likely to select the wrong menu option.

Advertisements and additional, unrequested information also increase the amount of processing required.

Effect of attention related limitations

The user needs to focus on the different options and select the correct one. A person with impaired attention may have difficulty maintaining the necessary focus for a long or multi-level menu. Advertising and additional, unrequested information also make it harder to retain attention.

Effect of impaired language and auditory perception related functions

The user needs to interpret the correct terms and match them to their needs within a certain time limit. This involves speech perception and language understanding: sounds of language are heard, interpreted and understood, within a given time.

Effect of impaired speech and language production functions (for speech-recognition systems)

The user needs to be able to formulate a spoken response to the prompt before the system "times out" and generates another prompt. In the most common type of speech-recognition system (directed dialog) the user only needs to be able to speak a word or short phrase. However, some systems ("natural language systems") allow the user to describe their issue in detail. While this feature is an advantage for some users because it does not require them to remember menu options, it can be problematic for users with disorders like aphasia who have difficulty speaking.

Effect of reduced knowledge

The user needs to be familiar with the terms used in the menu, even if they are not relevant to the service options required.

Proposed solutions

Human backup

For users who are unable to use the automated system, it must be possible to reach a human, either in a call center or another operator, through an easy transfer process (that is, not by being directed to call another phone number).
There should be a reserved digit for requesting a human operator. The most common digit used for this purpose is "0"; however, if another digit is already in widespread use in a particular country, then that digit should always be available to get to a human agent. In particular, systems should not make it difficult for users to reach an agent by requiring complex digit combinations. This could be enforced by prohibiting implementations from giving the reserved digit any meaning other than going to an operator.
Other digits similarly could be used for specific reserved functions, keeping in mind that too many reserved digits will be confusing and difficult to learn. Remembering more than one or two reserved digits may be problematic for some users, but repeated verbal recitals of the reserved digits will also be distracting.
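
The reserved-digit rule above can be illustrated with a small sketch. This is only one possible way to enforce it, not a prescribed implementation: the names build_menu, route and ReservedDigitError are hypothetical, and "0" is assumed to be the reserved digit. The menu builder rejects any menu that tries to give the reserved digit another meaning, and then adds the operator route to every menu.

```python
RESERVED_OPERATOR_DIGIT = "0"  # assumption: "0" is the reserved digit in this deployment


class ReservedDigitError(ValueError):
    """Raised when a menu tries to rebind the reserved operator digit."""


def build_menu(options):
    """Validate a mapping of digit -> destination and add the operator route.

    Any menu that assigns the reserved digit itself is rejected, so the
    digit can never mean anything other than "go to a human operator".
    """
    if RESERVED_OPERATOR_DIGIT in options:
        raise ReservedDigitError(
            f"digit {RESERVED_OPERATOR_DIGIT!r} is reserved for the operator"
        )
    menu = dict(options)
    menu[RESERVED_OPERATOR_DIGIT] = "operator"
    return menu


def route(menu, digit):
    """Return the destination for a key press, or None if it is not in the menu."""
    return menu.get(digit)
```

With this approach a menu author never has to remember to add the operator option, and a menu that tries to reuse the digit fails when it is built rather than during a call.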

User settings

User-specific settings can be used to customize the voice user interface, keeping in mind that the available mechanisms for invoking user-specific settings are minimal in a voice interface (speech or DTMF tones). If it is difficult to set user preferences, they won't be used. Setting preferences by natural language is the most natural ("slow down!") but is not currently very common.

Extra time should be available as a user setting: the user should be able to request slower speech, more time to respond, or both.
Timed text should be adjustable (as with all accessible media).
The user should be able to extend or disable time-outs as a system default on their device.
Error recovery should be simple and should take the user to a human operator. An error response should not throw the user off the line or send them to a more complex menu. Preferably, error recovery should use a reserved digit.
Advertisements and other extraneous information should not be read, as they can confuse the user and make it harder to retain attention.
Terms used should be as simple as possible.
Examples and advice should be given on how to build a prompt that reduces the cognitive load.
1. Example 1: Reducing cognitive load: The prompt "press 1 for the secretary" requires the user to remember the digit 1 while interpreting the term "secretary". It is worse than the prompt "for the secretary (pause): press 1" or "for the secretary (pause) or for more help (pause): press 1".
2. Example 2: Setting the number 0 as the default for reaching a human operator.
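
The extra-time and time-out settings listed above could be modeled roughly as follows. This is a minimal sketch under stated assumptions: the UserSettings fields, the 5-second default time-out, the 0.25 rate step, and the 0.5 lower bound on speech rate are all illustrative values, not taken from any standard or platform. The apply_spoken_preference function shows how a spoken request such as "slow down!" might change a setting.

```python
from dataclasses import dataclass, replace
from typing import Optional

DEFAULT_TIMEOUT_SECONDS = 5.0  # assumed platform default, for illustration only
MIN_SPEECH_RATE = 0.5          # assumed lower bound on speech rate
RATE_STEP = 0.25               # assumed amount by which "slow down" reduces the rate


@dataclass(frozen=True)
class UserSettings:
    speech_rate: float = 1.0         # 1.0 = normal speed; below 1.0 is slower
    timeout_multiplier: float = 1.0  # above 1.0 gives the user more time to respond
    timeouts_disabled: bool = False  # True = wait indefinitely for input


def effective_timeout(settings: UserSettings,
                      base_seconds: float = DEFAULT_TIMEOUT_SECONDS) -> Optional[float]:
    """Return the input time-out in seconds, or None if time-outs are disabled."""
    if settings.timeouts_disabled:
        return None
    return base_seconds * settings.timeout_multiplier


def apply_spoken_preference(settings: UserSettings, utterance: str) -> UserSettings:
    """Return updated settings for a recognized spoken preference.

    Only "slow down" is handled here; anything else leaves the settings unchanged.
    """
    command = utterance.lower().strip(" !.")
    if command == "slow down":
        slower = max(MIN_SPEECH_RATE, settings.speech_rate - RATE_STEP)
        return replace(settings, speech_rate=slower)
    return settings
```

For example, a user who doubles the time-out multiplier would get 10 seconds to respond under the assumed 5-second default, and a user who disables time-outs would never be interrupted by a repeat prompt.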

Follow best practices in general VUI design

Standard best practices in voice user interface design apply to users with cognitive disabilities, and should be followed. A good reference is the Association for Voice Interaction Design Wiki [AVIxD]. Another good reference is [ETSI ETR 096]. Some examples of generally accepted best practices in voice user interface design are:

Pauses are important between phrases in order to allow processing time of language and options.
Options in text should be given before the digit to select, or the instruction to select that option. This means that the user does not need to remember the digit or instruction whilst processing the term. For example, the prompt "press 1 for the secretary" requires the user to remember the digit 1 while interpreting the term "secretary". A better prompt is "for the secretary (pause): press 1" or "for the secretary (pause) or for more help (pause): press 1".
Error recovery should be simple, and take the user to a human operator if the error persists. Error responses should not end the call or send the user to a more complex menu.
Advertisements and other extraneous information should not be read, as they can confuse the user and make it harder to retain attention.
Terms used should be as simple and jargon-free as possible.
Tapered prompts should be used to increase the level of prompt detail when the user does not respond as expected.
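
The option-before-digit practice described above can be sketched as a small prompt builder. The function names and the "(pause)" marker are illustrative assumptions; a real voice platform would insert a timed silence rather than the literal text.

```python
PAUSE = "(pause)"  # placeholder marker; a real platform would insert a timed silence


def option_first_prompt(label, digit):
    """Build one menu phrase that names the option before the digit to press."""
    return f"For {label} {PAUSE}: press {digit}."


def menu_prompt(options):
    """Join (label, digit) pairs into one menu prompt, one phrase per option."""
    return " ".join(option_first_prompt(label, digit) for label, digit in options)


print(menu_prompt([("sports", "1"), ("weather", "2")]))
# prints: For sports (pause): press 1. For weather (pause): press 2.
```

Generating every phrase from one template keeps the wording consistent across a menu, so the user hears the same pattern for every option.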

See the AVIxD wiki cited above for additional recommendations and detail.

Considerations for Speech Recognition

For systems based on speech recognition, an ETSI standard defines voice commands for many European languages and should be used where possible [ETSI 202 076], keeping in mind that expecting people to learn more than a few commands places a burden on the user.
Natural language understanding systems allow users to state their requests in their own words, and can be useful for users who have difficulty remembering menu options, or who have difficulty mapping the offered menu options to their goals. However, natural language interfaces can be difficult to use for users who have difficulty producing speech or language. Directed dialog (menu-based) fallback or transfer to an agent should be provided.
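
The fallback recommended above can be sketched as a simple escalation rule. This is an illustration only: the mode names, the next_mode function, and the threshold of two consecutive failures are assumptions, not values from any standard.

```python
# Escalation order: each mode falls back to the one after it.
ESCALATION = ["natural_language", "directed_dialog", "agent"]


def next_mode(mode, consecutive_failures, max_failures=2):
    """Return the interaction mode to use for the next turn.

    The caller stays in the current mode until the user has failed
    max_failures times in a row (an assumed threshold), then moves one
    step along ESCALATION. The final mode, a human agent, is never left.
    """
    if consecutive_failures < max_failures:
        return mode
    position = ESCALATION.index(mode)
    return ESCALATION[min(position + 1, len(ESCALATION) - 1)]
```

A caller would reset the failure count after each mode change, so that a user who struggles with open-ended speech gets a fair attempt at the menu before being transferred.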

Follow requirements of legislation

For example, the U.S. Telecommunications Act Section 255 Accessibility Guidelines [Section255] paragraph 1193.41 Input, control, and mechanical functions, clauses (g), (h) and (i) apply to cognitive disabilities. They require that equipment be operable without time-dependent controls, without the ability to speak, and by persons with limited cognitive skills.

Technology-based solutions

Recent developments in call center technology may be helpful for users with cognitive disabilities.

Visual IVR. When a call comes in on a smartphone, the system can ask the user if they want to switch over to a visual interface which mirrors the voice interface. This allows a user to see the prompts instead of having to remember them.
Adaptive voice interface. This is a technology that is sensitive to the user's behavior and changes the voice interface dynamically. For example, it can slow down or speed up to match the user's speech rate [Adaptive].
Tapered prompts. Best practices in voice user interface design include providing several different prompts for each point in the interaction. The different prompts are used based on the user's behavior. For example, if the user takes a long time to respond to a prompt, a simpler or more explanatory version of the prompt may be used instead of the default.
Human assistance. Although the user interacts normally with the voice system, in case the system is unable to process the user's speech, a human agent acts behind the scenes to perform the necessary processing. This would allow users with a limited ability to speak (whose speech might not be recognized by a speech recognizer) to interact with the system.
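
Tapered prompts, as described in the list above, can be sketched as an ordered list of prompts indexed by how many times the user has failed to respond as expected. The prompt wording and the select_prompt function are illustrative assumptions rather than recommended text.

```python
# Prompts ordered from the default to the most explanatory (illustrative wording).
TAPERED_PROMPTS = [
    "Which service would you like?",
    "You can say billing, accounts, or sales.",
    "For billing (pause): press 1. For accounts (pause): press 2. "
    "For sales (pause): press 3. To speak to a person (pause): press 0.",
]


def select_prompt(prompts, failed_attempts):
    """Pick the prompt for this turn based on earlier failed attempts.

    Zero failures gives the default prompt. Each further failure moves to
    the next, more detailed prompt, and the last prompt repeats once the
    list is exhausted.
    """
    index = max(0, min(failed_attempts, len(prompts) - 1))
    return prompts[index]
```

Here a failed attempt would be a time-out or an unrecognized response, and the final prompt also offers a human operator so that the user is never left without a way forward.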

Status of these solutions

Note. The above proposed solutions have been tested for users in the general population and have been shown to improve the usability of voice systems, although the extent to which they have been tested with users with cognitive disabilities is not clear.

Currently VoiceXML does not directly enforce accessibility for people with cognitive disabilities. However, a considerable literature on voice user interface design exists and is in many cases very applicable to cognitive accessibility for voice systems. Developers must become aware of these resources and of the need to design systems with these users in mind.

References

[AVIxD] The Association for Voice Interaction Design Wiki
[ETSI 202 076] ETSI ES 202 076 V2.1.1 (2009-06)
[ETSI ETR 096] ETSI ETR 096 Human factors guidelines for the design of minimum phone based user interface to computer services
[Section255] Telecommunications Act Section 255 Accessibility Guidelines
[Adaptive] Adaptive Voice White Paper