Intelligent Personal Assistant Architecture

Architecture and Potential for Standardization Version 1.3

Latest version: https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paArchitecture/paArchitecture-1-3.htm (GitHub repository); HTML rendered version
Last modified: April 4, 2024
Editors: Dirk Schnelle-Walka
Deborah Dahl, Conversational Technologies

Copyright © 2019-2024 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.

Abstract

This document describes a general architecture of Intelligent Personal Assistants and explores the potential for standardization. It is meant to be a first structured exploration of Intelligent Personal Assistants that identifies their components and the tasks of each component. Subsequent work is expected to detail the interaction among the identified components, their actual tasks, and how they ought to perform them. This document may need to be updated if that detailing work results in any changes. It extends and refines the description in the previous version, Architecture and Potential for Standardization Version 1.2. The changes primarily consist of clarifications and additional architectural details in new and expanded figures, including input and output data paths.

Status of This Document

This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction

Introduction
Problem Statement
Architecture
Error Handling
Use Case Walk Through
Potential for Standardization
Footnotes
Appendix
1. Acknowledgments
2. Abbreviations

1. Introduction

Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smart phones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which are available for these sorts of tasks without the need to pick up an explicit device and which can even control household appliances in our homes. As of today, there is no interoperability among the available IPA providers. For the exchange of learned user behaviors in particular, interoperability is unlikely to happen at all.

Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform.

This document is a first step in our strategy for IPA standardization. It describes a general architecture of IPAs and explores the potential areas for standardization. It focuses on voice as the major input modality. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment. The overall concept is not restricted to voice but also covers purely text based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In case of chatbots, speech components will be omitted. In case of multimodal interaction, interaction modalities may be extended by components to recognize input from the respective modality, transform it into something meaningful and vice-versa to generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.

2. Problem Statement

Currently, users are mainly using the IPA Provider that is shipped with a certain piece of hardware. Thus, selection of a smart phone manufacturer actually determines which IPA implementation they are using. Switching among different IPA providers also involves switching the manufacturer, which requires high costs and getting used to a new user interface specific to the new manufacturer. On the one hand users should have more freedom in selecting the IPA implementation they want. However, they are bound to use the service that is available in that implementation but which may not be what they necessarily prefer. On the other hand, IPA providers, which mainly produce the software, must also function as hardware manufacturers to be successful.

Moreover, we are also seeing the emergence of independent conversational agents, owned and operated by independent enterprises, and built either on white-label platforms or from best-of-breed components by third-party development agencies. This may largely free IPA development from hardware. Such a market transition creates an ever greater impetus for this work.

Finally, manufacturers also have to take care to port existing services to their platform. Standardization would clearly lower the needed efforts for porting and thus reduce costs. Additionally, it may also pave the way for interoperability among available IPA providers. Tasks may be transferred, partially or completely to other IPAs.

In order to explore the potential for standardization, a typical usage scenario is described in the following section.

2.1 Use Cases

This section describes potential usages of IPAs.

2.1.1 Travel Planning

A user would like to plan a trip to an international conference and she needs visa information and airline reservations. She will give the intelligent personal assistant (IPA) her visa information (her citizenship, where she is going, purpose of travel, etc.) and it will respond by telling her the documentation she needs, how long the process will take and what the cost will be. This may require the personal assistant to consult with an auxiliary web service or another personal assistant that knows about visas.

Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. She specifies her dates of travel and airline preferences and the IPA then interacts with her to find appropriate flights.

A similar process will be repeated if the user wants to book a hotel, find a rental car, or find out about local attractions in the destination city. Booking a hotel as part of attending a conference could also involve finding out about a designated conference hotel or special conference rates, which, again, could require interaction with the hotel's or the conference's IPA.

2.1.2 Emergency Events

A user encounters an emergency situation that requires them to use their hands, for example while administering medical care, driving or operating machinery. Manual interactions with control panels, keyboards or touch pads can impede life-saving activities and diminish focus while operating sensitive vehicles, devices and machinery. The user would benefit from a secure, interoperable, voice-interactive system that can be used to access necessary information while keeping hands free to perform these actions.

Examples of emergency applications include:

User interacts with a voice-activated GPS system while navigating evacuation routes and alternate travel routes in extreme weather conditions, which could include washed-out or flooded roadways, low visibility from smoke and haze, and other conditions requiring focused, manual control. The system has access to real-time weather and road condition databases and can query them by voice.
User interacts with a GPS system to privately and securely communicate their location to emergency services or other entities.
User encounters a choking victim and accesses audio-based emergency medical care instructions while providing life-saving care such as CPR or administering an EpiPen.
User accesses real time, audio translation/transcription services while caring for someone who speaks a different language.

All of these use cases benefit from voice interaction systems that have:

Both audio and visual output as well as other accessible, multimodal output formats.
Multiple ways to control (stop, start, go back, go forward, change rate of speed) either verbally or manually via GUI or physical control.
Ability to securely access information about the person receiving care such as age, medical history.
Interoperability with EHR systems (personal health information systems).
Conformance with health data privacy laws.

Interoperability:

How to Discover it (Where is it? Who produces it?)
How to Interact with it (What format is it, etc.)

2.2 Roles and Responsibilities

The following roles and responsibilities are identified according to the RACI (responsible, accountable, consulted, informed) model:

Role	R	A	C	I
Platform provider	x	x
Content Owner		x		x
Developer	x		x
Designer and Application Developer	x
System Integrator	x
User

Platform provider

Accountable and responsible for the operative performance of the infrastructure (uptime, security, and performance as measured against service-level agreements (SLAs) with clients, customers, and partners), inclusive of on-premises hardware and cloud services.

Content Owner

Accountable for the UX, content, and operational performance of any and all assistants that represent the brand and its services to brand constituents (including clients, customers, and internal stakeholders).
Example: a financial services enterprise, such as a bank.

Developer

Responsible to the content owner for the

selection of the hosting and infrastructure services
definition and development of the IPA
design and definition of IPA possibilities and basic functionalities: activation strategies, architecture tailoring, hardware specifications
may define and develop conversational content

Example: Most often, an independent enterprise specializing in conversational assistance.

Designer and Application Developer

Responsible to the content owner for

definition, design of the conversational interaction on behalf of a brand or client organization (Developer is consulted)
definition, development, editing of content on behalf of a brand or client organization
creating applications extending the basic functionalities of the IPA

System Integrator

Responsible to content owner for

Business process analysis: where, how conversational assistance will create value
Definition, development of business process transformation flow and interfaces -- where/how/through what knowledge is transmitted to action
Creation and integration of access for conversational assistant into necessary corporate data sources
Development of system/process ROI and NPV analysis of investment

User

Uses the IPA

3. Architecture

In order to cope with such use cases as those described above an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.

The architecture described in this document follows the SOLID principles introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution.

Single responsibility principle: The components should have only one clearly-defined responsibility.
Open closed principle: Components should be open for extension, but closed for modification.
Liskov substitution principle: Components may be replaced without impact on the basic system behavior.
Interface segregation principle: Many specific interfaces are better than one general-purpose interface.
Dependency inversion principle: High-level components should not depend on low-level components. Both should depend on their interfaces.
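
As a non-normative illustration of the dependency inversion and Liskov substitution principles, the following Python sketch lets a high-level component depend only on an interface, so that implementations can be exchanged. All names (`SpeechRecognizer`, `EchoRecognizer`, `UppercaseRecognizer`, `IpaPipeline`) are hypothetical and are not defined by this document.

```python
from typing import Protocol


class SpeechRecognizer(Protocol):
    """Hypothetical interface; not defined by this document."""

    def recognize(self, audio: bytes) -> str:
        ...


class EchoRecognizer:
    """Toy implementation that decodes the bytes as UTF-8 text."""

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UppercaseRecognizer:
    """Second toy implementation that can replace the first one."""

    def recognize(self, audio: bytes) -> str:
        return audio.decode("utf-8").upper()


class IpaPipeline:
    """High-level component that depends on the interface only."""

    def __init__(self, recognizer: SpeechRecognizer) -> None:
        self._recognizer = recognizer

    def handle(self, audio: bytes) -> str:
        return self._recognizer.recognize(audio)


# Either recognizer can be plugged in without changing IpaPipeline.
print(IpaPipeline(EchoRecognizer()).handle(b"turn on the light"))
print(IpaPipeline(UppercaseRecognizer()).handle(b"turn on the light"))
```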

This architecture aims to accommodate both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text to speech), and newer approaches utilizing generative AI. This architecture does not rule out combining some of these components in specific systems.

The following figure 1 shows the basic architecture of an IPA

Fig. 1 Basic architecture of an IPA

Both architectures aim at serving, among others, the following most popular high-level use cases for IPAs

Question Answering or Information Retrieval
Executing local and/or remote services to accomplish tasks

This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible to include other IPAs, with the same architecture, and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.

Not all components may be needed for actual implementations, some may be omitted completely. However, we note them here to provide a more complete picture. This architecture comprises three layers that are detailed in the following sections

Client Layer
Dialog Layer
External Data / Services / IPA Providers

Actual implementations may want to distinguish more than these layers. The assignment to the layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This view only reflects what the Community Group regards as ideal and shows the intended separation of concerns.

3.1 Client Layer

The Client Layer contains the main components that interface with the user. The following figure details the view onto the Client Layer shown in Figure 1.

3.1.1 Capture

Capture devices or modality recognizers are used to capture multimodal user input, such as voice or text input. Additional input modalities can be employed that capture input with specific modality recognizers. Additional input may be gathered from Local Data Providers.

3.1.1.1 Microphone

The microphone is used to capture the voice input of a user as a primary input modality.

3.1.1.2 Keyboard

The keyboard may be optionally used to capture the text input if the IPA accepts this input modality.

3.1.2 Presentation

Presentation devices or modality synthesizers are used to provide system output to the user. Additional output modalities can be employed that render their output with a specific modality synthesizer. It is not always required that a verbal auditory output is made as a reply to a user. The user can also become aware of the output as a consequence of an observable action as a result of a Local Service within the Client Layer or an External Services call from the External Data / Services / IPA Providers Layer. In these cases an additional nonverbal auditory output may be considered.

3.1.2.1 Speaker

The loudspeaker is used to output replies as verbal auditory output in the shape of spoken utterances as a primary output modality. Utterances may be accompanied by nonverbal auditory output such as

earcons,
auditory icons or
music.

3.1.2.2 Display

The display may be optionally used to present text output if the IPA supports this output modality.

3.1.3 IPA Client

Clients enable the user to access the IPA via voice with the following characteristics.

Usually, IPA Clients make use of a Microphone to capture the spoken input and a Speaker to provide responses.
The client is activated by means of a Client Activation Strategy.
As an extension IPA Clients may also capture input via text and output text.
As an extension IPA Clients may also capture input from a specific modality recognizer.
As an extension IPA Clients may also capture contextual information, e.g. location, that it obtains from Local Data Providers.
As an extension an IPA Client may also receive commands to be executed locally in the Local Services.
As an extension an IPA Client may also receive multimodal output to be rendered by a respective modality synthesizer.
IPA Clients may need to reference a session identifier.

3.1.3.1 Client Activation Strategy

The Client Activation Strategy defines how the client gets activated to be ready to receive spoken commands as input. In turn the Microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below

Client Activation Strategy	Description
Push-to-talk	The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application.
Hotword	In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. This hotword is usually not part of the spoken command that is passed for further evaluation.
Gesture-to-talk	The user triggers the start of the client by means of a gesture, e.g. raising the hand to be detected by a sensor.
Local Data Providers	In this case, a change in the environment may activate the client, for example if the user enters a room.
...	...

The usage of hotwords includes privacy aspects as the microphone needs to be always active. Streaming to the components outside the user's control should be avoided, hence detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, the detection of hotwords might be required to be extensible.
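
The hotword strategy described above can be sketched as follows. This is a simplified illustration that works on an already transcribed utterance, whereas an actual hotword detector would operate locally on the audio signal. The hotwords, the provider identifier and the `provider` metadata key are assumptions made for this example only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ActivatedInput:
    command: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Hypothetical hotwords; an empty value means no IPA Provider is preselected.
HOTWORDS = {"hey travel": "travel-provider", "computer": ""}


def activate_by_hotword(transcript: str) -> Optional[ActivatedInput]:
    """Return the command following a known hotword, or None if not activated."""
    text = transcript.strip()
    lowered = text.lower()
    for hotword, provider_id in HOTWORDS.items():
        if lowered.startswith(hotword):
            # The hotword itself is not passed on for further evaluation.
            command = text[len(hotword):].lstrip(" ,")
            # A preselected IPA Provider is added as metadata augmenting the input.
            metadata = {"provider": provider_id} if provider_id else {}
            return ActivatedInput(command=command, metadata=metadata)
    return None


print(activate_by_hotword("Hey Travel, book a flight"))
print(activate_by_hotword("book a flight"))
```

Keeping the table of hotwords as data rather than code is one way to keep hotword detection extensible for nested IPAs.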

3.1.3.2 Local Service Registry

A registry for all Local Services and Local Data Providers that can be accessed by the client.

The Local Service Registry maintains a list of Local Services and Local Data Providers, along with their unique identifiers, that may be accessed by the IPA Client or the Context.
The Local Service Registry may allow Local Services and Local Data Providers to be added at runtime.
Local Services and Local Data Providers may be obtained from a standardized marketplace.
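
A minimal sketch of such a registry is given below. The class name, its methods and the identifiers `light.turn_on` and `light.state` are hypothetical; this document does not define an API for the Local Service Registry.

```python
from typing import Callable, Dict, List


class LocalServiceRegistry:
    """Maps unique identifiers to Local Services and Local Data Providers."""

    def __init__(self) -> None:
        self._entries: Dict[str, Callable[[], object]] = {}

    def register(self, identifier: str, entry: Callable[[], object]) -> None:
        # Entries may be added at runtime; identifiers must stay unique.
        if identifier in self._entries:
            raise ValueError(f"identifier already registered: {identifier}")
        self._entries[identifier] = entry

    def lookup(self, identifier: str) -> Callable[[], object]:
        return self._entries[identifier]

    def identifiers(self) -> List[str]:
        return sorted(self._entries)


light_state = {"on": False}


def turn_on_light() -> bool:
    """Toy Local Service that changes the user's environment."""
    light_state["on"] = True
    return light_state["on"]


registry = LocalServiceRegistry()
registry.register("light.turn_on", turn_on_light)             # Local Service
registry.register("light.state", lambda: light_state["on"])  # Local Data Provider

registry.lookup("light.turn_on")()
print(registry.identifiers(), registry.lookup("light.state")())
```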

3.1.4 Local Services

Local Services can be used to execute local actions in the user's local environment. Examples include turning on the light or starting an application, for instance a navigation system in a car.

3.1.5 Local Data Providers

Local Data Providers capture input that is accessible in the user's local environment. They can be used to provide additional input to the IPA Client or to provide additional information that is needed to execute services. An example of the latter is the state of the light, either turned on or turned off.

3.2 Dialog Layer

The Dialog Layer contains the main components to drive the interaction with the user. The following figure details the high-level view of the Dialog Layer shown in Figure 1. The dialog layer may either be traditionally NLU-based as shown in Figure 2 a) or based on Generative AI as shown in Figure 2 b).

Fig 2 b) Generative AI-based Dialog Layer

3.2.1 IPA Service

The general IPA Service API mediates between the user and the overall IPA system. The service layer may be omitted if the IPA Client communicates directly with the Dialog Manager. However, this is not recommended as it may contradict the principle of separation of concerns. The IPA Service has the following characteristics:

The IPA Service receives audio input from the IPA Client and forwards it simultaneously to the local IPA, i.e. the ASR and nested IPAs via the Provider Selection Service.
In case the audio input is augmented with metadata, such as location, the metadata are also simultaneously forwarded to the local IPA, i.e., the NLU and the nested IPAs via the Provider Selection Service.
In case the metadata augmenting the user input contain a pre-selection of an IPA Provider the input is only forwarded to the Provider Selection Service.
Additionally, the IPA Service may receive multimodal input via the modality recognizers from the IPA Client and forward it to the NLU as additional semantic interpretation input to be considered. Deriving the semantic interpretation may require the incorporation of dedicated modality-specific components.
Alternatively, the IPA Service may receive text input from the client and forward that instead of audio input. In this case the ASR is omitted.
The IPA Service receives audio output from the TTS and forwards it to the IPA Client.
Additionally, the IPA Service may receive multimodal output from the Dialog Manager and forward that, in addition to the audio output, to the modality renderers.
Alternatively, the IPA Service may receive text output from the NLG and forward it to the IPA Client. In this case the TTS is omitted.
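
The input routing rules listed above can be summarized in a short, non-normative sketch. The function name, the component labels returned as strings and the `provider` metadata key are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def route_input(kind: str, metadata: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the components that receive an input of the given kind."""
    if kind not in ("audio", "text"):
        raise ValueError(f"unsupported input kind: {kind}")
    metadata = metadata or {}
    if "provider" in metadata:
        # A pre-selected IPA Provider: forward only to the Provider Selection Service.
        return ["ProviderSelectionService"]
    # Text input skips the ASR and goes to the NLU of the local IPA.
    local_target = "ASR" if kind == "audio" else "NLU"
    # Nested IPAs are reached via the Provider Selection Service.
    return [local_target, "ProviderSelectionService"]


print(route_input("audio"))
print(route_input("text"))
print(route_input("audio", {"provider": "travel-provider"}))
```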

3.2.2 ASR

The Automated Speech Recognizer (ASR) receives audio streams of recorded utterances and generates a recognition hypothesis as text strings for the local IPA. Conceptually, ASR is a modality recognizer for speech. It has the following characteristics

The ASR receives recorded voice input from the IPA Service.
The ASR generates a recognition hypothesis from the received audio input optionally with a confidence score.
Optionally, the ASR can generate multiple recognition hypotheses along with a confidence score.
The ASR forwards the recognition hypotheses to the NLU.
The ASR may update the History with the determined recognition hypotheses.
In case of a text-based chatbot, this component will not be needed and input is directly forwarded from the IPA Service to the NLU

3.2.3 NLU

A Natural Language Understanding (NLU) component is able to extract meaning, as intents and associated entities, from an utterance given as a text string.

Intent: An intent is a group of utterances with similar meaning.
Entity: An entity captures additional information to an intent.

The NLU is not needed in LLM-based systems. If available, it has the following characteristics:

The NLU consumes multiple incoming streams, e.g. from the ASR and for metadata augmenting the input from the IPA Service and must synchronize them into a single input, i.e. an input dialog move.
The NLU is able to handle basic functionality via Core Intent Sets to enable any interaction with the user at all.
The NLU may make use of Local Data Providers or Data Providers to access local or external data.
The NLU may make use of the Context to check for complementary information that might have been established throughout the interaction with the user to complete an intent's related entities or include external knowledge.
The NLU forwards the derived semantic input from all received input streams to the Dialog Manager.
Optionally, the NLU can generate multiple intents with their entities along with a confidence score.
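
To make the notions of intent, entity and confidence score more concrete, the following toy sketch merges a recognized text with metadata into interpretations. The keyword table stands in for a trained NLU model, the uniform confidence is a placeholder, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Interpretation:
    intent: str
    entities: Dict[str, str] = field(default_factory=dict)
    confidence: float = 0.0


# Hypothetical keyword table standing in for a trained NLU model.
KEYWORDS = {"flight": "book_flight", "hotel": "book_hotel"}


def understand(text: str, metadata: Dict[str, str]) -> List[Interpretation]:
    """Combine text and metadata into a list of interpretations."""
    words = text.split()
    lowered = [word.lower() for word in words]
    # Metadata augmenting the input, such as location, becomes part of the entities.
    entities = dict(metadata)
    if "to" in lowered and lowered.index("to") + 1 < len(words):
        entities["destination"] = words[lowered.index("to") + 1]
    matched = [intent for keyword, intent in KEYWORDS.items() if keyword in lowered]
    # Placeholder score: the confidence is spread evenly over all matched intents.
    return [Interpretation(intent, dict(entities), 1.0 / len(matched)) for intent in matched]


for result in understand("book a flight to Berlin", {"location": "Paris"}):
    print(result)
```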

3.2.4 Dialog Manager

The Dialog Manager is a component that receives semantic information determined from user input, updates the dialog history, its internal state, decides upon subsequent steps to continue a dialog and provides output, mainly as synthesized or recorded utterances. Conceptually the dialog manager defines the playground that is used by the Dialogs and contributes significantly to the user experience. The Dialog Manager is available in traditional NLU based systems and has the following characteristics

The overall set of available Dialogs defines the behavior and capabilities of the interaction with the IPA.
The Dialog Manager is also responsible for a good user experience across the available Dialogs.
For this, it employs several Dialogs that are responsible for handling isolated tasks or intents. The following types of dialogs exist:
- Core Dialog
- Dialog X
The Dialog Manager follows the principle to fill in all slots that are known before prompting the user for additional slots.
The Dialog Manager receives input for the local IPA from the NLU and for the remote IPAs from the Provider Selection Service
The Dialog Manager selects the best suited input from the available input alternatives for further processing. For this, it should generally expect that the user may switch the goals and thus dialog flows at any time and should consider confirming that, but must also consider ongoing workflows that must not be interrupted.
The Dialog Manager may define a maximum timespan to wait for the various inputs to arrive and consider only those that arrive within that limit.
The Dialog Manager may update the History with dialog moves, i.e., determined input and output.
The Dialog Manager determines the Dialog following a Dialog Strategy that is best suited to serve the current user input and re-establishes the interaction state for that Dialog. Therefore, it may use the Dialog Registry.
The Dialog Manager receives the next dialog move as output from the selected Dialog.
Optionally, the Dialog Manager may receive the next dialog move via the IPA Service from the selected IPA Provider
The Dialog Manager makes use of the NLG to generate text to be converted into audio data by the TTS to be rendered on the IPA Client.
Alternatively, the Dialog Manager may receive audio output from the selected IPA Provider, e.g., to support branding. In this case, the output is directly sent to the IPA Service.
Alternatively, the Dialog Manager may receive text output from the selected IPA Provider, e.g., to support branding. In this case, the output is directly sent to the TTS.
As an extension, it may also provide commands as output to be executed by the IPA Client in the Local Services
As an extension, it may also provide commands as output to be executed by the Provider Selection Service in the External Services.
As an extension, Dialogs may also return multimodal output or text to be rendered by a respective modality synthesizer on the IPA Client.
The Dialog Manager may manage a session wrapping the overall interaction of a user with the IPA.
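
Two of the characteristics above, selecting the best input among the alternatives that arrive in time and filling all known slots before prompting, can be sketched as follows. Selecting by highest confidence is only one possible policy and is an assumption of this example, as are all names used.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InputAlternative:
    source: str            # e.g. "local" or an IPA Provider identifier
    intent: str
    slots: Dict[str, str]
    confidence: float
    arrival_s: float       # seconds after the end of the user input


def select_input(alternatives: List[InputAlternative],
                 max_wait_s: float) -> Optional[InputAlternative]:
    """Pick the most confident alternative that arrived within the time limit."""
    in_time = [a for a in alternatives if a.arrival_s <= max_wait_s]
    return max(in_time, key=lambda a: a.confidence, default=None)


def missing_slots(required: List[str], known: Dict[str, str]) -> List[str]:
    """Slots the user still has to be prompted for, in the required order."""
    return [name for name in required if name not in known]


alternatives = [
    InputAlternative("local", "book_flight", {"destination": "Berlin"}, 0.7, 0.2),
    InputAlternative("travel-provider", "book_flight", {"destination": "Berlin"}, 0.9, 0.8),
    InputAlternative("slow-provider", "play_music", {}, 0.95, 3.0),
]
best = select_input(alternatives, max_wait_s=1.0)
print(best.source, missing_slots(["destination", "date"], best.slots))
```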

3.2.4.1 Dialog Strategy

A Dialog Strategy is a conceptualization of a dialog for an operationalization in a computer system. It defines the representation of the dialog's state and respective operations to process and generate events relevant to the interaction. This specification is agnostic to the employed Dialog Strategy. Examples of dialog strategies include

Dialog Strategy	Example
State-based	State Chart XML (SCXML): State Machine Notation for Control Abstraction
Frame-based	Voice Extensible Markup Language (VoiceXML) 2.1
Plan-based	Information State Update
Dialog State Tracking	Machine Learning for Dialog State Tracking: A Review
...	...

3.2.4.2 Session

Dialog execution can be governed by sessions, e.g. to free resources of ASR and NLU engines when a session expires. Linguistic phenomena, like anaphoric references and ellipsis, are expected to work within a session. Conceptually, multiple sessions can be active in parallel on a single IPA depending on the capabilities of the IPA. The selected IPA Providers or the Dialog Manager may have leading roles for the task of session management.

A session begins when

the user starts to interact with an IPA via a client activation strategy, or
the IPA pro-actively notifies the user

may continue over multiple interaction turns, i.e. an input and output cycle, and ends

if the user explicitly ends the interaction with the IPA,
if the IPA ends the interaction with the user, e.g. by saying "Goodbye", or
if the user does not start a new input within a predefined time span.

This includes the possibility that a session may persist over multiple requests.
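
The session life cycle described above can be illustrated with a small sketch that uses explicit timestamps so that it stays deterministic. The class and its fields are hypothetical; actual session management is left to the Dialog Manager or the selected IPA Providers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Session:
    timeout_s: float                 # predefined time span without new input
    last_input_s: float              # time of the most recent input
    turns: List[str] = field(default_factory=list)
    ended: bool = False

    def is_active(self, now_s: float) -> bool:
        return not self.ended and (now_s - self.last_input_s) <= self.timeout_s

    def add_turn(self, user_input: str, now_s: float) -> bool:
        """Record a turn; return False if the session has already expired."""
        if not self.is_active(now_s):
            self.ended = True
            return False
        self.turns.append(user_input)
        self.last_input_s = now_s
        return True

    def end(self) -> None:
        """Explicit end, e.g. after the IPA says 'Goodbye'."""
        self.ended = True


session = Session(timeout_s=30.0, last_input_s=0.0)
print(session.add_turn("What is the weather in Berlin?", now_s=10.0))
print(session.add_turn("And tomorrow?", now_s=35.0))
print(session.add_turn("Thanks", now_s=100.0))
```

In this sketch the second input arrives 25 seconds after the first one and is accepted, while the third one arrives after the timeout and would have to start a new session.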

3.2.5 Prompt Adaptation

Prompt adaptation is the process of adjusting the output of the IPA to the user's needs. This may include adjusting the prompt to the user's preferences, the user's current context, or the user's current environment. This component is not needed in traditionally NLU-based systems but is essential for LLM based systems. It has the following characteristics

It usually receives the decoded input from the ASR.
It optionally augments the received user input with additional information that is useful for the LLM to provide better responses.
An LLM usually does not maintain conversational state and may make use of the history to optimize the output by augmenting the prompt for this purpose.
It may receive additional input from remote IPAs via the Provider Selection Service and may additionally augment the prompt with these inputs.
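
A minimal sketch of such prompt augmentation is shown below. The prompt layout, the function name and its parameters are assumptions made for this example; the format actually expected depends on the LLM in use.

```python
from typing import Dict, List, Tuple


def adapt_prompt(user_input: str,
                 history: List[Tuple[str, str]],
                 context: Dict[str, str],
                 max_turns: int = 3) -> str:
    """Augment the decoded user input with context and recent history."""
    lines = [f"[context] {key}: {context[key]}" for key in sorted(context)]
    # The LLM keeps no conversational state, so recent turns are replayed.
    recent = history[-max_turns:] if max_turns > 0 else []
    lines.extend(f"{speaker}: {utterance}" for speaker, utterance in recent)
    lines.append(f"user: {user_input}")
    return "\n".join(lines)


prompt = adapt_prompt(
    "And tomorrow?",
    history=[("user", "What is the weather today?"), ("assistant", "Sunny.")],
    context={"location": "Berlin"},
)
print(prompt)
```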

3.2.6 LLM

LLM stands for Large Language Model. These models can conceptually be perceived as a special type of Dialog Manager that also includes the NLU and NLG components. An LLM is not needed in NLU-based systems and has the following characteristics

It receives a prompt as input from the Prompt Adaptation.
It may also receive audio data if the LLM is able to handle audio input directly.
It may receive additional multimodal input from the IPA Service.
It usually generates the next dialog move as text-based output and forwards it to the TTS.
It may bypass the TTS and send the output directly to the IPA Client as audio data if the LLM is able to handle that.
It may also provide commands as output to be executed by the IPA Client in the Local Services.
It may also provide commands as output to be executed by the Provider Selection Service in the External Services.
It may make use of Knowledge Graphs to optimize the output, e.g. to make the output more accurate and reliable.

3.2.7 Context

During the interaction with a user all kinds of information are collected and managed in the so-called conversation context or dialog context. It contains all the short and long term information needed to handle a conversation and thus may exceed the concept of a session. It also serves for context-based reasoning with the help of the Knowledge Graph and helps the NLG to generate the output to the user. It is not possible to capture each and every aspect of what context should comprise, as discussions about context are likely to end up in trying to explain the world. For the sake of this specification it should be possible to deal with the following characteristics

The dialog context is enhanced to build interaction with the user (grounding) from spoken and other input.
The Context supports the Dialog Manager to get the needed information for a current dialog
The Context supports the Dialog Manager to get the needed information when switching from one dialog context to another
The Context supports the NLU to determine meaning from the user's input, also by reasoning via a Knowledge Graph.
The Context supports the NLG to create the reply to the user, e.g. to avoid repetition of information that is already known.
The Context may make use of the Local Service Registry to include external knowledge from Local Data Providers
The Context may make use of the External Service Registry to include external knowledge from External Data Providers
The Context may make use of the Provider Selection Service to include external knowledge from Data Providers
The Context may provide external knowledge temporarily to the Knowledge Graph to be considered in reasoning.

3.2.7.1 History

The Dialog History mainly stores the past dialog events per user. Dialog events include users’ transcriptions, semantic interpretations and resulting actions. Thus, it has information on how the user reacted in the past and knows her preferences. The history may also be used to resolve anaphoric references in the NLU or can be used as temporary knowledge in the Knowledge Graph.

Generative AI models may also use the history to add conversational context to the prompt.
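As a minimal sketch of this idea, the history can be pictured as a per-user list of dialog events from which the most recent turns are rendered into a prompt prefix. The names `DialogEvent`, `DialogHistory` and `prompt_context` are illustrative assumptions and are not defined by this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DialogEvent:
    """One past dialog event (hypothetical structure)."""
    transcription: str    # what the user said
    interpretation: dict  # semantic interpretation, e.g. intent and entities
    action: str           # resulting system action


class DialogHistory:
    """Stores past dialog events per user."""

    def __init__(self) -> None:
        self._events = defaultdict(list)  # user id -> list of DialogEvent

    def add(self, user_id: str, event: DialogEvent) -> None:
        self._events[user_id].append(event)

    def prompt_context(self, user_id: str, max_turns: int = 3) -> str:
        """Render the most recent turns as conversational context for a prompt."""
        recent = self._events[user_id][-max_turns:]
        return "\n".join(
            f"User: {e.transcription}\nSystem: {e.action}" for e in recent
        )
```

Limiting the context to the last few turns is one possible design choice; an actual implementation might instead summarize older turns.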

3.2.7.2 Knowledge Graph

The system uses a knowledge graph, e.g., to reason about entities and intents. These may be received from the input detected by the NLU or from Data Providers, and reasoning over them yields more meaningful data that better matches the current task. One example is the use of the name of a person as a navigation target, as a person usually has an address that qualifies to be used in navigation tasks.
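The navigation example can be sketched with a toy triple store. This is only an illustration under stated assumptions: the predicates `is_a` and `has_address`, the person and the address are all invented, and a real system would more likely build on RDF or OWL tooling.

```python
from typing import Optional

# Toy knowledge graph as (subject, predicate) -> object entries.
# All entries are invented for illustration.
TRIPLES = {
    ("Alice Example", "is_a"): "Person",
    ("Alice Example", "has_address"): "1 Example Street, Springfield",
}


def resolve_navigation_target(entity: str) -> Optional[str]:
    """Return an address usable for navigation, or None if none can be derived."""
    if TRIPLES.get((entity, "is_a")) == "Person":
        # A person qualifies as a navigation target through their address.
        return TRIPLES.get((entity, "has_address"))
    return None
```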

3.2.8 NLG

The natural language generation (NLG) component is responsible for preparing the natural language text that represents the system’s output. NLG is not needed in LLM-based architectures. It has the following characteristics

The NLG receives the output dialog move from the Dialog Manager.
The NLG may make use of the Context to optimize the output.
The NLG sends the text string to be spoken to the TTS.
The NLG may update the History with the generated output.
In case of a text-based chatbot, the NLG forwards its output directly to the IPA Service.
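One simple way to realize such a component is template-based generation, sketched below. The dialog move names and templates are hypothetical; the document does not prescribe how the NLG produces its text.

```python
# Hypothetical dialog move names mapped to output templates.
TEMPLATES = {
    "confirm-flight": "Do you want to fly from {origin} with {airline}?",
    "request-slot": "Which {slot} would you like?",
}


def generate(move: str, slots: dict) -> str:
    """Turn an output dialog move into the text string handed to the TTS."""
    return TEMPLATES[move].format(**slots)
```

For example, `generate("confirm-flight", {"origin": "San Francisco", "airline": "American"})` yields the confirmation question used in the walkthrough later in this document.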

3.2.9 TTS

The Text-to-Speech (TTS) component receives text strings, which it converts into audio data. Conceptually, the TTS is a modality specific renderer for speech. It has the following characteristics

The TTS receives its input from the NLG
Alternatively, the TTS may receive its input from the Dialog Manager if the output originates from an IPA Provider
Multiple TTS instances may exist in parallel, e.g. to distinguish between different active dialogs. In this case it is up to the current Dialog to specify the TTS engine to use.
In case of a text-based chatbot, this component will not be needed.

3.2.10 Dialogs

Dialogs support interaction with the user. They include Core Dialogs, which are built into the system, and provide basic interactions, as well as more specialized dialogs which support additional functionality.

3.2.10.1 Core Dialog

Core Dialogs are logical entities that handle basic functionality via Core Intent Sets to enable interaction with the user at all. They include, among others

Core Dialog	Purpose
Greeting	Welcome the user and prepare for initial input.
Help	The user asked for more guidance.
Goodbye	Terminate the interaction with the user.
Service not available	The dialog relies on reaching out for a specific service but was not able to reach it, e.g. because of connection issues.
Intent not known	The Provider Selection Service returned an intent that cannot be handled by a corresponding Dialog.
No input	The user did not say anything within a predefined timespan
Error	An unknown error occurred, see also error handling
Transfer to external IPA Provider	Notify the user that the following dialog steps will be handled outside the scope of this IPA.
...	...

Conceptually, the Core Dialog is a special Dialog as described in the following section that is always available.

3.2.10.2 Dialog

A Dialog is able to handle functionality that can be added to the capabilities of the Dialog Manager through its associated Intent Sets. Dialogs are logical entities within the overall description of the interaction with the user, executed by the Dialog Manager. Dialogs must serve different purposes in the sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. Dialogs have the following characteristics

Dialogs receive inputs as intents out of their supported Intent Sets along with associated entities and return responses as text strings to be spoken.
Dialogs reference all Intents from the Intent Sets that they need to fulfill their service.
Dialogs do not require the existence of a corresponding Intent Set.
Dialogs are expected to be slot-based and may specify entities from an Intent Set that are filled after their execution.
Dialogs may specify follow-up dialogs that are to be executed once execution of this dialog is completed.
Dialogs may specify clarification dialogs by name or by a list of entities from an Intent Set.
As an extension, Dialogs may also return commands to be executed by the IPA Client.
As an extension, Dialogs may also return multimodal output to be rendered by a respective modality synthesizer on the IPA Client.
Dialogs access the Provider Selection Service to fulfill their task. They maintain state which they also share with the Dialog Manager and know which IPA Provider evaluated their request with the help of an identifier.
A Dialog may specify a TTS engine to use in case there are multiple engines available.
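The slot-based behavior described above could look roughly like the following sketch. The class `SlotDialog`, its fields and the prompt wording are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class SlotDialog:
    """A slot-based Dialog (hypothetical structure)."""
    name: str
    required_slots: list
    slots: dict = field(default_factory=dict)

    def handle(self, intent: str, entities: dict) -> str:
        """Fill slots from the entities and return the next text to be spoken."""
        for key, value in entities.items():
            if key in self.required_slots:
                self.slots[key] = value
        # The Dialog keeps state across turns: already filled slots are remembered.
        missing = [s for s in self.required_slots if s not in self.slots]
        if missing:
            return f"Please tell me the {missing[0]}."
        return f"All information for {self.name} is complete."
```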

3.2.10.3 Core Intent Sets

A Core Intent Set usually identifies tasks to be executed and defines the capabilities of the Core Dialog. Conceptually, the Core Intent Sets are Intent Sets that are always available.

3.2.10.4 Intent Sets

Intent Sets define actions, identified by the name of the intent, along with their parameters as entities, as produced by the NLU, that can be consumed by a corresponding Dialog. They have the following characteristics

An Intent Set defines one or more intents with an optional number (including none) of entities to fulfill the corresponding action.
An Intent Set abstracts from the actual Intent Sets defined by the Intent Providers, e.g. it maps plan-travel or plan-air-travel, as used by different Intent Provider implementations, onto the one used in the Dialogs for travel-planning. In case the Intent Provider is identical to the platform provider, they may match.
Intent Sets must be matched carefully so as not to break the user experience, as the various Intent Sets may not match one-to-one. Therefore, the intents used in the Dialogs may be restricted to a specific Intent Set as an addition to the default behavior.
It can be used in one or more Dialogs.
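The abstraction from provider-specific intents can be sketched as a lookup table. The provider identifiers are hypothetical, and the sketch assumes that the intent used in the Dialogs is itself named `travel-planning`; unmapped intents are passed through unchanged.

```python
# Provider-specific intent names mapped onto the intent used in the Dialogs.
# Provider identifiers are invented; intent names follow the example above.
INTENT_MAPPING = {
    ("provider-a", "plan-travel"): "travel-planning",
    ("provider-b", "plan-air-travel"): "travel-planning",
}


def to_dialog_intent(provider: str, intent: str) -> str:
    """Map a provider intent to the Dialog's intent, else keep the received intent."""
    return INTENT_MAPPING.get((provider, intent), intent)
```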

3.2.10.5 Dialog X

The Dialog X's are able to handle functionality that can be added to the capabilities of the Dialog Manager through their associated Intent Set X. A Dialog X extends the Core Dialogs and adds functionality through custom Dialogs. The Dialog X's must serve different purposes in the sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. They have the same characteristics as a Dialog.

3.2.10.6 Intent Set X

An Intent Set X is a special Intent Set that identifies tasks that can be executed within the associated Dialog X.

3.2.10.7 Dialog Registry

The Dialog Registry manages all available Dialogs with their associated Intent Sets with respect to the current Dialog Strategy. This means that it is the Dialog Registry that knows which Dialog to use for a given intent. For some Dialog Strategies this component may be omitted as its role is taken over by the Dialog Manager. One such case is when the Dialog Strategy does not allow for the dynamic handling of Dialogs as described below.

Dialogs and their Intent Sets can be added or removed as needed.
The Dialog Registry may notify the Dialog Manager if Dialogs have been added or removed.
The Dialog Registry may be queried by the Dialog Manager for Intent Sets that are referenced in a Dialog.
The Dialog Registry may be queried by the Dialog Manager for follow-up or clarification Dialogs that are referenced in a Dialog by name or a list of entities from an Intent Set.
Intent Sets will be removed if there are no more Dialogs referencing them.
The Dialog Registry ensures that added Dialogs are unique.
The Dialog Registry is not responsible for knowing about the counterparts in the External Data / Services / IPA Providers Layer.
The Dialog Registry notifies the Provider Selection Service if Dialogs have been added or removed.
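A few of these characteristics (adding and removing Dialogs, uniqueness, and dropping Intent Sets that are no longer referenced) can be sketched as follows. The class and method names are assumptions, and notification of other components is left out.

```python
class DialogRegistry:
    """Tracks Dialogs and the Intent Sets they reference (hypothetical API)."""

    def __init__(self) -> None:
        self._dialogs = {}  # dialog name -> set of referenced Intent Set names

    def add(self, dialog: str, intent_sets: set) -> None:
        # The registry ensures that added Dialogs are unique.
        if dialog in self._dialogs:
            raise ValueError(f"Dialog {dialog!r} is already registered")
        self._dialogs[dialog] = set(intent_sets)

    def remove(self, dialog: str) -> None:
        del self._dialogs[dialog]

    def intent_sets(self) -> set:
        """Intent Sets still referenced by at least one Dialog."""
        return set().union(*self._dialogs.values())

    def dialogs_for(self, intent_set: str) -> list:
        """Names of all Dialogs that reference the given Intent Set."""
        return sorted(d for d, sets in self._dialogs.items() if intent_set in sets)
```

Because the known Intent Sets are derived from the registered Dialogs, an Intent Set disappears automatically once the last Dialog referencing it is removed.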

3.3 External Data / Services / IPA Providers Layer

3.3.1 Provider Selection Service

A service that provides access to all known Data Providers, External Services and IPA Providers. This service also maps the IPA Intent Sets to the Intent Sets in the Dialog layer. It has the following characteristics

The Provider Selection Service provides an interface to Data Providers, External Services and IPA Providers.
The Provider Selection Service may receive input from the Dialog Manager to query data from Data Providers.
The relevant Data Provider is obtained via its unique id from the External Service Registry.
The Provider Selection Service may receive input from the Dialog Manager to execute External Services.
The relevant External Service is obtained via its unique id from the External Service Registry.
The Provider Selection Service receives input as audio data along with metadata
In case the Provider Selection Service is called with a preselected identifier of an IPA Provider, only this one will be used, as obtained from the Provider Registry.
In case there are no IPA Providers preselected, the Provider Selection Service has to follow a Provider Selection Strategy as detailed below to determine those IPA Providers that are best suited to answer the request. The resulting list of IPA Provider candidates is queried in parallel and those that return the n-best results are selected (n ≥ 1). Determining the best result considers at least a confidence score but may be improved by other metrics. The filtered list may require disambiguation in an additional dialog step.
The Provider Selection Service makes use of Accounts/Authentication to access IPA Providers.
The Provider Selection Service uses the Provider Registry to map the Provider Intent Sets to the Intent Sets known by the Dialog Registry. The mapping must be configured when IPA Providers are added.
IPA Providers and the Accounts/Authentication to access them can be added or removed as needed.
In case no mapping to the Intent Sets known by the Dialog Registry is possible, the received Intent is used.
In case the Provider Selection Service retrieves a session identifier from the selected IPA Provider it stores it in the Provider Registry, e.g. for follow-up questions. Usually, this session identifier is different to the session identifier which is known by the Dialog Manager.
The Provider Selection Service is stateless and always returns the n-best responses from the used IPA Providers along with an identification of the issuing IPA Provider.
Alternatively, the Provider Selection Service may return output as text strings to be rendered by the TTS
Alternatively, the Provider Selection Service may return audio output to be played by the Speaker
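The parallel querying of candidates and the n-best selection by confidence score might be sketched as below. This is an assumption-laden illustration: a provider is modeled as a plain callable returning an intent and a confidence, and authentication, intent mapping and session handling are omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def select_n_best(providers: dict, text: str, n: int = 1) -> list:
    """Ask all candidates in parallel and keep the n results with highest confidence.

    `providers` maps a provider identifier to a callable that takes the text
    and returns an (intent, confidence) pair (hypothetical interface).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor() as pool:
        futures = {pid: pool.submit(call, text) for pid, call in providers.items()}
        # Each result keeps the identifier of the issuing IPA Provider.
        results = [(pid, *future.result()) for pid, future in futures.items()]
    return sorted(results, key=lambda r: r[2], reverse=True)[:n]
```

Only the confidence score is used for ranking here; as stated above, other metrics may improve the selection.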

3.3.1.1 Provider Selection Strategy

The Provider Selection Strategy aims at determining those IPA Providers that are most likely suited to handle the current input. Generally, the system should not make any assumptions about the user's current input, as she may switch goals with each input, but there may be some deviating use cases. The provider selection strategy may be implemented, for example, as one of the following options or a combination thereof to determine a list of IPA Provider candidates.

All known IPA Providers are used. This strategy may only apply if there are only a small number of IPA Providers.
The IPA Providers are filtered by contextual data that is obtained from the client, e.g. location.
The IPA Providers are filtered by established knowledge about the user, e.g. language.
The IPA Providers are filtered based on user preferences.
The IPA Providers are filtered by knowledge that has been determined in the dialog with the user. This includes leading wake-up phrases like Hey Siri, …, OK Google, …. For this, preprocessing of the user input by the NLU may be required.

In case the IPA Provider does not abstract from determining a relevant list of intents, the same strategy may be applied to determine the n-best intents.
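A combination of two of the options above, filtering by the user's language and then by a leading wake-up phrase, could be sketched like this. The `ProviderInfo` fields and provider identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderInfo:
    """Metadata about an IPA Provider (hypothetical fields)."""
    provider_id: str
    languages: set
    wake_phrases: set = field(default_factory=set)


def candidates(providers: list, language: str, utterance: str) -> list:
    """Filter by user language, then narrow down by a leading wake-up phrase."""
    by_language = [p for p in providers if language in p.languages]
    lowered = utterance.lower()
    by_phrase = [
        p for p in by_language
        if any(lowered.startswith(w.lower()) for w in p.wake_phrases)
    ]
    # Without a matching wake-up phrase, all language matches stay candidates.
    chosen = by_phrase or by_language
    return [p.provider_id for p in chosen]
```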

3.3.1.2 Provider Registry

A registry for all IPA Providers that can be accessed. It has the following characteristics

The Provider Registry can be queried for a list of IPA Providers along with their unique identifier.
Each of the IPA Providers should have a list of names in the supported languages to allow for preselecting the IPA Providers in an utterance or to allow for disambiguation of multiple IPA Providers in an additional dialog step.
The Provider Registry can return an IPA Provider for a given identifier.
The Provider Registry knows the Intent Sets of a specific IPA Provider from the addition of that IPA Provider.
Each Intent from the Intent Sets of a specific IPA Provider must also specify the mapping to the Intent Sets known by the Dialog Registry.
Each IPA Provider may have an associated session identifier to resume an existing session.
IPA Providers may be obtained from a standardized market place.

3.3.1.3 Accounts/Authentication

A registry that knows how to access the known IPA Providers, i.e., which are available and credentials to access them. Storing of credentials must meet security and trust considerations that are expected from such a personalized service. It has the following characteristics

It returns an authentication means for a key of an IPA Provider that is known to the Provider Registry
In case an IPA Provider does not require authentication, this is indicated to the caller.

3.3.2 External Service Registry

A registry for all External Services and Data Providers that can be accessed by the client

The External Service Registry maintains a list of External Services and Data Providers along with their unique identifier that may be accessed by the Provider Selection Service or the Context.
The External Service Registry may allow to add External Services and Data Providers at runtime.
External Services and Data Providers may be obtained from a standardized market place.

3.3.3 Data Providers

Data Providers obtain data from various external sources for use in the interaction, for example, data obtained from a third-party web service.

3.3.3.1 Data Provider X

A data provider to get data to be used in the Dialog, e.g. as a result of a query.

3.3.4 External Services

External Services provide access to trigger actions outside of the system; for example, triggered from a third-party web service.

3.3.4.1 External Service X

A specific External Service that provides output of the system, e.g. through an application. An application can use multiple External Services.

3.3.5 IPA Providers

IPA providers provide IPAs that can interact with users in an application.

In this sense an IPA might be again a fully fledged IPA, with the exception of the Client Layer as this IPA will take over the role of a client to the nested IPA. Actually, this can be perceived as the Matryoshka (or Russian Doll) principle¹. Each IPA may be perfectly used as is but can also be approached by other IPAs.

3.3.5.1 IPA Provider X

A provider of an IPA service, like

Google Assistant
Amazon Alexa
Microsoft Cortana
SoundHound
…

The IPA provider may be part of the IPA implementation as an IPA Provider or alternatively a subset of the original functionality as described below as part of another IPA implementation.

3.3.5.2 Provider ASR

An ASR component receives audio streams of recorded utterances and generates a recognition hypothesis as text strings as an input for the Provider NLU.

3.3.5.3 Provider NLU

An NLU component that is able to extract meaning as intents and associated entities from an utterance as text strings for IPA Provider X. It has the following characteristics

The Provider NLU may be specialized to handle specific domains
Optionally, the Provider NLU can generate multiple intents with their entities along with a confidence score.
The Provider NLU may make use of its own Provider Intent Sets, independent of the Core Intent Sets, which are then mapped in the Provider Selection Service so that they can be consumed by the Dialog Manager
The Provider NLU may make use of the Data Provider to access local or internal data or access external services.
The Provider NLU may make use of the Knowledge Graph to derive meaning.

3.3.5.4 Provider Intent Set

An Intent Set that might be returned by the Provider NLU to handle the capabilities of IPA Provider X.

3.3.5.5 Provider LLM

An LLM that is used by the provider.

3.6 Resulting Architecture

The previous sections showed a more detailed view of the architectural building blocks. A general overview comprising these details is shown in the following figure. Note that NLU-based and Generative AI-based architectures are combined in this figure; only the components required by the envisioned IPA type are needed.

NLU-based IPA Architecture — Fig. 3 Complete architecture of an IPA

4. Error Handling

Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.

Along the processing path errors may occur

in the response of a call to another component
inside this component to be further processed by subsequent components

As a consequence of the latter, components must be prepared to receive an error message or a list thereof instead of the actually expected data. Errors should only be forwarded in case there are no valid continuations that have a chance to provide a response to the IPA user.

In case multiple errors are received the component should try (in the following order) to

check if there is a possible continuation to handle the error with a meaningful reply to the user
identify the most severe error and optionally forward this error along with a list of less severe errors
derive a new higher-level error from the received errors and forward this higher-level error

In case errors could be handled it is recommended to log the errors for debugging.

An error message should contain at least

an error code that could be transformed into an IPA response matching the language and conversation
a human-readable error message for logging and debugging
an id of the component that has produced or handled the error
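The minimal error message and the selection of the most severe error could be represented as sketched below. The `Severity` scale is an assumption added for illustration, since this document does not define how severity is expressed; the three other fields follow the list above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; not defined by this document."""
    WARNING = 1
    ERROR = 2
    FATAL = 3


@dataclass(frozen=True)
class ErrorMessage:
    code: str          # can be transformed into an IPA response
    message: str       # human-readable, for logging and debugging
    component_id: str  # component that produced or handled the error
    severity: Severity = Severity.ERROR


def most_severe(errors: list) -> tuple:
    """Return the most severe error and the list of remaining, less severe ones."""
    ranked = sorted(errors, key=lambda e: e.severity, reverse=True)
    return ranked[0], ranked[1:]
```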

5. Use Case Walk Through

This section needs to be updated to match the changes as introduced above.

This section expands on the use case above, filling in details according to the sample architecture.

A user would like to plan a trip to an international conference and she needs visa information and airline reservations.

The user starts by asking a general purpose assistant (IPA Client, on the left of the diagram) about what the visa requirements are for her situation. For a common situation, such as citizens of the EU traveling to the United States, the IPA is able to answer the question directly from one of its dialogs 1-n getting the information from a web service that it knows about via the corresponding Data Provider. However, for less common situations (for example, a citizen of South Africa traveling to Japan), the generic IPA will try to identify a visa expert assistant application from the dialog registry. If it finds one, it will connect the user with the visa expert, one of the IPA providers on the right side. The visa expert will then engage in a dialog with the user to find out the dates and purposes of travel and will inform the user of the visa process.

Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. If she wants to use a particular service, or use a particular airline, she would say something like "I want to book a flight on American". The IPA will then either connect the user with American's IPA or, if American doesn't have an IPA, will inform the user of that fact. On the other hand, if the user doesn't specify an airline, the IPA will find a general flight search IPA from its registry and connect the user with the IPA for that flight search service. The flight search IPA will then interact with the user to find appropriate flights.

A similar process would be repeated if the user wants to book a hotel, find a rental car, find out about local attractions in the destination city, etc. Booking a hotel could also involve interacting with the conference's IPA to find out about a designated conference hotel or special rates.

5.1 Detailed Walkthrough

This section provides a detailed walkthrough for an NLU-based IPA that aligns the steps in the use case interaction with the architecture. It covers only the part of the example above in which the user asks for a flight with a specific airline. This very basic example assumes that this is the first request to the IPA and that there is a suitable dialog ready that matches the user's request. The flow may also vary, e.g., depending on the Dialog Strategy used and on other optional items. The walkthrough is split into two parts, one for the input path and one for the output path.

5.1.1 Walkthrough for the Input Path

We begin with the case where the user's request can be handled by one of the internal Dialogs in the Dialog box. The input side is illustrated in the following figure

IPA Architecture Walkthrough for the input — Fig. 3 Walkthrough for the input path of an IPA

The user asks the IPA Client about travel between the EU and the United States. The IPA Client captures the audio with the help of the microphone.
Requests are usually augmented by other data. The GPS location is one example that could be useful. Therefore the IPA Client asks the Local Data Provider for GPS for the current location...
...and gets it back. In this case the GPS coordinates from Mountain View, California.
The audio is sent along with all augmenting data to the IPA Service.
The IPA Service forwards the received data simultaneously to the ASR in the local path and to the Provider Selection Service in the remote path.
The decoded text of the user's request, in this example "I want to book a flight on American", is sent with all augmenting data in parallel to the NLU component for the local path and to the Provider Selection Service for the remote path.
In the local path the NLU tries to determine intents and entities from the decoded text. For our example this may be intent: plan-flight-travel with entity destination: American. The NLU component makes use of the context to check whether there is complementary information that might have been established throughout the interaction with the user, such as preferred times for departure or arrival.
There was no info to add from the history but the GPS information could be mapped with the help of the Knowledge Graph to origin: SFO so the local input path is completed with this step with the result: plan-flight-travel with entities airline: American, origin: SFO.
The remote path starts with the Provider Selection Service asking the Provider Registry for suitable IPA Providers for the incoming request.
The Provider Registry filters the suitable IPA Providers and asks for credentials at the Accounts/Authentication component. For the example, these may be those supporting English. At this level, only the pure text is known and the used language. Further knowledge about the user may be helpful to reduce these candidates.
The Provider Registry receives the credentials for the IPA Provider candidates.
The Provider Selection Service receives the list of IPA Providers along with their credentials, if any, back.
The Provider Selection Service forwards the text "I want to book a flight on American" from the utterance and the GPS coordinates for Mountain View to the received list of IPA Providers in parallel to determine meaning, which completes the remote input path.

5.1.2 Walkthrough for the Output Path

The output path begins where the local NLU and IPA Providers are able to deliver their results. In both paths the best match for the intents and entities based on the received data have been identified. This path is illustrated in the following figure

IPA Architecture Walkthrough for the output — Fig. 4 Walkthrough for the output path of an IPA

The IPA Providers send their determined intents along with recognized entities to the Provider Selection Service. For our example this may be
- IPA Provider 1: phantastic-plan-flight-travel with entities preferred-airline: American, preferred-origin: SFO.
- IPA Provider 2: rail-plan-travel with entity destination-station: American.
- IPA Provider 3: transfer-money with no entities
Note that the reply also contains an identification of the provider for their result. This allows pre-selection of a provider in possible follow-up dialog turns.
The Provider Selection Service maps the custom intents and entities to the core intents and entities that can be understood in the dialogs. For our example this could be
- IPA Provider 1: plan-flight-travel with entities airline: American, origin: SFO.
- IPA Provider 2: plan-rail-travel with entity destination: American.
- IPA Provider 3: transfer-money with no entities
It then sends this mapped result to the Dialog Manager as an n-best list.
On the local path the NLU sends its result to the Dialog Manager. For our example this could be
- Local NLU: plan-flight-travel with entities destination: American, origin: SFO.
The Dialog Manager determines an n-best list of meanings from the local and remote path as
- IPA Provider 1: plan-flight-travel with entities airline: American, origin: SFO.
- Local NLU: plan-flight-travel with entities destination: American, origin: SFO.
- IPA Provider 2: plan-rail-travel with entity destination: American.
- IPA Provider 3: transfer-money with entities: bank: American, purpose: book flight
It selects the best-suited reply. For our example, it may remove the results from IPA Provider 2 and IPA Provider 3 as the confidence for the entity is very low, and it updates the History with the determined dialog move from the user. IPA Provider 1 and the Local NLU return the same result; however, due to the employed rules, IPA Provider 1 is selected, as cloud-based providers are expected to have better accuracy than local engines because of constraints in the embedded environment.
The Dialog Manager then sends the intent plan-flight-travel to the Dialog Registry to determine the corresponding dialog...
...and receives the dialog to use back. For the example this may be the plan-flight-travel-dialog.
The Dialog Manager calls the plan-flight-travel dialog and fills all known entities. In our example, the slots for airline and origin would be filled.
The Dialog determines the next dialog step and indicates the request for a system move to query the user for the missing data.
The History is updated with this dialog move ...
...and forwarded to the NLG to create a response.
The NLG makes use of the Context to check output preferences and already established knowledge between the user and the system that might be used in the reply...
...and receives the info back to come up with the question "Do you want to fly from San Francisco with American?".
The NLG forwards the text string "Do you want to fly from San Francisco with American?" to the TTS to be converted into audio.
The TTS engine sends the audio file from the response to the IPA Client to be made audible...
...in the Speaker.
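The selection step performed by the Dialog Manager in this walkthrough can be sketched as follows. The confidence values, the threshold and the `Hypothesis` structure are invented for illustration; the document only states that low-confidence results are removed and that cloud-based providers are preferred over the local engine when results are otherwise equal.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    source: str        # e.g. "IPA Provider 1" or "Local NLU"
    intent: str
    confidence: float  # invented values; the document gives no numbers
    remote: bool       # True for cloud-based IPA Providers


def select_best(hypotheses: list, threshold: float = 0.5) -> Hypothesis:
    """Drop low-confidence results, then prefer remote results on equal confidence."""
    kept = [h for h in hypotheses if h.confidence >= threshold]
    if not kept:
        raise ValueError("no hypothesis above the confidence threshold")
    # Tuples compare element-wise, so `remote` only breaks ties in confidence.
    return max(kept, key=lambda h: (h.confidence, h.remote))
```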

6. Potential for Standardization

The general architecture of IPAs described in this document should be detailed in subsequent documents. Further work must be done to

specify the interfaces among the components
suggest new standards where they are missing
refer to existing standards where applicable
refer to existing standards as a starting point to be refined for the IPA case

Currently, the authors see the following situation at the time of writing

Component	Potentially related standards
IPA Client	(X)HTML
IPA Service	none
Dialog Manager	Voice Extensible Markup Language (VoiceXML) 2.1; State Chart XML (SCXML)
TTS	Web Speech API; Speech Synthesis Markup Language (SSML) Version 1.0; Pronunciation Lexicon Specification Version 1.0; Emotion Markup Language (EmotionML) 1.0; ToBI
ASR	Web Speech API; Speech Recognition Grammar Specification Version 1.0; Pronunciation Lexicon Specification Version 1.0; Semantic Interpretation for Speech Recognition (SISR) Version 1.0
Core Dialog	Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech Acts (DAMSL)
Core Intent Set	none
Dialog Registry	Discovery & Registration of Multimodal Modality Components
Provider Selection Service	none
Accounts/Authentication	Web Authentication; FIDO Universal Authentication Framework
NLU	EMMA: Extensible MultiModal Annotation markup language Version 2.0; JSON Representation of Semantic Information
Knowledge Graph	Web Ontology Language (OWL); Resource Description Framework (RDF)
Data Provider	none

The table above is not meant to be exhaustive nor does it claim that the identified standards are suited for IPA implementations. They must be analyzed in more detail in subsequent work. The majority are starting points for further refinement. For instance, the authors consider it unlikely that VoiceXML will actually be used in IPA implementations.

Out of scope of a possible standardization is the implementation inside the IPA Providers and potential interoperability among them. However, standardization eases the integration of their exposed services and may even allow the use of services across different providers. Actual IPA providers may make use of any upcoming standard to enhance their deployments as a marketplace of intelligent services.

7. Footnotes

¹ The Russian Doll principle is a recursion technique that is used in computer science, mathematics, logic, grammar, and art. It is a problem-solving strategy for dealing with complexity, where the same control structure always occurs on multiple, infinitely nested levels. The principle is illustrated in the form of Russian dolls (matryoshkas) that are nested such that the same homomorphic structure appears on each level. Summarized from Pfiffner, M. (2022). Russian Dolls. In: The Neurology of Business. Management for Professionals. Springer, Cham. https://doi.org/10.1007/978-3-031-14260-4_5.

8. Appendix

8.1 Acknowledgments

This version of the document was written with the participation of members of the W3C Voice Interaction Community Group. The work of the following members has significantly facilitated the development of this document:

James Larson, The Open Voice Network
Jon Stine, The Open Voice Network

8.2 Abbreviations

Abbreviation	Description
ASR	Automatic Speech Recognition
LLM	Large Language Model
NLG	Natural Language Generation
NLU	Natural Language Understanding
TTS	Text to Speech