Intelligent Personal Assistant Architecture

Architecture and Potential for Standardization Version 1.0

Latest version: Last modified: March 24, 2020 https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paArchitecture/paArchitecture.htm (GitHub repository); HTML rendered version
Editors

Copyright © 2019-2020 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.

Abstract

This documents describes a general architecture of Intelligent Personal Assistants and explores the potential for standardization. It is meant to be a first structured exploration of Intelligent Personal Assistants by idenitifying the components and their tasks. Subsequent work is expected to detail the interaction among the identified components and how they ought to perform their task as well as their actual tasks respectively. This document may need to be updated if any changes result of that detailing work.

Status of This Document

This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction

Introduction
Problem Statement
Architecture
Use Case Walk Through
Potential for Standardization

1. Introduction

Intelligent Personal Assistants (IPA)s are already available in our daily lives through our smart phones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home to be available without the need to pick up explicit devices for these sorts of tasks or even control household appliances in our homes. As of today, there is no interoperability between the available IPA providers. Especially for exchanging learned user behaviors this is unlikely to happen at all.

This document describes a general architecture of IPAs and explores the potential areas for standardization. It focuses on voice as the major input modality. However, the overall concept is not restricted to that but also covers purely text based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors define executing of actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In case of chatbots, they will be omitted. In case of multimodal interaction, they may be extended by components to recognize input from the respective modality, transform it into something meaningful and vice-versa to generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.

2. Problem Statement

Currently, users are mainly using the IPA Provider that is shipped with a certain piece of hardware. Thus, selection of a smart phone manufacturer actually determines which IPA implementation they are using. Switching among different IPA providers also involves switching a manufacturer, including high costs and getting used to some other way of operation that comes with the UX of the selected manufacturer. On the one hand users should have more freedom in selecting the IPA implementation they want. They are bound to use that service that is available in that implementation but not what they probably prefer. On the other hand, IPA providers, which are mainly producing software, must also function as hardware manufacturer to be successful. Additionally, manufacturers also have to take care to port existing services to their platform. A standardization would clearly lower the needed efforts for that and thus reduce costs.

In order to explore this, a typical usage scenario is described in the following section.

2.1 Use Cases

This section describes potential usages of IPAs.

2.1.1 Travel Planning

A user would like to plan a trip to an international conference and she needs visa information and airline reservations. She will give the intelligent personal assistant her visa information (her citizenship, where she is going, purpose of travel, etc.) and it will respond by telling her the documentation she needs, how long the process will take and what the cost will be. This may require the personal assistant to consult with an auxiliary web service or another personal assistant that knows about visas.

Once the user has found out about the visa, she tells the PA that she wants to make airline reservations. She specifies her dates of travel and airline preferences and the PA then interacts with her to find appropriate flights.

A similar process will be repeated if the user wants to book a hotel, find a rental car, or find out about local attractions in the destination city. Booking a hotel as part of attending a conference could also involve finding out about a designated conference hotel or special conference rates.

2.2 Roles and Responsibilites

Roles like user, developer, IPA supplier will be added in a future version of this document

3. Architecture

In order to cope with such use cases as described above, an IPA may need to make use of several services describing the capabilities of the IPA. These services may be selected from a standardized market place. For the reminder of this document, we consider an IPA that is extendable via such a market place. This kind of IPA features the architectural buildings blocks shown in the following figure.

This architecture comprises 3 layers that are detailed in the following sections

Client Layer
Dialog Layer
APIs / Data Layer

Actual implementations may want to distinguish more than these layers.

3.1 Client Layer

3.1.1 IPA Client

Clients enable the user to access the IPA via voice with the following characteristics.

Usually, IPA Clients make use of a microphone to capture the spoken input and a loud speaker to provide responses.
As an extension IPA Clients may also capture input via text and output text.
As an extension IPA Clients may also capture input from a specific modality recognizer.
As an extension IPA Clients may also capture contextual information, e.g. location.
As an extension an IPA Client may also receive commands to be executed locally.
As an extension an IPA Client may also receive multimodal output to be rendered by a respective modality synthesizer.

3.2 Dialog Layer

3.2.1 IPA Service

General IPA Service API that mediates between the user and the overall IPA system. The service layer may be omitted in case the IPA Client communicates directly with Dialog Management. However, this is not recommended as it may contradict the principle of seperation-of-concerns. It has the following characteristics

The IPA Service functions as an interface between the IPA Client and the Dialog Management and Provider Selection Service.
The output from the ASR is forwarded to the Provider Selection Service to determine meaning.
Alternatively, the IPA Service may receive multimodal or text input from the client and forwards it directly to the Provider Selection Service to determine meaning.
Alternatively, the output from the modality recognizers and contextual information may be forwarded directly to the Dialog Management.

3.2.2 Dialog Management

Component that receives semantic information determined from user input, updates its internal state, decides upon subsequent steps to continue a dialog and provides output mainly as synthesized or recorded utterances. It has the following characteristics

Dialog Management receives recorded voice input from the IPA Service and forwards it to the ASR
Dialog Management makes use of the TTS to generate audio data to be rendered on the IPA Client
As an extension, it may also provide commands as output to be executed by the IPA Client
As an extension Dialogs may also return multimodal output or text to be rendered by a respective modality synthesizer on the IPA Client.
For this, it employs several Dialogs that are responsible for handling isolated tasks or intents. The following types of dialogs exist:
- Core Dialog
- Dialog X
The overall set of available Dialogs defines the behavior and capabilities of the interaction with the IPA.
The Dialog Manager is also responsible for a good user experience across the available Dialogs.
The Dialog Manager determines the Dialog that is best suited to serve the current user input and re-establishes the interaction state for that Dialog. Therefore, it may use the Dialog Registry.
The Dialog Manager follows the principle to fill in all slots that are known before prompting the user for it.
The Dialog Manager also manages the session with a user. Conceptually, multiple sessions can be active in parallel. Dialogs are governed by Sessions, e.g. to free resources of ASR and NLU engines when a session expires. Linguistic phenomena, like anaphoric references and ellipsis are expected to work within a Session. The selected IPA Provider or the Dialog Manager may have leading roles for this task.
The Dialog Manager also features an ASR to convert spoken utterances into text strings and a TTS to convert text strings into audio.

3.2.3 ASR

The Automated Speech Recognizer (ASR) receives audio streams of recorded utterances and generates a recognition hypothesis as text strings. Conceptually, ASR is a modality recognizer for speech. It has the following characteristics

Optionally, the ASR can generate multiple recognition hypothesis along with a confidence score.
Optionally, the ASR can be part of the IPA Provider. In this case, the received audio streams must be forwarded to the Provider Selection Service. In this case the Core NLU must be part of an IPA Provider.
Multiple ASR instances may exist if multiple IPA Providers come with their own ASR
In case of a chatbot, this component will not be needed.

3.2.4 TTS

The Text-to-Speech (TTS) component receives text strings, which it converts into audio data. Conceptually, the TTS is a modality specifc renderer for speech. It has the following characteristics

Optionally, the TTS can also be part of the IPA Provider. In this case, the audio streams is retrieved from the Provider Selection Service.
In case the TTS is part of the IPA Provider, multiple TTS instances may exist. This may be useful in case of branding.
Multiple TTS instances may exist in parallel. In this case it is up to the current Dialog to specify the TTS engine to use.
In case of a chatbot, this component will not be needed.

3.2.5 Core Dialog

The Core Dialog is able to handle basic functionality via Core Intent Sets to enable interaction with the user at all. This includes among others

Greetings
Goodbye
Exception handling in case a requested service is not available
Exception handling in case a requested intent cannot be matched to a known Dialog
Help

Conceptually, the Core Dialog is a special Dialog as described in the following section that is always available

3.2.5.1 Dialog

A Dialog is able to handle functionality that can be added to the capabilities of the Dialog Management through their associated Intent Sets. The Dialogs must server different purpose in a sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. Dialogs have the following characteristics

Dialogs receive inputs as intents out of their supported Intent sets along with associated entities and return responses as text strings to be spoken.
Dialogs reference all Intent Sets that they need to fullfill their service.
Dialogs do not require the existence of a corresponding Intent Set.
Dialogs may specify entities from an Intent Set that are filled after their execution.
Dialogs may specify follow-up dialogs that are to be executed once execution of this dialog is completed.
Dialogs may specify clarification dialogs by name or by a list of entities from an Intent Set.
As an extension Dialogs may also return commands to be executed by the IPA Client.
As an extension Dialogs may also return multimodal output to be rendered by a respective modality synthesizer on the IPA Client.
Dialogs access the Provider Selection Service to fulfill their task. They maintain state which they also share with the Dialog Management and know which IPA Provider evaluated their request with the help of an identifier.
A Dialog may specify a TTS engine to use in case there are multiple engines available.

3.2.6 Core Intent Sets

A Core Intent Set usually identifies tasks to be executed and define the capabilities of the Core Dialog. Conceptually, the Core Intent Sets are Intent Sets that are always available.

3.2.6.1 Intent Sets

Intent Sets define actions along with their parameters that can be consumed by a corresponding Dialog and has the following characteristics

An Intent Set defines one ore more intents with an optional number (including none) of entities to fulfill the corresponding action.
It abstracts from actual Intent Sets that are defined by the Intent Providers. In case the Intent Provider is identical to the platform provider, they may match.
It can be used in one or more Dialogs.

3.2.7 Dialog X

The Dialog X are able to handle functionality that can be added to the capabilities of the Dialog Manager through their associated Intent Set X. Dialog X extends the Core Dialogs and add functionality by custom Dialogs. The Dialog X's must server different purposes in a sense that they are unique for a certain task. E.g., only a single flight reservation dialog may exist at a time. They have the same characteristics as a Dialog.

3.2.9 Intent Set X

An Intent Set X is a special Intent Set that identifies tasks that can be executed within the associated Dialog X.

3.2.10 Dialog Registry

The Dialog Registry manages all available Dialogs with their associated Intent Sets.

Dialogs and their Intent Sets can be added or removed as needed.
The Dialog Registry may notify the Dialog Management if Dialogs have been added or removed.
The Dialog Registry may be queried by the Dialog Management for Intent Sets that are referenced in a Dialog.
The Dialog Registry may be queried by the Dialog Management for follow-up or clarification Dialogs that are referenced in a Dialog by name or a list of entities from an Intent Set.
Intent Sets will be removed if there are no more Dialogs referencing them.
The Dialog Registry ensures that added Dialogs are unique
The Dialog Registry is not responsible to know about the counterparts in the APIs/Data Layer.
The Dialog Resgistry notifies the Selection Service if Dialogs have been added or removed.

3.3 APIs/Data Layer

3.3.1 Provider Selection Service

A service that provides access to all known IPA Providers. This service also maps the IPA Intent Sets to the Intent Sets in the Dialog layer. It has the following characteristics

The Provider Selection Service receives input as text strings and returns results as intents with all recognized entities from all IPA Providers that are able to reply to the user input along with associated entities.
In case the Provider Selection Service is called with a preselected IPA Providers only this one will be used.
IPA Provers and the Accounts/Authentication to access them and optionally ASR and TTS capabilities can be added or removed as needed.
The Provider Selection Service is stateless and always returns the responses from the used IPA Providers along with an identification of the issuing IPA Provider.
The Provider Selection Service makes use of the Accunts/Authentication to access IPA Provider.
The Provider Selection Services maps the Provider Intent Sets to the Intent Sets known by the Dialog Registry. The mapping must be configured when IPA Providers are added.
In case the ASR is bound to an IPA Providers the Provider Selection Service is able to consume audio stream and forward them to the available ASR engines.

3.3.2 Accounts/Authentication

A registry that knows how to access the known IPA Providers, i.e. which are available and credentials to access them. Storing of credentials must meet security and trust considerations that are expected from such a personalized service.

3.3.3 Core NLU

An NLU (Natural Language Understanding) component that is able to extract meaning as intents and associated entities from an utterance as text strings. It has the following characteristics

The Core NLU is able to handle basic functionality via Core Intent Sets to enable interaction with the user at all.
The Core NLU may make use of the Core Data Provider to access local or internal data or access external services.

3.3.4 Core Data Provider

A generic Data Provider to aid the Core NLU determining the intent.

3.3.5 IPA Provider X

A provider of an IPA service, like

Google Assistant
Amazon Alexa
Microsoft Cortana
SoundHound
…

The IPA provider may be part of the IPA implementation as an IPA Provider or alternatively a subset of the original functionaliy as described below as part of another IPA implementation.

3.3.5.1 Provider NLU

An NLU component that is able to extract meaning as intents and associated entities from an utterance as text strings for IPA Provider X. It has the following characteristics

The Provider NLU may make use of the Data Provider to access local or internal data or access external services.
The Provider NLU may make use of the Knowledge Graph to derive meaning.

3.3.5.2 Provider Intent Set

An Intent Set that might be returned by the Provider NLU to handle the capabilities of IPA Provider X.

3.3.5.3 Data Provider

A data provider to aid the Provider NLU in determining the intent. It has the following characteristics

The Data Provider provides access to
- local data,
- external data or
- external services.
The Data Provider may be used to track the IPA Provider’s state.

3.3.5.4 Knowledge Graph

A knowledge graph to reason about the detected input from the Provider NLU and Data Provider to come up with some more meaningful results.

4. Use Case Walk Through

This section expands on the use case above, filling in details according to the sample architecture.

A user would like to plan a trip to an international conference and she needs visa information and airline reservations.

The user starts by asking a general purpose assistant (IPA Client, on the left of the diagram) about what the visa requirements are for her situation. For a common situation, such as citizens of the EU traveling to the United States, the IPA is able to answer the question directly from one of its dialogs 1-n getting the information from a web service that it knows about via the corresponding Data Provider. However, for less common situations (for example, a citizen of South Africa traveling to Japan), the generic IPA will try to identify a visa expert assistant application from the dialog registry. If it finds one, it will connect the user with the visa expert, one of the IPA providers on the right side. The visa expert will then engage in a dialog with the user to find out the dates and purposes of travel and will inform the user of the visa process.

Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. If she wants to use a particular service, or use a particular airline, she would say something like "I want to book a flight on American". The IPA will then either connect the user with American's IPA or, if American doesn't have an IPA, will inform the user of that fact. On the other hand, if the user doesn't specify an airline, the IPA will find a general flight search IPA from its registry and connect the user with the IPA for that flight search service. The flight search IPA will then interact with the user to find appropriate flights.

A similar process would be repeated if the user wants to book a hotel, find a rental car, find out about local attractions in the destination city, etc. Booking a hotel could also involve interacting with the conference's IPA to find out about a designated conference hotel or special rates.

5. Potential for Standardization

The general architecture of IPAs described in this document should be detailed in subsequent documents. Further work must be done to

specify the interfaces among the components
suggest new standards where they are missing and may therefore
refer to existing standards where applicable
refer to existing standards as a starting point to be refined for the IPA case

Currently, the authors see the following situation at the time of writing

Component	Potentially related standards
IPA Client	(X)HTML
IPA Service	none
Dialog Management	Voice Extensible Markup Language (VoiceXML) 2.1
TTS	Web Speech API Speech Synthesis Markup Language (SSML) Version 1.0
ASR	Web Speech API Speech Recognition Grammar Specification Version 1.0
Core Dialog	none
Core Intent Set	none
Dialog Registry	Discovery & Registration of Multimodal Modality Components
Provider Selection Service	none
Accounts/Authentication	Web Authentication IDO Universal Authentication Framework
Core NLU	EMMA: Extensible MultiModal Annotation markup language Version 2.0 JSON Representation of Semantic Information
Data Provider	none

The table above is not meant to be exhaustive nor does it claim that the identified standards are suited for IPA implementations but must be analyzed in more detail in subsequent work. The majority of them is a starting point for further refinement. For instance, the authors consider it unlikely that VoiceXML will actually be used in IPA implementations.

Out of scope of a possible standardization is the implementation inside the IPA Providers and a potential interoperability among them. However, it eases the the integration of their exposed services or even allow to use services across different providers. Actual IPA providers may make use of any upcoming standard to enhance their deployments as a market place of intelligent services.