W3C

Intelligent Personal Assistant Architecture

Intelligent Personal Assistant Interfaces

Latest version
https://github.com/w3c/voiceinteraction/blob/master/voice%20interaction%20drafts/paInterfaces/paInterfaces.htm (GitHub repository; an HTML rendered version is also available)
Last modified: December 2, 2024
Editors
Dirk Schnelle-Walka
Deborah Dahl, Conversational Technologies

Abstract

This document details the general architecture of Intelligent Personal Assistants as described in Architecture and Potential for Standardization Version 1.3 with regard to interface definitions. The architectural descriptions focus on intent-based, voice-based personal assistants and chatbots. Current LLM intent-less chatbots may have other interface needs.

Status of This Document

This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

Table of Contents

  1. Introduction
  2. Problem Statement
  3. Architecture
  4. High Level Data Structures and Interfaces
  5. Low Level Interfaces

1. Introduction

Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smartphones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more help us with various tasks, like shopping, playing music, setting a schedule, sending messages, and answering simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which are available without the need to pick up a dedicated device for these sorts of tasks and can even control household appliances in our homes. As of today, there is no interoperability among the available IPA providers. Interoperability is especially unlikely when it comes to exchanging learned user behaviors.

Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform.

This document is the second step in our strategy for IPA standardization. It is based on the general architecture of IPAs described in Architecture and Potential for Standardization Version 1.3, which aims at exploring the potential areas for standardization. It focuses on voice as the major input modality. The overall concept is not restricted to voice but also covers purely text-based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In the case of chatbots, speech components will be omitted. In the case of multimodal interaction, interaction modalities may be extended by components that recognize input from the respective modality and transform it into something meaningful, and vice versa to generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.

In this second step we describe the interfaces of the general architecture of IPAs in Architecture and Potential for Standardization Version 1.3. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment.

In order to cope with use cases such as those described above, an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.

Interfaces and data structures are described with the help of UML diagrams. We expect the reader to be familiar with that notation, although most concepts are easy to understand and do not require in-depth knowledge. The main diagram types used in this document are component diagrams and sequence diagrams. The UML diagrams are provided as the Enterprise Architect model pa-architecture.EAP. They can be viewed with the free-of-charge tool EA Lite.

2. Problem Statement

2.1 Use Cases

This section describes potential usages of IPAs that will be used later in the document to illustrate the usage of the specified interfaces.

2.1.1 Weather Information

A user located in Berlin, Germany, is planning to visit her friend, who lives a few kilometers away, the next day. As she is considering taking her bike, she asks the IPA about the weather conditions.

2.1.2 Flight Reservation

A user located in Berlin, Germany, would like to plan a trip to an international conference, and she wants to book a flight to the conference in San Francisco. Therefore, she approaches the IPA to help her with booking the flight.

3. Architecture

3.1 Architectural Principle

The architecture described in this document follows the SOLID principles introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution. A non-normative sketch of how these principles map to IPA components follows the list below.

Single responsibility principle
The components should have only one clearly-defined responsibility.
Open closed principle
Components should be open for extension, but closed for modification.
Liskov substitution principle
Components may be replaced without impacting the basic system behavior.
Interface segregation principle
Many specific interfaces are better than one general-purpose interface.
Dependency inversion principle
High-level components should not depend on low-level components. Both should depend on their interfaces.
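
As a purely illustrative, non-normative sketch (all names are hypothetical and not part of this specification), the dependency inversion and interface segregation principles could translate into component code along the following lines, where a high-level dialog component depends only on narrow interfaces rather than on concrete speech components:

// Purely illustrative sketch; all names are hypothetical. A high-level dialog
// component depends only on narrow interfaces (interface segregation,
// dependency inversion), so any conforming implementation can be substituted
// (Liskov substitution) without modifying the component (open closed principle).
interface SpeechRecognizer {
  recognize(audio: ArrayBuffer): Promise<string>;
}

interface NaturalLanguageUnderstanding {
  interpret(utterance: string): Promise<{ intent: string; confidence?: number }>;
}

class DialogManager {
  constructor(
    private readonly asr: SpeechRecognizer,
    private readonly nlu: NaturalLanguageUnderstanding
  ) {}

  async handleAudio(audio: ArrayBuffer): Promise<string> {
    const text = await this.asr.recognize(audio);
    const { intent } = await this.nlu.interpret(text);
    return intent; // a real dialog manager would decide on the next dialog step here
  }
}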

This architecture aims to accommodate both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text to speech), and newer LLM (Large Language Model) based approaches. This architecture does not rule out combining some of these components in specific systems.

3.2 Main Use Cases

Among others, the following popular high-level use cases for IPAs are to be supported:

  1. Question Answering or Information Retrieval
  2. Executing local and/or remote services to accomplish tasks

This is supported by a flexible architecture that allows dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible to include other IPAs with the same architecture and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.

The following table lists the IPA main use cases and related examples that are used in this document:

Main Use Case | Example
Question Answering or Information Retrieval | Weather information
Executing local and/or remote services to accomplish tasks | Flight reservation

These main use cases are shown in the following figure

Main IPA Use Cases
Fig. 1 Main IPA Use Cases

Not all components may be needed for actual implementations; some may be omitted completely. In particular, LLM-based architectures may combine the functionality of multiple components into only one or a few components. However, we note them here to provide a more complete picture.

The architecture comprises three layers that are detailed in the following sections:

  1. Client Layer
  2. Dialog Layer
  3. External Data / Services / IPA Providers

Actual implementations may want to distinguish more or fewer layers than these. The assignment to the layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This assignment reflects the view that the Community Group regards as ideal and shows the intended separation of concerns.

IPA Major Components
Fig. 2 IPA Major Components

These components are assigned to the packages shown below.

IPA Package Hierarchy
Fig. 3 IPA Package Hierarchy

3.3 Component Chaining

The most flexible way to combine the components is to chain them: the output of one component is the input of the next component. Therefore, we introduce the concept of an IPADataProcessor that may receive IPAData from incoming IPADataProcessors and forward IPAData to outgoing IPADataProcessors. The classes for the major components are shown in figure 4 a.

IPA Data Processors
Fig. 4 a) IPA Data Processor Examples
IPA Data Processing Chain
Fig. 4 b) Example IPAData Processing Chain

Actual implementations may link them together as needed. An example is shown in figure 4 b. In this example four IPADataProcessors are shown: Client, IPAService, ProviderSelectionService, and DialogManager. In the actual chain, IPAService and Client appear twice, once at the beginning and once at the end of the chain. This chain is then used to hand IPAData generated at the beginning of the chain towards the end. For example, the IPADataProcessor Client generates the IPAData MultimodalInputs, which in this example is expected to contain only data for the single modality text. This is forwarded to the next IPADataProcessor, IPAService, for further processing.
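
To make the chaining concept concrete, the following non-normative TypeScript sketch shows one possible reading of the IPADataProcessor concept and the forward part of the example chain of figure 4 b. The method names (addOutgoing, accept, process) are hypothetical and not taken from the UML model.

// Non-normative sketch of the IPADataProcessor chaining concept. IPAData is
// treated as an opaque marker type; the method names are hypothetical.
interface IPAData {}

class MultimodalInputs implements IPAData {
  constructor(public readonly text: string) {} // only the single modality "text" in this example
}

abstract class IPADataProcessor {
  private readonly outgoing: IPADataProcessor[] = [];

  // connect an outgoing processor that will receive the forwarded IPAData
  addOutgoing(processor: IPADataProcessor): void {
    this.outgoing.push(processor);
  }

  // receive IPAData from an incoming processor, process it, and forward the result
  accept(data: IPAData): void {
    const processed = this.process(data);
    this.outgoing.forEach((next) => next.accept(processed));
  }

  protected abstract process(data: IPAData): IPAData;
}

// Example processors from figure 4 a); the actual processing is omitted.
class Client extends IPADataProcessor {
  protected process(data: IPAData): IPAData { return data; }
}
class IPAService extends IPADataProcessor {
  protected process(data: IPAData): IPAData { return data; }
}
class ProviderSelectionService extends IPADataProcessor {
  protected process(data: IPAData): IPAData { return data; }
}
class DialogManager extends IPADataProcessor {
  protected process(data: IPAData): IPAData { return data; }
}

// Forward part of the example chain of figure 4 b); in the full example the
// response would travel back through a second IPAService and Client link.
const client = new Client();
const service = new IPAService();
const providerSelection = new ProviderSelectionService();
const dialogManager = new DialogManager();

client.addOutgoing(service);
service.addOutgoing(providerSelection);
providerSelection.addOutgoing(dialogManager);

client.accept(new MultimodalInputs("What will the weather be like tomorrow?"));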

4. High Level Data Structures and Interfaces

This section details the interfaces from the figure shown in the architecture. The interfaces are described with the following attributes:

name
Name of the attribute
type
Indicates whether this attribute is a single data item or a category. The exact data types of the attributes are left open for now. A category may contain other categories or data items.
description
A short description to illustrate the purpose of this attribute.
required
Flag, if this attribute is required to be used in this interface.

A typical flow for the high level interfaces is shown in the following figure.

IPA Major Components Interaction
Fig. 5 IPA Major Component Interaction

This sequence supports the major use cases stated above.

4.1 IPARequest

This data structure describes the data that is sent from the IPA Client to the IPA Service and reused inside the IPA. The following table details the corresponding data elements.

name | type | description | required
session id | data item | unique identifier of the session | yes, if obtained
request id | data item | unique identifier of the request within a session | yes
audio data | data item | encoded or raw audio data | yes
multimodal input | category | input that has been received from modality recognizers, e.g., text, gestures, pen input, ... | no
meta data | category | data augmenting the request, e.g., user identification, timestamp, location, ... | no

The session id can be created by the IPA Service. It may not be known to the client for the first request; in this case, this field is simply left empty. The IPA Service may maintain a session id, e.g., to serve multiple clients and allow them to be distinguished. In these cases a session id is provided that must be used in subsequent calls by the IPA Client.

The IPA Client maintains a request id for each request that is sent. These ids must be unique within a session.

Audio data can be delivered mainly in two ways:

  1. Endpointed audio data
  2. Streamed audio data

For endpointed audio data the IPA Client determines the end of speech, e.g., with the help of voice activity detection. In this case only that portion of audio is sent that contains the potential spoken user input. In terms of user experience this means that processing of the user input can only happen after the end of speech has been detected.

For streamed audio data, the IPA Client starts sending audio data as soon as it has been detected that the user is speaking to the system with the help of the Client Activation Strategy. In terms of user experience this means that processing of the user input can happen while the user is speaking.

An audio codec may be used, e.g., to reduce the amount of data to be transferred. The selection of the codec is not part of this specification.

Optionally, multimodal input can be transferred that has been captured by a specific modality recognizer. This covers all modalities other than audio, e.g., text for a chatbot, or gestures.

Optionally, meta data may be transferred augmenting the input. Examples of such data include user identification, timestamp and location.
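
As a non-normative companion to the table above, an IPARequest could be typed roughly as follows. The field names follow the JSON examples in section 4.6; the exact data types are intentionally left open by this specification, and the "Streamed" type value is an assumption based on the prose above.

// Non-normative sketch of an IPARequest derived from the table above.
// Field names follow the JSON examples in section 4.6; the "Streamed" value
// is an assumption, not a defined constant.
interface IPARequest {
  sessionId?: string;                   // left empty/omitted on the very first request
  requestId: string;                    // unique within a session, maintained by the IPA Client
  audio: {
    type: "Endpointed" | "Streamed";
    data: string;                       // e.g. base64-encoded audio; could also be a reference
    encoding: string;                   // codec selection is out of scope of this specification
  };
  multimodal?: Record<string, unknown>; // input from other modality recognizers
  meta?: Record<string, unknown>;       // e.g. user identification, timestamp, location
}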

4.2 IPAResponse

As a return value this data structure describes the data that is sent from the IPA Service to the IPA Client. The following table details the corresponding data elements.

name | type | description | required
session id | data item | unique identifier of the session | yes, if obtained
request id | data item | unique identifier of the request within a session | yes
audio data | data item | encoded or raw audio data | yes
multimodal output | category | output that has been received from modality synthesizers, e.g., text, command to execute an observable action, ... | no

In case the parameter multimodal output contains commands to be executed, they are expected to follow the specification of the Interface Service Call.
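
A corresponding non-normative sketch of an IPAResponse is shown below; it mirrors the table above and the JSON examples in section 4.6. The meta field appears in those examples but is not listed in the table, and the exact data types are left open by this specification.

// Non-normative sketch of an IPAResponse, mirroring the table above and the
// JSON examples in section 4.6. The meta field appears in the examples only.
interface IPAResponse {
  sessionId?: string;
  requestId: string;
  audio: {
    type: string;
    data: string;
    encoding: string;
  };
  multimodal?: Record<string, unknown>; // e.g. text, or commands following the Interface Service Call
  meta?: Record<string, unknown>;
}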

4.3 ExternalIPAResponse

This data structure describes the data that is returned by an external IPA. In traditional NLU systems it is sent from the Provider Selection Service to the NLU and Dialog Management. The following table details the data that should be considered for this interface in the method ExternalClientResponse.

name | type | description | required
session id | data item | unique identifier of the session | yes, if the IPA requires the usage
request id | data item | unique identifier of the request within a session | yes
call result | data item | success or failure | yes
multimodal output | category | output that has been received from an external IPA | yes, if no interpretation is provided and no error occurred
interpretation | category | meaning as intents and associated entities | yes, if no multimodal output is provided and no error occurred
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed

The parameters session id and request id are copies of the data received from the Interface Client Input.

This call is optional, depending on whether external IPAs are used.

Depending on the capabilities of the external IPA, the return value may contain multimodal output, an interpretation, or an error, as listed in the table above. The category interpretation may be single-intent or multi-intent, again depending on the capabilities of the external IPA.

With single-intent the user provides a single intent per utterance. An example of single-intent is "Book a flight to San Francisco for tomorrow morning." The single intent here is book-flight. With multi-intent the user provides multiple intents in a single utterance. An example of multi-intent is "How is the weather in San Francisco and book a flight for tomorrow morning." The provided intents are check-weather and book-flight. In this case the IPA needs to determine the order of intent execution based on the structure of the utterance. If the intents are not to be executed in parallel, the IPA will trigger the next intent in the identified order.

As multi-intent is not very common in today's IPAs, the focus for now is on single-intent, as detailed in the following table and in the non-normative sketch below it:

name | data type | description | required
interpretation | list | list of meaning as intents and associated entities | yes
intent | string | group of utterances with similar meaning | yes
intent confidence | float | confidence value for the intent in the range [0,1] | no
entities | list | list of entities associated to the intent | no
name of the entity | string | additional information to the intent | no
entity confidence | float | confidence value for the entity in the range [0,1] | no
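
The following non-normative sketch shows one possible shape of a single-intent interpretation. The field names follow the ExternalIPAResponse example in section 4.6.1; the example values are invented for illustration.

// Non-normative sketch of a single-intent interpretation. Field names follow
// the ExternalIPAResponse example in section 4.6.1; all values are invented.
interface Entity {
  // entity name/value pairs, e.g. { location: "Berlin" }, optionally with an
  // entityConfidence value in the range [0, 1]
  [name: string]: string | number;
}

interface Interpretation {
  intent: string;            // group of utterances with similar meaning
  intentConfidence?: number; // in the range [0, 1]
  entities?: Entity[];       // entities associated to the intent
}

// "Book a flight to San Francisco for tomorrow morning" could yield:
const bookFlight: Interpretation = {
  intent: "book-flight",
  intentConfidence: 0.9,
  entities: [
    { location: "San Francisco", entityConfidence: 1.0 },
    { date: "tomorrow morning", entityConfidence: 0.8 },
  ],
};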

4.4 External Service Call

This interface describes the data that is sent from the Dialog to the Provider Selection Service. The following table details the data that should be considered for this interface in the method callService.

name | type | description | required
session id | data item | unique identifier of the session | yes, if the IPA requires the usage
request id | data item | unique identifier of the request within a session | yes
service id | data item | id of the service to be executed | yes
parameters | data item | parameters to the service call | no

As a return value the result of this call is sent back in the ClientResponse.

name | type | description | required
session id | data item | unique identifier of the session | yes, if the IPA requires the usage
request id | data item | unique identifier of the request within a session | yes
service id | data item | id of the service that was executed | yes
call result | data item | success or failure | yes
call result details | data item | detailed information in case of a failed service call | no
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed

This call is optional; whether an external service should be called depends on the result of the next dialog step.
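
For illustration only, the two tables above could be rendered as the following types. The field names mirror the JSON examples in section 4.6.1; the "success" / "failure" string values are an assumption.

// Non-normative sketch of the callService request and its result, derived
// from the two tables above.
interface ServiceCallRequest {
  sessionId?: string;
  requestId: string;
  serviceId: string;                            // id of the service to be executed
  parameters?: Record<string, unknown>;         // parameters to the service call
}

interface ServiceCallResult {
  sessionId?: string;
  requestId: string;
  serviceId: string;                            // id of the service that was executed
  callResult: "success" | "failure";
  callResultDetails?: Record<string, unknown>;  // detailed information, e.g. forecast data
  error?: {                                     // see section 4.5 Error Handling
    errorCode: string;
    errorMessage: string;
    componentId: string;
  };
}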

4.5 Error Handling

Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.

Along the processing path, errors may occur:

  1. in the response of a call to another component
  2. inside the component itself, to be further processed by subsequent components

Error messages carry the following information (a non-normative example follows the table):

name | type | description | required
error code | data item | unique error code that could be transformed into an IPA response matching the language and conversation | yes
error message | data item | human-readable error message for logging and debugging | yes
component id | data item | id of the component that has produced or handled the error | yes
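
As a purely illustrative example, an error raised by a hypothetical speech recognition component could look as follows; the field names mirror the table above, and all values are invented.

// Hypothetical error message following the table above; all values are invented.
const recognizerError = {
  errorCode: "ASR_NO_SPEECH",        // could be mapped to an IPA response matching the language
  errorMessage: "No speech detected before the configured timeout", // for logging and debugging
  componentId: "asr",                // component that produced or handled the error
};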

4.6 Examples

The following sections provide examples using the JSON format to illustrate the usage of the above-mentioned data structures and interfaces. JSON is only chosen because it is easy to understand and read. This specification does not make any assumptions about the underlying programming languages or data format. The examples are just meant to illustrate how responses may be generated with the provided data. It is not required that implementations follow exactly the described behavior. It is also not required that JSON is used at all.

4.6.1 Example Weather Information for the High Level Interfaces and Data Structures

The following IPARequest sends endpointed audio data with the user's current location to query for tomorrow's weather with the utterance "What will the weather be like tomorrow?".

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "audio": {
        "type": "Endpointed",
        "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3",
        "encoding": "PCM-16BIT"
    },
    "multimodal": {
        "location": {
            "latitude": 52.51846213843821,
            "longitude": 13.37872252544883338.897957
        }
        ...
    },
    "meta": {
        "timestamp": "2022-12-01T18:45:00.000Z"
        ...
    }
}

In this example endpointed audio data is transferred as a value. There are other ways to send the audio data to the IPA, e.g., as a reference. This way is chosen as it is easier to illustrate the usage.

In return an external IPA may send back the following ExternalIPAResponse to the Dialog.

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "callResult": "success",
    "interpretation": [
        {
            "intent": "check-weather",
            "intentConfidence": 0.9,
            "entities": [
                {
                    "location": "Berlin",
                    "entityConfidence": 1.0
                },
                {
                    "date": "2022-12-02",
                    "entityConfidence": 0.94
                }
            ]
        },
        ... 
    ]
}

The external speech recognizer converts the obtained audio into text like "How will the weather be tomorrow". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.

This is illustrated in the following figure.

Processing Input of the check weather example

The following request to callService may be made to call the weather information service to actually obtain the requested information. Although calling the weather service is not a direct functionality of the IPA, it may help to understand how the entered data may be processed to obtain a spoken reply to the user's input.

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "services": [
        {
            "serviceId": "weather-service",
            "parameters": [
                {
                    "location": "Berlin",
                    "date": "2022-12-02"
                }
            ]
        },
        ... 
    ]
}

In return, the external service may send back the following ExternalIPAResponse to the Dialog.

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "callResult": "success",
    "services": [
        {
            "serviceId": "weather-information",
            "callResult": "success",
            "callResultDetails": [
                {
                    "location": "Berlin",
                    "date": "2022-12-02",
                    "forecast": "snow showers",
                    "minTemperature": -1,
                    "maxTemperature": 0,
                    ...
                }
            ]
        },
        ... 
    ]
}

In return the IPA may send back the following response "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees" via an IPAResponse to the Client.

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "audio": {
        "type": "Endpointed",
        "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2",
        "encoding": "PCM-16BIT"
    },
    "multimodal": {
        "text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees."
        ...
    },
    "meta": {
        ...
    }
}

4.6.2 Example Flight Reservation for the High Level Interfaces and Data Structures

The following IPARequest sends endpointed audio data with the user's current location to book a flight with the utterance "I want to fly to San Francisco".

{
    "sessionId": "0c27895c-644d-11ed-81ce-0242ac120002",
    "requestId": "15",
    "audio": {
        "type": "Endpointed",
        "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3",
        "encoding": "PCM-16BIT"
    },
    "multimodal": {
        "location": {
            "latitude": 52.51846213843821,
            "longitude": 13.37872252544883338.897957
        }
        ...
    },
    "meta": {
        "timestamp": "2022-11-14T19:50:00.000Z"
        ...
    }
}

The external speech recognizer converts the obtained audio into text like "I want to fly to San Francisco". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.

This is illustrated in the following figure.

Processing Input of the flight reservation example

Further steps will be needed to convert both location entities to origin and destination in the actual reply. This may be done either by the flight reservation IPA directly or by calling external services beforehand to determine the nearest airports to these locations.

In return, the IPA may send back the following response "When do you want to fly from Berlin to San Francisco?" via an IPAResponse to the Client.

{
    "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002",
    "requestId": "42",
    "audio": {
        "type": "Endpointed",
        "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2",
        "encoding": "PCM-16BIT"
    },
    "multimodal": {
        "text": "When do you want to fly from Berlin to San Francisco?"
        ...
    },
    "meta": {
        ...
    }
}

5. Low Level Interfaces

This section is still under preparation.

5.1. Client Layer

The Client Layer contains the main components that interface with the user.

Client Component

5.1.1 IPA Client

Clients enable the user to access the IPA via voice. The following diagram provides some more insight.

IPA Client

5.1.1.1 Modality Manager

The modality manager enables access to the modalities that are supported by the IPA Client. Major modalities are voice and, in the case of chatbots, text. The supported interfaces are still under preparation.

5.1.1.2 Client Activation Strategy

The Client Activation Strategy defines how the client is activated so that it is ready to receive spoken commands as input. In turn, the microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below.

Client Activation Strategy | Description
Push-to-talk | The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application.
Hotword | In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. This hotword is usually not part of the spoken command that is passed for further evaluation.
Local Data Providers | In this case, a change in the environment may activate the client, for example if the user enters a room.
... | ...

The usage of hotwords has privacy implications, as the microphone needs to be always active. Streaming to components outside the user's control should be avoided; hence, detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, hotword detection might need to be extensible.
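
As a non-normative illustration of concurrently usable activation strategies, the following sketch races several strategies and opens the microphone for whichever fires first. All interface and class names are hypothetical and not taken from the UML model.

// Non-normative sketch; names are hypothetical.
interface ClientActivationStrategy {
  // resolves when the strategy decides the client should start recording;
  // may report an IPA Provider preselected by a provider-specific hotword
  awaitActivation(): Promise<{ preselectedProvider?: string }>;
}

class PushToTalkStrategy implements ClientActivationStrategy {
  awaitActivation(): Promise<{ preselectedProvider?: string }> {
    return new Promise(() => { /* resolve when the push-to-talk button is pressed */ });
  }
}

class HotwordStrategy implements ClientActivationStrategy {
  // hotword -> optionally preselected IPA Provider id; detection should ideally
  // run locally to avoid streaming audio outside the user's control
  constructor(readonly hotwords: Map<string, string | undefined>) {}
  awaitActivation(): Promise<{ preselectedProvider?: string }> {
    return new Promise(() => { /* resolve on local hotword detection */ });
  }
}

// Strategies are not exclusive: whichever strategy fires first opens the microphone.
function waitForActivation(strategies: ClientActivationStrategy[]) {
  return Promise.race(strategies.map((s) => s.awaitActivation()));
}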

5.2 Dialog Layer

The Dialog Layer contains the main components to drive the interaction with the user.

Dialog Component

5.2.1 IPA Service

5.2.2 ASR

5.2.3 NLU

5.2.4 Dialog Management

5.3 External Data / Services / IPA Providers

External Data / Services / IPA Providers Component

5.3.1 Provider Selection Service