Copyright © 2022-2024 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document details the general architecture of Intelligent Personal Assistants as described in Architecture and Potential for Standardization Version 1.3 with regard to interface definitions. The architectural descriptions focus on intent-based voice-based personal assistants and chatbots. Current LLM intent-less chatbots may have other interface needs.
This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smartphones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which handle these sorts of tasks without the need to pick up a dedicated device and can even control household appliances in our homes. As of today, there is no interoperability among the available IPA providers, and especially for exchanging learned user behaviors this is unlikely to happen at all.
Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform.
This document is the second step in our strategy for IPA standardization. It is based on the general architecture of IPAs described in Architecture and Potential for Standardization Version 1.3, which aims at exploring the potential areas for standardization. It focuses on voice as the major input modality. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment. The overall concept is not restricted to voice but also covers purely text-based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In the case of chatbots, speech components will be omitted. In the case of multimodal interaction, interaction modalities may be extended by components that recognize input from the respective modality and transform it into something meaningful, and, vice versa, generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.
In this second step we describe the interfaces of the general architecture of IPAs in Architecture and Potential for Standardization Version 1.3. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment.
In order to cope with such use cases as those described above an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.
Interfaces are described with the help of UML diagrams. We expect the reader to be familiar with that notation, although most concepts are easy to understand and do not require in-depth knowledge. The main diagram types used in this document are component diagrams and sequence diagrams. The UML diagrams are provided as the Enterprise Architect model pa-architecture.EAP. They can be viewed with the free-of-charge tool EA Lite.
This section describes potential usages of IPAs that will be used later in the document to illustrate the usage of the specified interfaces.
A user located in Berlin, Germany, is planning to visit her friend a few kilometers away the next day. As she is considering taking the bike, she asks the IPA for the weather conditions.
A user located in Berlin, Germany, would like to plan a trip to an international conference and wants to book a flight to the conference in San Francisco. Therefore, she approaches the IPA to help her with booking the flight.
The architecture described in this document follows the SOLID principles introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution.
This architecture aims at following both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text to speech), and newer LLM (Large Language Model) based approaches. This architecture does not rule out combining some of these components in specific systems.
Among others, the following most popular high-level use cases for IPAs are to be supported
This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources, such as data providers. Moreover, it is possible to include other IPAs with the same architecture and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.
The following table lists the IPA main use cases and related examples that are used in this document
Main Use Case | Example |
---|---|
Question Answering or Information Retrieval | Weather information |
Executing local and/or remote services to accomplish tasks | Flight reservation |
These main use cases are shown in the following figure
Not all components may be needed for actual implementations; some may be omitted completely. In particular, LLM-based architectures may combine the functionality of multiple components into only one or a few components. However, we note them here to provide a more complete picture.
The architecture comprises three layers that are detailed in the following sections
Actual implementations may want to distinguish more or fewer layers than these. The assignment of components to layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This view reflects the separation of concerns that the Community Group regards as ideal.
Accordingly, the components are assigned to the packages shown below.
The most flexible way to combine the components is to chain them: the output of one component is the input of the next component. Therefore, we introduce the concept of an IPADataProcessor that may receive IPAData from incoming IPADataProcessors and forward IPAData to outgoing IPADataProcessors. The classes for the major components are shown in figure 4 a.
Actual implementations may link them together as needed. An example is shown in figure 4 b. In this example four IPADataProcessors are shown: Client, IPAService, ProviderSelectionService, and DialogManager. In the actual chain, IPAService and Client appear twice, once at the beginning and once at the end of the chain. This chain is then used to hand IPAData generated at the beginning of the chain towards the end. For example, the IPADataProcessor Client generates the IPAData MultimodalInputs. In this example it is expected to contain only data for the single modality text. This is forwarded to the next IPADataProcessor, IPAService, for further processing.
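As a minimal, non-normative illustration, the chaining concept could be sketched in TypeScript as follows. Only the names IPAData and IPADataProcessor come from the architecture; the method signatures and the payload shape are assumptions made for this sketch.

```typescript
// Minimal sketch of the IPADataProcessor chaining concept.
// Only the names IPAData and IPADataProcessor come from the architecture;
// the method signatures and the payload shape are illustrative assumptions.

interface IPAData {
  sessionId?: string;
  requestId: string;
  // e.g., MultimodalInputs on the way in, a response on the way out
  payload: Record<string, unknown>;
}

interface IPADataProcessor {
  // Receives IPAData from the incoming processor and returns the
  // (possibly transformed) IPAData for the outgoing processor.
  process(data: IPAData): IPAData;
}

// Trivial processors that only annotate the data as it passes through.
class NamedProcessor implements IPADataProcessor {
  constructor(private readonly name: string) {}

  process(data: IPAData): IPAData {
    console.log(`${this.name} handles request ${data.requestId}`);
    return { ...data, payload: { ...data.payload, lastProcessor: this.name } };
  }
}

const client = new NamedProcessor("Client");
const ipaService = new NamedProcessor("IPAService");
const providerSelection = new NamedProcessor("ProviderSelectionService");
const dialogManager = new NamedProcessor("DialogManager");

// The chain from figure 4 b: IPAService and Client appear twice,
// once at the beginning and once at the end of the chain.
const chain: IPADataProcessor[] = [
  client,
  ipaService,
  providerSelection,
  dialogManager,
  ipaService,
  client,
];

// The Client generates MultimodalInputs (here only the single modality text)
// which are handed from the beginning of the chain towards the end.
const input: IPAData = {
  requestId: "42",
  payload: { text: "What will the weather be like tomorrow?" },
};
chain.reduce((data, processor) => processor.process(data), input);
```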
This section details the interfaces from the figure shown in the architecture. The interfaces are described with the following attributes
A typical flow for the high-level interfaces is shown in the following figure.
This sequence supports the major use cases stated above.
This interface describes the data that is sent from the IPA Client to the IPA Service. The following table details the data that should be considered for this interface in the method processInput
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if obtained |
request id | data item | unique identifier of the request within a session | yes |
audio data | data item | encoded or raw audio data | yes |
multimodal input | category | input that has been received from modality recognizers, e.g., text, gestures, pen input, ... | no |
meta data | category | data augmenting the request, e.g., user identification, timestamp, location, ... | no |
The session id can be created by the IPA Service. In case a session id is provided, it must be used for subsequent calls.
The IPA Client maintains a request id for each request that is sent via this interface. These ids must be unique within a session.
Audio data can be delivered mainly in two ways:
For endpointed audio data, the IPA Client determines the end of speech, e.g., with the help of voice activity detection. In this case only the portion of audio that contains the potential spoken user input is sent. In terms of user experience this means that processing of the user input can only happen after the end of speech has been detected.
For streamed audio data, the IPA Client starts sending audio data as soon as it has been detected that the user is speaking to the system with the help of the Client Activation Strategy. In terms of user experience this means that processing of the user input can happen while the user is speaking.
An audio codec may be used, e.g., to reduce the amount of data to be transferred. The selection of the codec is not part of this specification.
Optionally, multimodal input that has been captured by a specific modality recognizer can be transferred. Here, modalities refers to all modalities other than audio, e.g., text for a chatbot, or gestures.
Optionally, meta data may be transferred augmenting the input. Examples of such data include user identification, timestamp and location.
The IPA Service may maintain a session id, e.g., to serve multiple clients and allow them to be distinguished.
As a return value this interface describes the data that is sent from the IPA Service to the IPA Client. The following table details the data that should be considered for this interface in the ClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if obtained |
request id | data item | unique identifier of the request within a session | yes |
audio data | data item | encoded or raw audio data | yes |
multimodal output | category | output that has been received from modality synthesizers, e.g., text, command to execute an observable action, ... | no |
In case the parameter multimodal output contains commands to be executed, they are expected to follow the specification of the Interface Service Call.
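The tables above can be summarized in the following non-normative TypeScript sketch of the request and return data. The field names follow the tables and the JSON examples below; the concrete types and the processInput signature are assumptions.

```typescript
// Sketch of the data in Interface Client Input and its ClientResponse,
// derived from the tables above. Field names follow the tables and the
// JSON examples below; the concrete TypeScript types are assumptions.

interface AudioData {
  type: "Endpointed" | "Streamed";
  data: string;     // encoded or raw audio, e.g., base64; codec selection is out of scope
  encoding: string; // e.g., "PCM-16BIT"
}

interface ClientInput {
  sessionId?: string; // required once it has been obtained from the IPA Service
  requestId: string;  // maintained by the IPA Client, unique within a session
  audio: AudioData;
  multimodal?: Record<string, unknown>; // e.g., text, gestures, pen input
  meta?: Record<string, unknown>;       // e.g., user identification, timestamp, location
}

interface ClientResponse {
  sessionId?: string;
  requestId: string;
  audio: AudioData;
  multimodal?: Record<string, unknown>; // e.g., text or commands following Interface Service Call
}

// Hypothetical signature of the IPA Service entry point described above.
interface IPAService {
  processInput(input: ClientInput): Promise<ClientResponse>;
}
```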
The following sections will provide examples using the JSON format to illustrate the interfaces. JSON is only chosen as it is easy to understand and read. This specification does not make any assumptions about the underlying programming languages or data format. They are just meant to be an illustration of how responses may be generated with the provided data. It is not required that implementations follow exactly the described behavior.
The following request to processInput sends endpointed audio data with the user's current location to query for tomorrow's weather with the utterance "What will the weather be like tomorrow?".
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3", "encoding": "PCM-16BIT" } "multimodal": { "location": { "latitude": 52.51846213843821, "longitude": 13.37872252544883338.897957 } ... }, "meta": { "timestamp": "2022-12-01T18:45:00.000Z" ... } }
In this example endpointed audio data is transferred as a value. There are other ways to send the audio data to the IPA, e.g., as a reference. This way is chosen as it is easier to illustrate the usage.
In return, the IPA may send back the following response "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees" via ClientResponse to the Client.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2", "encoding": "PCM-16BIT" } "multimodal": { "text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees." ... }, "meta": { ... } }
The following request to processInput sends endpointed audio data with the user's current location to book a flight with the utterance "I want to fly to San Francisco".
{ "sessionId": "0c27895c-644d-11ed-81ce-0242ac120002", "requestId": "15", "audio": { "type": "Endpointed", "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3", "encoding": "PCM-16BIT" } "multimodal": { "location": { "latitude": 52.51846213843821, "longitude": 13.37872252544883338.897957 } ... }, "meta": { "timestamp": "2022-11-14T19:50:00.000Z" ... } }
In return, the IPA may send back the following response "When do you want to fly from Berlin to San Francisco?" via ClientResponse to the Client.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2", "encoding": "PCM-16BIT" } "multimodal": { "text": "When do you want to fly from Berlin to San Francisco?" ... }, "meta": { ... } }
This interface describes the data that is sent to the Provider Selection Service. The input is a copy of the data that is sent from the IPA Client to the IPA Service in Interface Client Input. This interface mainly differs in the return value. The following table details the data that should be considered for this interface in the method processInput.
As a return value, this interface describes the data that is sent from the Provider Selection Service to the NLU and Dialog Management. The following table details the data that should be considered for this interface in the method ExternalClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
call result | data item | success or failure | yes |
multimodal output | category | output that has been received from an external IPA | yes, if no interpretation is provided and no error occurred |
interpretation | category | meaning as intents and associated entities | yes, if no multimodal output is provided and no error occurred |
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed |
The parameters named session id and request id are copies of the data received from the Interface Client Input.
This call is optional, depending on whether external IPAs are used.
Depending on the capabilities of the external IPA, the return value may be one of the following options:
The category interpretation may be one of the following options, depending on the capabilities of the external IPA
With single-intent, the user provides a single intent per utterance. An example of single-intent is "Book a flight to San Francisco for tomorrow morning." The single intent here is book-flight. With multi-intent, the user provides multiple intents in a single utterance. An example of multi-intent is "How is the weather in San Francisco and book a flight for tomorrow morning." The provided intents are check-weather and book-flight. In this case the IPA needs to determine the order of intent execution based on the structure of the utterance. If the intents are not executed in parallel, the IPA triggers the next intent in the identified order.
As multi-intent is not very common in today's IPAs, the focus for now is on single-intent, as detailed in the following table; a non-normative sketch follows the table.
name | data type | description | required |
---|---|---|---|
interpretation | list | list of meaning as intents and associated entities | yes |
intent | string | group of utterances with similar meaning | yes |
intent confidence | float | confidence value for the intent in the range [0,1] | no |
entities | list | list of entities associated to the intent | no |
name of the entity | string | additional information to the intent | no |
entity confidence | float | confidence value for the entity in the range [0,1] | no |
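For illustration, the single-intent structure of this table could be expressed as the following non-normative TypeScript sketch. The field names follow the table and the JSON examples below, where each entity object is keyed by the entity name; the TypeScript types themselves are assumptions.

```typescript
// Sketch of the single-intent interpretation structure from the table above.
// It mirrors the JSON examples below, where each entity object is keyed by
// the entity name; the TypeScript types themselves are assumptions.

// e.g., { location: "Berlin", entityConfidence: 1.0 }
interface Entity {
  entityConfidence?: number; // confidence in the range [0, 1]
  [entityName: string]: string | number | undefined;
}

interface Intent {
  intent: string;            // group of utterances with similar meaning
  intentConfidence?: number; // confidence in the range [0, 1]
  entities?: Entity[];
}

// An interpretation is a list of intents with their associated entities.
type Interpretation = Intent[];

const checkWeather: Interpretation = [
  {
    intent: "check-weather",
    intentConfidence: 0.9,
    entities: [
      { location: "Berlin", entityConfidence: 1.0 },
      { date: "2022-12-02", entityConfidence: 0.94 },
    ],
  },
];
```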
The following request to processInput is a copy of Example Weather Information for Interface Client Input.
In return, the external IPA may send back the following response via ExternalClientResponse to the Dialog.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "interpretation": [ { "intent": "check-weather", "intentConfidence": 0.9, "entities": [ { "location": "Berlin", "entityConfidence": 1.0 }, { "date": "2022-12-02", "entityConfidence": 0.94 }, ] }, ... ] }
The external speech recognizer converts the obtained audio into text such as "What will the weather be like tomorrow". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.
This is illustrated in the following figure.
The following request to processInput is a copy of Example Flight Reservation for Interface Client Input.
In return, the external IPA may send back the following response via ExternalClientResponse to the Dialog, from which the reply "When do you want to fly from Berlin to San Francisco?" is eventually generated. In this case, empty entities, like date, indicate that there are still slots to be filled and no service call can be made right now.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "interpretation": [ { "intent": "book-flight", "intentConfidence": 0.87, "entities": [ { "origin": "Berlin", "entityConfidence": 1.0 }, { "destination": "San Francisco", "entityConfidence": 0.827 }, { "date": "", }, ... ] }, ... ] }
The external speech recognizer converts the obtained audio into text such as "I want to fly to San Francisco". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.
This is illustrated in the following figure.
Further steps will be needed to convert both location entities to origin and destination in the actual reply. This may either be done by the flight reservation IPA directly or by calling external services beforehand to determine the nearest airports for these locations.
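As a purely illustrative sketch, such a lookup might be performed as follows. The service endpoint and the function findNearestAirport are hypothetical and not part of this specification.

```typescript
// Illustrative sketch only: resolving the origin airport from a location
// entity by calling an external service beforehand. The endpoint and the
// function findNearestAirport are hypothetical, not part of this specification.

interface Coordinates {
  latitude: number;
  longitude: number;
}

async function findNearestAirport(location: Coordinates): Promise<string> {
  // A real deployment would call an actual geo/airport service here.
  const response = await fetch(
    `https://airports.example.org/nearest?lat=${location.latitude}&lon=${location.longitude}`
  );
  const { iata } = (await response.json()) as { iata: string };
  return iata; // e.g., "BER" for the Berlin location in the example above
}

// Usage with the location metadata from Example Flight Reservation:
// const origin = await findNearestAirport({ latitude: 52.518462, longitude: 13.378722 });
```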
This interface describes the data that is sent from the Dialog to the Provider Selection Service. The following table details the data that should be considered for this interface in the method callService.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
service id | data item | id of the service to be executed | yes |
parameters | data item | Parameters to the service call | no |
As a return value the result of this call is sent back in the ClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
service id | data item | id of the service that was executed | yes |
call result | data item | success or failure | yes |
call result details | data item | detailed information in case of a failed service call | no |
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed |
This call is optional; whether an external service is called depends on the result of the next dialog step.
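The tables above can be summarized in the following non-normative TypeScript sketch. The field names follow the tables and the JSON examples below; the types and the callService signature are assumptions.

```typescript
// Sketch of the Interface Service Call data, derived from the tables above
// and the JSON examples below. Field names follow those sources; the
// TypeScript types and the callService signature are assumptions.

interface ServiceCall {
  serviceId: string;                      // id of the service to be executed
  parameters?: Record<string, unknown>[]; // parameters to the service call
}

interface ServiceCallRequest {
  sessionId?: string;
  requestId: string;
  services: ServiceCall[];
}

interface ServiceResult {
  serviceId: string;                             // id of the service that was executed
  callResult: "success" | "failure";
  callResultDetails?: Record<string, unknown>[]; // detailed information, e.g., the forecast
  error?: unknown;                               // error as detailed in section Error Handling
}

interface ServiceCallResponse {
  sessionId?: string;
  requestId: string;
  callResult: "success" | "failure";
  services: ServiceResult[];
}

// Hypothetical entry point on the Provider Selection Service.
interface ProviderSelectionService {
  callService(request: ServiceCallRequest): Promise<ServiceCallResponse>;
}
```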
The following request to callService may be made to call the weather information service. Although calling the weather service is not a direct functionality of the IPA, it may help to understand how the entered data may be processed to obtain a spoken reply to the user's input.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "services": [ { "serviceId": "weather-service", "parameters": [ { "location": "Berlin", "date": "2022-12-02" } ] }, ... ] }
In return, the external service may send back the following response via ExternalClientResponse to the Dialog.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "services": [ { "serviceId": "weather-information", "callResult": "success", "callResultDetails": [ { "location": "Berlin", "date": "2022-12-02", "forecast": "snow showers", "minTemperature": -1, "maxTemperature": 0, ... } ] }, ... ] }
This information is then used to actually create a reply to the user as described in ExternalClientResponse to the Client.
Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.
Along the processing path errors may occur
Error messages carry the following information
name | type | description | required |
---|---|---|---|
error code | data item | unique error code that could be transformed into an IPA response matching the language and conversation | yes |
error message | data item | human-readable error message for logging and debugging | yes |
component id | data item | id of the component that has produced or handled the error | yes |
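For illustration, the error information could be represented as in the following non-normative TypeScript sketch; the shape and the example values are assumptions.

```typescript
// Sketch of the error information from the table above. The field names
// follow the table; the TypeScript shape and the example values are assumptions.

interface IPAError {
  errorCode: string;    // unique code that can be transformed into an IPA response
                        // matching the language and the conversation
  errorMessage: string; // human-readable message for logging and debugging
  componentId: string;  // id of the component that produced or handled the error
}

// Hypothetical example: the weather service could not be reached.
const serviceUnreachable: IPAError = {
  errorCode: "service.unreachable",
  errorMessage: "weather-service did not respond",
  componentId: "provider-selection-service",
};
```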
This section is still under preparation.
The Client Layer contains the main components that interface with the user.
Clients enable the user to access the IPA via voice. The following diagram provides some more insight.
The modality manager enables access to the modalities that are supported by the IPA Client. Major modalities are voice and, in the case of chatbots, text. The following interfaces are supported:
The Client Activation Strategy defines how the client gets activated to be ready to receive spoken commands as input. In turn the Microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below.
Client Activation Strategy | Description |
---|---|
Push-to-talk | The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application. |
Hotword | In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. This hotword is usually not part of the spoken command that is passed for further evaluation. |
Local Data Providers | In this case, a change in the environment may activate the client, for example if the user enters a room. |
... | ... |
The usage of hotwords has privacy implications, as the microphone needs to be always active. Streaming to components outside the user's control should be avoided; hence, detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, hotword detection may need to be extensible.
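For illustration, concurrent activation strategies could be modeled as in the following non-normative TypeScript sketch. The interface, the class names, and the helper detectHotwordLocally are assumptions; only the strategy concepts and the metadata augmentation for a preselected IPA Provider come from the table and text above.

```typescript
// Minimal sketch of concurrent client activation strategies. The interface and
// class names are illustrative assumptions; only the strategy concepts come
// from the table above.

type ActivationListener = (metadata?: Record<string, unknown>) => void;

interface ClientActivationStrategy {
  // Registers a listener that is called when the client should open the microphone.
  onActivation(listener: ActivationListener): void;
}

class PushToTalkStrategy implements ClientActivationStrategy {
  onActivation(listener: ActivationListener): void {
    // Hypothetical wiring to an on-screen button in a client application.
    document.getElementById("talk-button")?.addEventListener("click", () => listener());
  }
}

// Hypothetical local hotword detector; detection should happen locally for privacy.
declare function detectHotwordLocally(hotword: string, onDetected: () => void): void;

class HotwordStrategy implements ClientActivationStrategy {
  constructor(
    private readonly hotword: string,
    private readonly ipaProviderId?: string // preselected IPA Provider, if any
  ) {}

  onActivation(listener: ActivationListener): void {
    detectHotwordLocally(this.hotword, () =>
      // The preselected IPA Provider is passed as additional metadata augmenting the input.
      listener(this.ipaProviderId ? { ipaProvider: this.ipaProviderId } : undefined)
    );
  }
}
```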
The Dialog Layer contains the main components to drive the interaction with the user.