Copyright © 2022-2024 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document details the general architecture of Intelligent Personal Assistants as described in Architecture and Potential for Standardization Version 1.3 with regard to interface definitions. The architectural descriptions focus on intent-based voice-based personal assistants and chatbots. Current LLM intent-less chatbots may have other interface needs.
This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Intelligent Personal Assistants (IPAs) are now available in our daily lives through our smartphones. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which handle these sorts of tasks without the need to pick up a dedicated device and can even control household appliances in our homes. As of today, there is no interoperability among the available IPA providers, and especially for exchanging learned user behaviors this is unlikely to happen at all.
Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform.
This document is the second step in our strategy for IPA standardization. It is based on the general architecture of IPAs described in Architecture and Potential for Standardization Version 1.3, which aims at exploring the potential areas for standardization. It focuses on voice as the major input modality. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment. The overall concept is not restricted to voice but also covers purely text-based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In the case of chatbots, speech components will be omitted. In the case of multimodal interaction, interaction modalities may be extended by components that recognize input from the respective modality and transform it into something meaningful, and, vice versa, generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.
In this second step we describe the interfaces of the general architecture of IPAs in Architecture and Potential for Standardization Version 1.3. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment.
In order to cope with such use cases as those described above an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.
Interfaces are described with the help of UML diagrams. We expect the reader to be familiar with that notation, although most concepts are easy to understand and do not require in-depth knowledge. The main diagram types used in this document are component diagrams and sequence diagrams. The UML diagrams are provided as the Enterprise Architect model pa-architecture.EAP. They can be viewed with the free-of-charge tool EA Lite.
This section describes potential usages of IPAs that will be used later in the document to illustrate the usage of the specified interfaces.
A user located in Berlin, Germany, is planning to visit her friend a few kilometers away the next day. As she is considering taking the bike, she asks the IPA for the weather conditions.
A user located in Berlin, Germany, would like to plan a trip to an international conference and wants to book a flight to the conference in San Francisco. Therefore, she approaches the IPA to help her with booking the flight.
The architecture described in this document follows the SOLID principles introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution.
This architecture aims at following both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text to speech), and newer LLM (Large Language Model) based approaches. This architecture does not rule out combining some of these components in specific systems.
Among others, the following most popular high-level use cases for IPAs are to be supported
This is supported by a flexible architecture that supports dynamically adding local and remote services or knowledge sources, such as data providers. Moreover, it is possible to include other IPAs with the same architecture and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.
The following table lists the IPA main use cases and related examples that are used in this document
Main Use Case | Example |
---|---|
Question Answering or Information Retrieval | Weather information |
Executing local and/or remote services to accomplish tasks | Flight reservation |
These main use cases are shown in the following figure
Not all components may be needed for actual implementations; some may be omitted completely. In particular, LLM-based architectures may combine the functionality of multiple components into only one or a few components. However, we note them here to provide a more complete picture.
The architecture comprises three layers that are detailed in the following sections
Actual implementations may want to distinguish more or fewer layers than these. The assignment of components to layers is not considered to be strict, so some of the components may be shifted to other layers as needed. This view reflects the separation of concerns that the Community Group regards as ideal.
Accordingly, the components are assigned to the packages shown below.
The most flexible way to combine the components is to chain them: the output of one component is the input of the next component. Therefore, we introduce the concept of an IPADataProcessor that may receive IPAData from incoming IPADataProcessors and forward IPAData to outgoing IPADataProcessors. The classes for the major components are shown in figure 4 a.
Actual implementations may link them together as needed. An example is shown in figure 4 b. In this example four IPADataProcessors are shown: Client, IPAService, ProviderSelectionService, and DialogManager. In the actual chain, IPAService and Client appear twice, once at the beginning and once at the end of the chain. This chain is then used to hand IPAData generated at the beginning of the chain towards the end. For example, the IPADataProcessor Client generates the IPAData MultimodalInputs. In this example it is expected to contain only data for the single modality text. This is forwarded to the next IPADataProcessor, IPAService, for further processing.
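As a minimal, non-normative illustration, the chaining concept could be sketched in TypeScript as follows. Only the names IPAData and IPADataProcessor come from the architecture; the method signatures and the payload shape are assumptions made for this sketch.

```typescript
// Minimal sketch of the IPADataProcessor chaining concept.
// Only the names IPAData and IPADataProcessor come from the architecture;
// the method signatures and the payload shape are illustrative assumptions.

interface IPAData {
  sessionId?: string;
  requestId: string;
  // e.g., MultimodalInputs on the way in, a response on the way out
  payload: Record<string, unknown>;
}

interface IPADataProcessor {
  // Receives IPAData from the incoming processor and returns the
  // (possibly transformed) IPAData for the outgoing processor.
  process(data: IPAData): IPAData;
}

// Trivial processors that only annotate the data as it passes through.
class NamedProcessor implements IPADataProcessor {
  constructor(private readonly name: string) {}

  process(data: IPAData): IPAData {
    console.log(`${this.name} handles request ${data.requestId}`);
    return { ...data, payload: { ...data.payload, lastProcessor: this.name } };
  }
}

const client = new NamedProcessor("Client");
const ipaService = new NamedProcessor("IPAService");
const providerSelection = new NamedProcessor("ProviderSelectionService");
const dialogManager = new NamedProcessor("DialogManager");

// The chain from figure 4 b: IPAService and Client appear twice,
// once at the beginning and once at the end of the chain.
const chain: IPADataProcessor[] = [
  client,
  ipaService,
  providerSelection,
  dialogManager,
  ipaService,
  client,
];

// The Client generates MultimodalInputs (here only the single modality text)
// which are handed from the beginning of the chain towards the end.
const input: IPAData = {
  requestId: "42",
  payload: { text: "What will the weather be like tomorrow?" },
};
chain.reduce((data, processor) => processor.process(data), input);
```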
This section details the interfaces from the figure shown in the architecture. The interfaces are described with the following attributes
A typical flow for the high-level interfaces is shown in the following figure.
This sequence supports the major use cases stated above.
This interface describes the data that is sent from the IPA Client to the IPA Service. The following table details the data that should be considered for this interface in the method processInput
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if obtained |
request id | data item | unique identifier of the request within a session | yes |
audio data | data item | encoded or raw audio data | yes |
multimodal input | category | input that has been received from modality recognizers, e.g., text, gestures, pen input, ... | no |
meta data | category | data augmenting the request, e.g., user identification, timestamp, location, ... | no |
The session id can be created by the IPA Service. In case a session id is provided, it must be used for subsequent calls.
The IPA Client maintains a request id for each request that is sent via this interface. These ids must be unique within a session.
Audio data can be delivered mainly in two ways:
For endpointed audio data, the IPA Client determines the end of speech, e.g., with the help of voice activity detection. In this case only the portion of audio that contains the potential spoken user input is sent. In terms of user experience this means that processing of the user input can only happen after the end of speech has been detected.
For streamed audio data, the IPA Client starts sending audio data as soon as it has been detected that the user is speaking to the system with the help of the Client Activation Strategy. In terms of user experience this means that processing of the user input can happen while the user is speaking.
An audio codec may be used, e.g., to reduce the amount of data to be transferred. The selection of the codec is not part of this specification.
Optionally, multimodal input that has been captured by a specific modality recognizer can be transferred. Here, modalities refers to all modalities other than audio, e.g., text for a chatbot, or gestures.
Optionally, meta data may be transferred augmenting the input. Examples of such data include user identification, timestamp and location.
The IPA Service may maintain a session id, e.g., to serve multiple clients and allow them to be distinguished.
As a return value this interface describes the data that is sent from the IPA Service to the IPA Client. The following table details the data that should be considered for this interface in the ClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if obtained |
request id | data item | unique identifier of the request within a session | yes |
audio data | data item | encoded or raw audio data | yes |
multimodal output | category | output that has been received from modality synthesizers, e.g., text, command to execute an observable action, ... | no |
In case the parameter multimodal output contains commands to be executed, they are expected to follow the specification of the Interface Service Call.
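The tables above can be summarized in the following non-normative TypeScript sketch of the request and return data. The field names follow the tables and the JSON examples below; the concrete types and the processInput signature are assumptions.

```typescript
// Sketch of the data in Interface Client Input and its ClientResponse,
// derived from the tables above. Field names follow the tables and the
// JSON examples below; the concrete TypeScript types are assumptions.

interface AudioData {
  type: "Endpointed" | "Streamed";
  data: string;     // encoded or raw audio, e.g., base64; codec selection is out of scope
  encoding: string; // e.g., "PCM-16BIT"
}

interface ClientInput {
  sessionId?: string; // required once it has been obtained from the IPA Service
  requestId: string;  // maintained by the IPA Client, unique within a session
  audio: AudioData;
  multimodal?: Record<string, unknown>; // e.g., text, gestures, pen input
  meta?: Record<string, unknown>;       // e.g., user identification, timestamp, location
}

interface ClientResponse {
  sessionId?: string;
  requestId: string;
  audio: AudioData;
  multimodal?: Record<string, unknown>; // e.g., text or commands following Interface Service Call
}

// Hypothetical signature of the IPA Service entry point described above.
interface IPAService {
  processInput(input: ClientInput): Promise<ClientResponse>;
}
```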
The following sections will provide examples using the JSON format to illustrate the interfaces. JSON is only chosen as it is easy to understand and read. This specification does not make any assumptions about the underlying programming languages or data format. They are just meant to be an illustration of how responses may be generated with the provided data. It is not required that implementations follow exactly the described behavior.
The following request to processInput sends endpointed audio data with the user's current location to query for tomorrow's weather with the utterance "What will the weather be like tomorrow?".
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3", "encoding": "PCM-16BIT" } "multimodal": { "location": { "latitude": 52.51846213843821, "longitude": 13.37872252544883338.897957 } ... }, "meta": { "timestamp": "2022-12-01T18:45:00.000Z" ... } }
In this example endpointed audio data is transferred as a value. There are other ways to send the audio data to the IPA, e.g., as a reference. This way is chosen as it is easier to illustrate the usage.
In return, the IPA may send back the following response "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees" via ClientResponse to the Client.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2", "encoding": "PCM-16BIT" } "multimodal": { "text": "Tomorrow there will be snow showers in Berlin with temperatures between 0 and -1 degrees." ... }, "meta": { ... } }
The following request to processInput sends endpointed audio data with the user's current location to book a flight with the utterance "I want to fly to San Francisco".
{ "sessionId": "0c27895c-644d-11ed-81ce-0242ac120002", "requestId": "15", "audio": { "type": "Endpointed", "data": "ZmhhcGh2cGF3aGZwYWhuZ...zI0MDc4NDY1NiB5dGhvaGF3", "encoding": "PCM-16BIT" } "multimodal": { "location": { "latitude": 52.51846213843821, "longitude": 13.37872252544883338.897957 } ... }, "meta": { "timestamp": "2022-11-14T19:50:00.000Z" ... } }
In return, the IPA may send back the following response "When do you want to fly from Berlin to San Francisco?" via ClientResponse to the Client.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "audio": { "type": "Endpointed", "data": "Uvrs4hcGh2cGF3aGZwYWhuZ...vI0MDc4DGY1NiB5dGhvaRD2", "encoding": "PCM-16BIT" } "multimodal": { "text": "When do you want to fly from Berlin to San Francisco?" ... }, "meta": { ... } }
This interface describes the data that is sent to the Provider Selection Service. The input is a copy of the data that is sent from the IPA Client to the IPA Service in Interface Client Input. This interface mainly differs in the return value. The following table details the data that should be considered for this interface in the method processInput.
As a return value, this interface describes the data that is sent from the Provider Selection Service to the NLU and Dialog Management. The following table details the data that should be considered for this interface in the method ExternalClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
call result | data item | success or failure | yes |
multimodal output | category | output that has been received from an external IPA | yes, if no interpretation is provided and no error occurred |
interpretation | category | meaning as intents and associated entities | yes, if no multimodal output is provided and no error occurred |
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed |
The parameters named session id and request id are copies of the data received from the Interface Client Input.
This call is optional, depending on whether external IPAs are used.
Depending on the capabilities of the external IPA, the return value may be one of the following options:
The category interpretation may be one of the following options, depending on the capabilities of the external IPA
With single-intent, the user provides a single intent per utterance. An example of single-intent is "Book a flight to San Francisco for tomorrow morning." The single intent here is book-flight. With multi-intent, the user provides multiple intents in a single utterance. An example of multi-intent is "How is the weather in San Francisco and book a flight for tomorrow morning." The provided intents are check-weather and book-flight. In this case the IPA needs to determine the order of intent execution based on the structure of the utterance. If the intents are not executed in parallel, the IPA triggers the next intent in the identified order.
As multi-intent is not very common in today's IPAs, the focus for now is on single-intent, as detailed in the following table; a non-normative sketch follows the table.
name | data type | description | required |
---|---|---|---|
interpretation | list | list of meaning as intents and associated entities | yes |
intent | string | group of utterances with similar meaning | yes |
intent confidence | float | confidence value for the intent in the range [0,1] | no |
entities | list | list of entities associated to the intent | no |
name of the entity | string | additional information to the intent | no |
entity confidence | float | confidence value for the entity in the range [0,1] | no |
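For illustration, the single-intent structure of this table could be expressed as the following non-normative TypeScript sketch. The field names follow the table and the JSON examples below, where each entity object is keyed by the entity name; the TypeScript types themselves are assumptions.

```typescript
// Sketch of the single-intent interpretation structure from the table above.
// It mirrors the JSON examples below, where each entity object is keyed by
// the entity name; the TypeScript types themselves are assumptions.

// e.g., { location: "Berlin", entityConfidence: 1.0 }
interface Entity {
  entityConfidence?: number; // confidence in the range [0, 1]
  [entityName: string]: string | number | undefined;
}

interface Intent {
  intent: string;            // group of utterances with similar meaning
  intentConfidence?: number; // confidence in the range [0, 1]
  entities?: Entity[];
}

// An interpretation is a list of intents with their associated entities.
type Interpretation = Intent[];

const checkWeather: Interpretation = [
  {
    intent: "check-weather",
    intentConfidence: 0.9,
    entities: [
      { location: "Berlin", entityConfidence: 1.0 },
      { date: "2022-12-02", entityConfidence: 0.94 },
    ],
  },
];
```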
The following request to processInput is a copy of Example Weather Information for Interface Client Input.
In return, the external IPA may send back the following response via ExternalClientResponse to the Dialog.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "interpretation": [ { "intent": "check-weather", "intentConfidence": 0.9, "entities": [ { "location": "Berlin", "entityConfidence": 1.0 }, { "date": "2022-12-02", "entityConfidence": 0.94 }, ] }, ... ] }
The external speech recognizer converts the obtained audio into text such as "What will the weather be like tomorrow". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.
This is illustrated in the following figure.
The following request to processInput is a copy of Example Flight Reservation for Interface Client Input.
In return, the external IPA may send back the following response via ExternalClientResponse to the Dialog, from which the reply "When do you want to fly from Berlin to San Francisco?" is eventually generated. In this case, empty entities, like date, indicate that there are still slots to be filled and no service call can be made right now.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "interpretation": [ { "intent": "book-flight", "intentConfidence": 0.87, "entities": [ { "origin": "Berlin", "entityConfidence": 1.0 }, { "destination": "San Francisco", "entityConfidence": 0.827 }, { "date": "", }, ... ] }, ... ] }
The external speech recognizer converts the obtained audio into text such as "I want to fly to San Francisco". The NLU then extracts the following from the decoded utterance, other multimodal input, and metadata.
This is illustrated in the following figure.
Further steps will be needed to convert both location entities to origin and destination in the actual reply. This may either be done by the flight reservation IPA directly or by calling external services beforehand to determine the nearest airports for these locations.
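As a purely illustrative sketch, such a lookup might be performed as follows. The service endpoint and the function findNearestAirport are hypothetical and not part of this specification.

```typescript
// Illustrative sketch only: resolving the origin airport from a location
// entity by calling an external service beforehand. The endpoint and the
// function findNearestAirport are hypothetical, not part of this specification.

interface Coordinates {
  latitude: number;
  longitude: number;
}

async function findNearestAirport(location: Coordinates): Promise<string> {
  // A real deployment would call an actual geo/airport service here.
  const response = await fetch(
    `https://airports.example.org/nearest?lat=${location.latitude}&lon=${location.longitude}`
  );
  const { iata } = (await response.json()) as { iata: string };
  return iata; // e.g., "BER" for the Berlin location in the example above
}

// Usage with the location metadata from Example Flight Reservation:
// const origin = await findNearestAirport({ latitude: 52.518462, longitude: 13.378722 });
```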
This interface describes the data that is sent from the Dialog to the Provider Selection Service. The following table details the data that should be considered for this interface in the method callService.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
service id | data item | id of the service to be executed | yes |
parameters | data item | Parameters to the service call | no |
As a return value the result of this call is sent back in the ClientResponse.
name | type | description | required |
---|---|---|---|
session id | data item | unique identifier of the session | yes, if the IPA requires the usage |
request id | data item | unique identifier of the request within a session | yes |
service id | data item | id of the service that was executed | yes |
call result | data item | success or failure | yes |
call result details | data item | detailed information in case of a failed service call | no |
error | category | error as detailed in section Error Handling | yes, if an error during execution is observed |
This call is optional; whether an external service is called depends on the result of the next dialog step.
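The tables above can be summarized in the following non-normative TypeScript sketch. The field names follow the tables and the JSON examples below; the types and the callService signature are assumptions.

```typescript
// Sketch of the Interface Service Call data, derived from the tables above
// and the JSON examples below. Field names follow those sources; the
// TypeScript types and the callService signature are assumptions.

interface ServiceCall {
  serviceId: string;                      // id of the service to be executed
  parameters?: Record<string, unknown>[]; // parameters to the service call
}

interface ServiceCallRequest {
  sessionId?: string;
  requestId: string;
  services: ServiceCall[];
}

interface ServiceResult {
  serviceId: string;                             // id of the service that was executed
  callResult: "success" | "failure";
  callResultDetails?: Record<string, unknown>[]; // detailed information, e.g., the forecast
  error?: unknown;                               // error as detailed in section Error Handling
}

interface ServiceCallResponse {
  sessionId?: string;
  requestId: string;
  callResult: "success" | "failure";
  services: ServiceResult[];
}

// Hypothetical entry point on the Provider Selection Service.
interface ProviderSelectionService {
  callService(request: ServiceCallRequest): Promise<ServiceCallResponse>;
}
```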
The following request to callService may be made to call the weather information service. Although calling the weather service is not a direct functionality of the IPA, it may help to understand how the entered data may be processed to obtain a spoken reply to the user's input.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "services": [ { "serviceId": "weather-service", "parameters": [ { "location": "Berlin", "date": "2022-12-02" } ] }, ... ] }
In return, the external service may send back the following response via ExternalClientResponse to the Dialog.
{ "sessionId": "0d770c02-2a13-11ed-a261-0242ac120002", "requestId": "42", "callResult": "success", "services": [ { "serviceId": "weather-information", "callResult": "success", "callResultDetails": [ { "location": "Berlin", "date": "2022-12-02", "forecast": "snow showers", "minTemperature": -1, "maxTemperature": 0, ... } ] }, ... ] }
This information is then used to actually create a reply to the user as described in ExternalClientResponse to the Client.
Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.
Along the processing path errors may occur
Error messages carry the following information
name | type | description | required |
---|---|---|---|
error code | data item | unique error code that could be transformed into an IPA response matching the language and conversation | yes |
error message | data item | human-readable error message for logging and debugging | yes |
component id | data item | id of the component that has produced or handled the error | yes |
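For illustration, the error information could be represented as in the following non-normative TypeScript sketch; the shape and the example values are assumptions.

```typescript
// Sketch of the error information from the table above. The field names
// follow the table; the TypeScript shape and the example values are assumptions.

interface IPAError {
  errorCode: string;    // unique code that can be transformed into an IPA response
                        // matching the language and the conversation
  errorMessage: string; // human-readable message for logging and debugging
  componentId: string;  // id of the component that produced or handled the error
}

// Hypothetical example: the weather service could not be reached.
const serviceUnreachable: IPAError = {
  errorCode: "service.unreachable",
  errorMessage: "weather-service did not respond",
  componentId: "provider-selection-service",
};
```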
This section is still under preparation.
The Client Layer contains the main components that interface with the user.
Clients enable the user to access the IPA via voice. The following diagram provides some more insight.
The modality manager enables access to the modalities that are supported by the IPA Client. Major modalities are voice and, in the case of chatbots, text. The following interfaces are supported:
The Client Activation Strategy defines how the client gets activated to be ready to receive spoken commands as input. In turn the Microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below.
Client Activation Strategy | Description |
---|---|
Push-to-talk | The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application. |
Hotword | In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. This hotword is usually not part of the spoken command that is passed for further evaluation. |
Local Data Providers | In this case, a change in the environment may activate the client, for example if the user enters a room. |
... | ... |
The usage of hotwords has privacy implications, as the microphone needs to be always active. Streaming to components outside the user's control should be avoided; hence, detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, hotword detection may need to be extensible.
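For illustration, concurrent activation strategies could be modeled as in the following non-normative TypeScript sketch. The interface, the class names, and the helper detectHotwordLocally are assumptions; only the strategy concepts and the metadata augmentation for a preselected IPA Provider come from the table and text above.

```typescript
// Minimal sketch of concurrent client activation strategies. The interface and
// class names are illustrative assumptions; only the strategy concepts come
// from the table above.

type ActivationListener = (metadata?: Record<string, unknown>) => void;

interface ClientActivationStrategy {
  // Registers a listener that is called when the client should open the microphone.
  onActivation(listener: ActivationListener): void;
}

class PushToTalkStrategy implements ClientActivationStrategy {
  onActivation(listener: ActivationListener): void {
    // Hypothetical wiring to an on-screen button in a client application.
    document.getElementById("talk-button")?.addEventListener("click", () => listener());
  }
}

// Hypothetical local hotword detector; detection should happen locally for privacy.
declare function detectHotwordLocally(hotword: string, onDetected: () => void): void;

class HotwordStrategy implements ClientActivationStrategy {
  constructor(
    private readonly hotword: string,
    private readonly ipaProviderId?: string // preselected IPA Provider, if any
  ) {}

  onActivation(listener: ActivationListener): void {
    detectHotwordLocally(this.hotword, () =>
      // The preselected IPA Provider is passed as additional metadata augmenting the input.
      listener(this.ipaProviderId ? { ipaProvider: this.ipaProviderId } : undefined)
    );
  }
}
```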
The Dialog Layer contains the main components to drive the interaction with the user.