Copyright © 2019-2024 the Contributors to the Voice Interaction Community Group, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document describes a general architecture of Intelligent Personal Assistants and explores the potential for standardization. It is meant to be a first structured exploration of Intelligent Personal Assistants, identifying the components and their tasks. Subsequent work is expected to detail the interaction among the identified components, their actual tasks, and how they ought to perform them. This document may need to be updated if any changes result from that detailing work. It extends and refines the description of the previous version, Architecture and Potential for Standardization Version 1.2. The changes primarily consist of clarifications and additional architectural details in new and expanded figures, including input and output data paths.
This specification was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction
Intelligent Personal Assistants (IPAs) are now widely available in our daily lives and can be accessed in many ways. Apple’s Siri, Google Assistant, Microsoft’s Cortana, Samsung’s Bixby and many more are helping us with various tasks, like shopping, playing music, setting a schedule, sending messages, and offering answers to simple questions. Additionally, we equip our households with smart speakers like Amazon’s Alexa or Google Home, which are available without the need to pick up a dedicated device for these sorts of tasks and which can even control household appliances in our homes.
Besides Intelligent Personal Assistant, other names are also in use, such as conversational agent, chatbot, virtual assistant, or digital assistant. While the interpretation of these terms may vary, we will use the term Intelligent Personal Assistant (IPA) in this document. The term personal in this name refers to a potential ability of the assistant to learn about the user and to adapt to her. It neither implies that the assistant must have this capability nor that the interaction is personal in the sense that no data is shared.
As of today, there is no interoperability among the available IPA providers. Especially for exchanging learned user behaviors this is unlikely to happen at all.
Furthermore, in addition to these general-purpose assistants, there are also specialized virtual assistants which are able to provide their users with in-depth information which is specific to an enterprise, government agency, school, or other organization. They may also have the ability to perform transactions on behalf of their users, such as purchasing items, paying bills, or making reservations. Because of the breadth of possibilities for these specialized assistants, it is imperative that they be able to interoperate with the general-purpose assistants. Without this kind of interoperability, enterprise developers will need to re-implement their intelligent assistants for each major generic platform.
The recent increase in the availability of Large Language Models (LLMs) has greatly improved the natural language processing capabilities of IPAs. These technical improvements can be accommodated by variations of the architecture, as described in sections 3.2.4 and 3.2.5 and shown in figures 2a, 2b and 3.
This document is a first step in our strategy for IPA standardization. It describes a general architecture of IPAs and explores the potential areas for standardization. It focuses on voice as the major input modality. We believe it will be of value not only to developers, but to many of the constituencies within the intelligent personal assistant ecosystem. Enterprise decision-makers, strategists and consultants, and entrepreneurs may study this work to learn of best practices and seek adjacencies for creation or investment. The overall concept is not restricted to voice but also covers purely text-based interactions with so-called chatbots as well as interaction using multiple modalities. Conceptually, the authors also define executing actions in the user's environment, like turning on the light, as a modality. This means that components that deal with speech recognition, natural language understanding or speech synthesis will not necessarily be available in these deployments. In the case of chatbots, speech components will be omitted. In the case of multimodal interaction, interaction modalities may be extended by components that recognize input from the respective modality and transform it into something meaningful, and vice versa to generate output in one or more modalities. Some modalities may be used as output-only, like turning on the light, while other modalities may be used as input-only, like touch.
Currently, users mainly use the IPA Provider that is shipped with a certain piece of hardware. Thus, the selection of a smartphone manufacturer actually determines which IPA implementation they are using. Switching among different IPA providers also involves switching the manufacturer, which comes with high costs and requires getting used to a new user interface specific to the new manufacturer. On the one hand, users should have more freedom in selecting the IPA implementation they want; today they are bound to the service that is available in that implementation, which may not be what they actually prefer. On the other hand, IPA providers, which mainly produce software, must also function as hardware manufacturers to be successful.
Moreover, we are also seeing the emergence of independent conversational agents, owned and operated by independent enterprises, and built either on white-label platforms or from best-of-breed components by third-party development agencies. This may largely free IPA development from hardware. Such a market transition creates an ever greater impetus for this work.
Finally, manufacturers also have to take care of porting existing services to their platform. Standardization would clearly lower the effort needed for porting and thus reduce costs. Additionally, it may also pave the way for interoperability among available IPA providers: tasks may be transferred, partially or completely, to other IPAs.
In order to explore the potential for standardization, a typical usage scenario is described in the following section.
This section describes potential usages of IPAs.
A user would like to plan a trip to an international conference and she needs visa information and airline reservations. She will give the intelligent personal assistant (IPA) her visa information (her citizenship, where she is going, purpose of travel, etc.) and it will respond by telling her the documentation she needs, how long the process will take and what the cost will be. This may require the personal assistant to consult with an auxiliary web service or another personal assistant that knows about visas.
Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. She specifies her dates of travel and airline preferences and the IPA then interacts with her to find appropriate flights.
A similar process will be repeated if the user wants to book a hotel, find a rental car, or find out about local attractions in the destination city. Booking a hotel as part of attending a conference could also involve finding out about a designated conference hotel or special conference rates, which, again, could require interaction with the hotel's or the conference's IPA.
A user encounters an emergency situation that requires them to use their hands while administering medical care, driving, or operating machinery. Manual interactions on control panels, keyboards, or touch pads can impede life-saving activities and diminish focus while operating sensitive vehicles, devices, and machinery. The user would benefit from a secure, interoperable, voice-interactive system that can be used to access necessary information while keeping the hands free to perform these actions.
Examples of emergency applications include:
All of these use cases benefit from voice interaction systems that have:
A user is driving a car and sees a sign for a historical site near the road. She becomes curious and wants to know more about the site. She then activates the in-vehicle chatbot and asks about the site. The chatbot provides her with information about the site. One aspect of the answer attracts the special attention of the user and she wants to dig deeper. She asks the chatbot to provide her with some more details. As she continues asking questions she is able to use her driving time to actively learn.
Currently available generative AI models are ideal candidates for this use case. Thanks to the use of LLMs, they can sustain such a conversation with the user.
The following roles and responsibilities are identified according to the RACI model (responsible, accountable, consulted, informed):
Role | R | A | C | I |
---|---|---|---|---|
Platform provider | x | x | | |
Content Owner | x | x | | |
Developer | x | x | | |
Designer and Application Developer | x | | | |
System Integrator | x | | | |
User | | | | |
In order to cope with use cases such as those described above, an IPA follows the general design concepts of a voice user interface, as can be seen in Figure 1.
The architecture described in this document follows the SOLID principles introduced by Robert C. Martin to arrive at a scalable, understandable and reusable software solution.
This architecture aims at supporting both a traditional partitioning of conversational systems, with separate components for speech recognition, natural language understanding, dialog management, natural language generation, and audio output (audio files or text-to-speech), and newer approaches utilizing generative AI. This architecture does not rule out combining some of these components in specific systems.
The following Figure 1 shows the basic architecture of an IPA.
Both architectures aim at serving, among others, the following most popular high-level use cases for IPAs
This is enabled by a flexible architecture that supports dynamically adding local and remote services or knowledge sources such as data providers. Moreover, it is possible to include other IPAs, with the same architecture, and forward requests to them, similar to the principle of a Russian doll (omitting the Client Layer). All this describes the capabilities of the IPA. These extensions may be selected from a standardized marketplace. For the remainder of this document, we consider an IPA that is extensible via such a marketplace.
Not all components may be needed for actual implementations; some may be omitted completely. However, we list them here to provide a more complete picture. This architecture comprises three layers that are detailed in the following sections
Actual implementations may want to distinguish more layers than these. The assignment of components to layers is not considered strict, so some of the components may be shifted to other layers as needed. This only reflects a view that the Community Group regards as ideal and shows the intended separation of concerns.
There are also no assumptions about the location of the layers and associated components. They may be on a single device, distributed across different devices or even in the cloud. With regard to privacy, it is recommended not to send any data to the cloud. If this is necessary, data should only be sent to trusted entities and should be encrypted.
The Client Layer contains the main components that interface with the user. The following figure details the view onto the Client Layer shown in Figure 1.
Capture devices or modality recognizers are used to capture multimodal user input, such as voice or text input. Additional input modalities can be employed, each captured by a specific modality recognizer. Additional input may be gathered from Local Data Providers.
For capture devices the following modality types are defined
Modality Type | Description |
---|---|
voice | Input was provided via spoken language |
text | Input was provided via written language |
... | ... |
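As an illustration of how a client might tag captured input with one of these modality types, the following minimal sketch uses an assumed CapturedInput structure; neither the class nor its field names are part of this specification, and the provider hint merely mirrors the metadata mentioned for the Hotword activation strategy below.

```python
from dataclasses import dataclass, field
from typing import Optional, Union

@dataclass
class CapturedInput:
    """Illustrative container for input captured by a modality recognizer."""
    modality_type: str                        # e.g. "voice" or "text", as defined above
    payload: Union[bytes, str]                # raw audio bytes for voice, a string for text
    ipa_provider_hint: Optional[str] = None   # e.g. set when a hotword preselects an IPA Provider
    metadata: dict = field(default_factory=dict)

# A microphone-based recognizer could emit something like:
voice_input = CapturedInput(modality_type="voice", payload=b"\x00\x01...")
# while a keyboard could emit:
text_input = CapturedInput(modality_type="text", payload="book a flight to Tokyo")
```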
The microphone is used to capture the voice input of a user as a primary input modality with the modality type voice.
The keyboard may optionally be used to capture text input if the IPA accepts this input modality.
Presentation devices or modality synthesizers are used to provide system output to the user. Additional output modalities can be employed that render their output with a specific modality synthesizer. It is not always required that a verbal auditory reply is given to the user. The user can also become aware of the output through an observable action resulting from a Local Service within the Client Layer or an External Service call from the External Data / Services / IPA Providers Layer. In these cases an additional nonverbal auditory output may be considered.
For presentation devices the following modality types are defined
Modality Type | Description |
---|---|
voice | Output was provided via spoken language |
audio | Output was provided via raw audio |
text | Output was provided via written language |
... | ... |
The loudspeaker is used to output replies as verbal auditory output in the shape of spoken utterances as a primary output modality. Utterances may be accompanied by nonverbal auditory output such as
Verbal auditory output is in the modality type voice while nonverbal auditory output is in the modality type audio.
The display may be optionally used to present text output if the IPA supports this output modality.
Clients enable the user to access the IPA via voice with the following characteristics.
The Client Activation Strategy defines how the client is activated to be ready to receive spoken commands as input. Upon activation, the Microphone is opened for recording. Client Activation Strategies are not exclusive but may be used concurrently. The most common activation strategies are described in the table below
Client Activation Strategy | Description |
---|---|
Push-to-talk | The user explicitly triggers the start of the client by means of a physical or on-screen button or its equivalent in a client application. |
Hotword | In this case, the user utters a predefined word or phrase to activate the client by voice. Hotwords may also be used to preselect a known IPA Provider. In this case the identifier of that IPA Provider is also used as additional metadata augmenting the input. This hotword is usually not part of the spoken command that is passed for further evaluation. |
Gesture-to-talk | The user triggers the start of the client by means of a gesture, e.g. raising the hand to be detected by a sensor. |
Local Data Providers | In this case, a change in the environment may activate the client, for example if the user enters a room. |
... | ... |
The usage of hotwords raises privacy concerns, as the microphone needs to be always active. Streaming audio to components outside the user's control should be avoided; hence detection of hotwords should ideally happen locally. With regard to nested usage of IPAs that may feature their own hotwords, hotword detection might need to be extensible.
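The following sketch illustrates the locality recommendation above: hotword detection runs on-device and audio is only forwarded to the rest of the IPA after a hotword fires. The detector and streaming callbacks are assumptions for illustration, not part of this specification.

```python
from typing import Callable, Iterable, Optional

def run_hotword_gate(
    audio_frames: Iterable[bytes],
    detect_hotword: Callable[[bytes], Optional[str]],  # returns an IPA Provider id or None
    start_streaming: Callable[[str], None],            # opens the stream to the selected IPA
) -> None:
    """Keep all audio local until a hotword is detected (privacy by default)."""
    for frame in audio_frames:
        provider_id = detect_hotword(frame)  # runs locally, nothing leaves the device
        if provider_id is not None:
            # Only now is microphone input forwarded; the hotword itself is not
            # part of the spoken command passed on for further evaluation.
            start_streaming(provider_id)
            break
```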
A registry for all Local Services and Local Data Providers that can be accessed by the client
Local services can be used to execute local actions in the user's local environment. Examples include turning on the light or starting an application, for instance a navigation system in a car.
Local Data Providers capture input that is accessible in the user's local environment. They can be used to provide additional input to the IPA Client or to provide additional information that is needed to execute services. An example for the latter is the state of the light, either turned on or turned off.
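A minimal sketch of the registry for Local Services and Local Data Providers described above, assuming a simple dictionary-based lookup; the component and method names are illustrative only.

```python
class LocalRegistry:
    """Illustrative registry for Local Services and Local Data Providers."""

    def __init__(self):
        self._services = {}        # name -> callable executing a local action
        self._data_providers = {}  # name -> callable returning local state

    def register_service(self, name, action):
        self._services[name] = action

    def register_data_provider(self, name, provider):
        self._data_providers[name] = provider

    def execute(self, name, **kwargs):
        return self._services[name](**kwargs)

    def query(self, name):
        return self._data_providers[name]()

# Example: a light that can be switched on and whose state can be queried.
light_state = {"on": False}
registry = LocalRegistry()
registry.register_service("light.turn_on", lambda: light_state.update(on=True))
registry.register_data_provider("light.state", lambda: light_state["on"])

registry.execute("light.turn_on")
print(registry.query("light.state"))  # True
```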
The Dialog Layer contains the main components to drive the interaction with the user. The following figure details the high-level view of the Dialog Layer shown in Figure 1. The dialog layer may either be traditionally NLU-based as shown in Figure 2 a) or based on Generative AI as shown in Figure 2 b).
Both Traditional NLU and Generative AI approaches aim at similar goals. Traditional NLU systems try to understand the user's intent and associated data by parsing her input after it has been converted to text. Subsequent actions derived from that require processing in other dedicated components. Generative AI systems generate new output based on large language models.
The general IPA Service API mediates between the user and the overall IPA system. The service layer may be omitted in case the IPA Client communicates directly with the Dialog Manager. However, this is not recommended as it contradicts the principle of separation of concerns. It has the following characteristics
Dialog execution can be governed by sessions, e.g., to free resources of ASR, NLU or TTS engines when a session expires. Linguistic phenomena, like anaphoric references and ellipsis, are expected to work within a session. Conceptually, multiple sessions can be active in parallel on a single IPA, depending on the capabilities of the IPA. In LLM-based systems, the conversational history is expected to be considered within a session.
The selected IPA Providers or the Dialog Manager may have leading roles for the task of session management.
A session begins when
may continue over multiple interaction turns, i.e. an input and output cycle, and ends
This includes the possibility that a session may persist over multiple requests.
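A minimal sketch of session handling with a timeout-based expiry, which would allow engine resources to be freed as described above; the timeout value and class names are illustrative assumptions.

```python
import time
import uuid
from typing import Dict, List

class Session:
    """Illustrative session spanning one or more interaction turns."""

    def __init__(self, timeout_seconds: float = 300.0):
        self.session_id = str(uuid.uuid4())
        self.timeout_seconds = timeout_seconds
        self.history: List[Dict[str, str]] = []  # turns kept for anaphora, ellipsis or LLM context
        self._last_activity = time.monotonic()

    def add_turn(self, user_input: str, system_output: str) -> None:
        """Record one input and output cycle and keep the session alive."""
        self.history.append({"user": user_input, "system": system_output})
        self._last_activity = time.monotonic()

    @property
    def expired(self) -> bool:
        # Once expired, resources of ASR, NLU or TTS engines can be freed.
        return time.monotonic() - self._last_activity > self.timeout_seconds
```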
The Automated Speech Recognizer (ASR) receives audio streams of recorded utterances and generates recognition hypotheses as text strings for the local IPA. Conceptually, the ASR is a modality recognizer for speech. It has the following characteristics
This section describes the components that are specific to traditional NLU-based systems
A Natural Language Understanding (NLU) component is able to extract meaning, in the form of intents and associated entities, from an utterance given as a text string.
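The following sketch shows the kind of result such an NLU component could return for the flight example used later in this document; the intent and entity names are assumptions, not a normative format.

```python
from dataclasses import dataclass, field

@dataclass
class NluResult:
    """Illustrative NLU output: an intent plus its associated entities."""
    intent: str
    entities: dict = field(default_factory=dict)
    confidence: float = 0.0

# The utterance "I want to book a flight on American" could, for example, yield:
result = NluResult(
    intent="book_flight",
    entities={"airline": "American"},
    confidence=0.92,
)
```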
The NLU has the following characteristics
The Dialog Manager is a component that receives semantic information determined from user input, updates the dialog history and its internal state, decides upon subsequent steps to continue a dialog and provides output, mainly as synthesized or recorded utterances. Conceptually, the Dialog Manager defines the playground that is used by the Dialogs and contributes significantly to the user experience. The Dialog Manager is available in traditional NLU-based systems and has the following characteristics
A Dialog Strategy is a conceptualization of a dialog for operationalization in a computer system. It defines the representation of the dialog's state and the respective operations to process and generate events relevant to the interaction. This specification is agnostic to the employed Dialog Strategy. Examples of dialog strategies include
Dialog Strategy | Example |
---|---|
State-based | State Chart XML (SCXML): State Machine Notation for Control Abstraction |
Frame-based | Voice Extensible Markup Language (VoiceXML) 2.1 |
Plan-based | Information State Update |
Dialog State Tracking | Machine Learning for Dialog State Tracking: A Review |
... | ... |
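As a small illustration of the state-based strategy listed above, the following sketch drives a dialog with a hand-written state machine; the states, events and prompts are invented for a simple flight booking exchange and do not prescribe any particular notation such as SCXML.

```python
# Illustrative state-based dialog: states, events and prompts are hypothetical.
TRANSITIONS = {
    ("ask_destination", "destination_given"): "ask_date",
    ("ask_date", "date_given"): "confirm",
    ("confirm", "confirmed"): "done",
}

PROMPTS = {
    "ask_destination": "Where would you like to fly?",
    "ask_date": "On which date?",
    "confirm": "Shall I book this flight?",
    "done": "Your flight is booked.",
}

def next_state(state: str, event: str) -> str:
    """Advance the dialog state for a recognized event, or stay in the current state."""
    return TRANSITIONS.get((state, event), state)

state = "ask_destination"
print(PROMPTS[state])
for event in ["destination_given", "date_given", "confirmed"]:
    state = next_state(state, event)
    print(PROMPTS[state])
```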
The natural language generation (NLG) component is responsible for preparing the natural language text that represents the system’s output. NLG is not needed in LLM-based architectures. It has the following characteristics
This section describes the components that are specific to generative AI-based systems
Prompt adaptation is the process of adjusting the prompt so that the output of the IPA matches the user's needs. This may include adjusting the prompt to the user's preferences, the user's current context, or the user's current environment. This component is not needed in traditional NLU-based systems but is currently essential for LLM-based systems. It may become optional if LLMs become capable of handling this themselves. It has the following characteristics
LLM stands for Large Language Model. These models can conceptually be perceived as a special type of Dialog Manager that also includes the NLU and NLG components. This component is not needed in NLU-based systems and has the following characteristics
After the output of the LLM is received additional post-processing steps may be applied to make the output more accurate and reliable. Examples include validity checks or error handling. This optional component has the following characteristics
During the interaction with a user, all kinds of information are collected and managed in the so-called conversation context or dialog context. It contains all the short- and long-term information needed to handle a conversation and thus may exceed the concept of a session. It also serves context-based reasoning with the help of the Knowledge Graph and the generation of output to the user in the NLG. It is not possible to capture each and every aspect of what context should comprise, as discussions about context are likely to end up in trying to explain the world. For the sake of this specification it should be possible to deal with the following characteristics
The Dialog History mainly stores the past dialog events per user. Dialog events include users’ transcriptions, semantic interpretations and resulting actions. Thus, it has information on how the user reacted in the past and knows her preferences. The history may also be used to resolve anaphoric references in the NLU or as temporary knowledge in the Knowledge Graph.
Generative AI models may also use the history to add conversational context to the prompt.
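A minimal sketch of how a Generative AI based system could fold the dialog history into the prompt as described above; the prompt wording and the commented-out call_llm function are assumptions, and no particular LLM API is implied.

```python
from typing import Dict, List

def build_prompt(history: List[Dict[str, str]], user_input: str, persona: str) -> str:
    """Prepend persona instructions and past turns to the current user input."""
    lines = [persona]
    for turn in history:
        lines.append(f"User: {turn['user']}")
        lines.append(f"Assistant: {turn['system']}")
    lines.append(f"User: {user_input}")
    lines.append("Assistant:")
    return "\n".join(lines)

# Hypothetical usage; call_llm stands in for whatever model backend is used.
history = [{"user": "Tell me about this historical site.",
            "system": "It is a 12th century castle ruin."}]
prompt = build_prompt(history, "Who built it?", "You are a helpful in-car assistant.")
# response = call_llm(prompt)
```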
The system uses a Knowledge Graph, e.g., to reason about entities and intents. Input detected by the NLU or obtained from Data Providers may be enriched into more meaningful data that better matches the current task. One example is the use of the name of a person as a navigation target, as a person usually has an address that qualifies to be used in navigation tasks.
Generative AI models may also use the knowledge graph to add such factual context to the prompt.
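The person-to-address example above could look roughly like the following sketch, using a plain dictionary in place of a real knowledge graph; the data and relation names are invented.

```python
# A toy "knowledge graph" as nested facts; real systems would use a graph store.
KNOWLEDGE = {
    "Anna Example": {"type": "person", "home_address": "12 Sample Street, Springfield"},
}

def resolve_navigation_target(entity: str) -> str:
    """If the target is a person, substitute the address usable for navigation."""
    facts = KNOWLEDGE.get(entity, {})
    if facts.get("type") == "person" and "home_address" in facts:
        return facts["home_address"]
    return entity  # already a location, or unknown

print(resolve_navigation_target("Anna Example"))  # 12 Sample Street, Springfield
```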
The Text-to-Speech (TTS) component receives text strings, which it converts into audio data. Conceptually, the TTS is a modality specific renderer for speech. It has the following characteristics
Dialogs support interaction with the user. They include Core Dialogs, which are built into the system, and provide basic interactions, as well as more specialized dialogs which support additional functionality.
The Core Dialogs are logical entities that handle the basic functionality, via Core Intent Sets, needed to enable any interaction with the user. This includes, among others
Core Dialog | Purpose |
---|---|
Greeting | Welcome the user and prepare for initial input. |
Help | The user asked for more guidance. |
Goodbye | Terminate the interaction with the user. |
Service not available | The dialog relies on reaching out for a specific service but was not able to reach it, e.g. because of connection issues. |
Intent not known | The Provider Selection Service returned an intent that cannot be handled by a corresponding Dialog. |
No input | The user did not say anything within a predefined timespan. |
Error | An unknown error occurred, see also error handling. |
Transfer to external IPA Provider | Notify the user that the following dialog steps will be handled outside the scope of this IPA. |
... | ... |
Conceptually, the Core Dialog is a special Dialog as described in the following section that is always available.
In Traditional NLU systems, a Dialog is able to handle functionality that can be added to the capabilities of the Dialog Manager through its associated Intent Sets. Dialogs are logical entities within the overall description of the interaction with the user, executed by the Dialog Manager. In Generative AI systems, dialogs are more of the nature of templates that prescribe a response to the user. They may also go beyond that to meet requirements of agentic applications.
Dialogs must serve different purposes in the sense that they are unique for a certain task. For example, only a single flight reservation dialog may exist at a time. Dialogs have the following characteristics
A Core Intent Set usually identifies tasks to be executed and defines the capabilities of the Core Dialog. Conceptually, the Core Intent Sets are Intent Sets that are always available.
Intent Sets define actions, identified by the name of the intent, along with their parameters as entities, as produced by the NLU in Traditional NLU systems, that can be consumed by a corresponding Dialog. In Generative AI systems they can be perceived as an identification of a dialog along with parameters that are filled into a template. They have the following characteristics
The Dialog X's are able to handle functionality that can be added to the capabilities of the Dialog Manager through their associated Intent Set X. A Dialog X extends the Core Dialogs and adds functionality as a custom Dialog. The Dialog X's must serve different purposes in the sense that they are unique for a certain task. For example, only a single flight reservation dialog may exist at a time. They have the same characteristics as a Dialog.
An Intent Set X is a special Intent Set that identifies tasks that can be executed within the associated Dialog X.
The Dialog Registry manages all available Dialogs with their associated Intent Sets with respect to the current Dialog Strategy. This means it is the Dialog Registry that knows which Dialog to use for a given intent. For some Dialog Strategies this component may be omitted, as its role is taken over by the Dialog Manager. One of these cases is when the Dialog Strategy does not allow for the dynamic handling of Dialogs as described below.
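A minimal sketch of how a Dialog Registry could resolve an incoming intent to a Dialog via the registered Intent Sets; the registration API and the fallback to the "Intent not known" Core Dialog are assumptions for illustration.

```python
class DialogRegistry:
    """Illustrative mapping from intents (via Intent Sets) to Dialogs."""

    def __init__(self):
        self._intent_to_dialog = {}

    def register(self, dialog_name: str, intent_set: set) -> None:
        """Associate every intent of an Intent Set with its Dialog."""
        for intent in intent_set:
            self._intent_to_dialog[intent] = dialog_name

    def dialog_for(self, intent: str) -> str:
        # Fall back to the Core Dialog that handles unknown intents.
        return self._intent_to_dialog.get(intent, "core.intent_not_known")

registry = DialogRegistry()
registry.register("flight_reservation", {"book_flight", "cancel_flight"})
print(registry.dialog_for("book_flight"))   # flight_reservation
print(registry.dialog_for("order_pizza"))   # core.intent_not_known
```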
A service that provides access to all known Data Providers, External Services and IPA Providers. This service also maps the IPA Intent Sets to the Intent Sets in the Dialog layer for Traditional NLU systems. For Generative AI systems it serves to enable agentic applications. It has the following characteristics
The Provider Selection Strategy aims at determining those IPA Providers that are most likely suited to handle the current input. Generally, the system should not make any assumptions about the user's current input, as she may switch goals with each input, but there may be some deviating use cases. The provider selection strategy may be implemented, for example, as one of the following options or a combination thereof to determine a list of IPA Provider candidates.
In case the IPA Provider does not abstract from determining a relevant list of intents, the same strategy may be applied to determine the n-best intents.
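One possible realization of such a strategy is to ask every candidate IPA Provider for a confidence score for the current input and rank the results; the scoring interface below is an assumption, not a prescribed API.

```python
from typing import Callable, Dict, List

def select_providers(user_input: str,
                     providers: Dict[str, Callable[[str], float]],
                     n_best: int = 3) -> List[str]:
    """Rank IPA Providers by their self-reported confidence for the given input."""
    scored = []
    for name, score_fn in providers.items():
        scored.append((score_fn(user_input), name))  # each provider scores the input
    scored.sort(reverse=True)
    return [name for score, name in scored[:n_best] if score > 0.0]

# Hypothetical providers with toy scoring functions.
providers = {
    "visa_expert": lambda text: 0.9 if "visa" in text else 0.0,
    "flight_search": lambda text: 0.8 if "flight" in text else 0.0,
}
print(select_providers("I need a visa for Japan", providers))  # ['visa_expert']
```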
A registry for all IPA Providers that can be accessed. There are several options for how the provider registry learns which IPA Providers are available, for example
Not all of them must be supported. The latter two can make use of standardized descriptions of the IPA Providers like the Assistant Manifest from OVON.
Besides that, it has the following characteristics
A registry that knows how to access the known IPA Providers, i.e., which are available and credentials to access them. Storing of credentials must meet security and trust considerations that are expected from such a personalized service. It has the following characteristics
A registry for all External Services and Data Providers that can be accessed by the client
Data Providers obtain data from various external sources for use in the interaction, for example, data obtained from a third-party web service.
A data provider to get data to be used in the Dialog, e.g. as a result of a query.
External Services provide access to trigger actions outside of the system; for example, triggered from a third-party web service.
A specific External Service provides output of the system, e.g. through an application; such an application can use multiple External Services.
IPA Providers provide IPAs that can interact with users in an application.
In this sense an IPA might again be a fully fledged IPA, with the exception of the Client Layer, as this IPA will take over the role of a client to the nested IPA. Actually, this can be perceived as the Matryoshka (or Russian doll) principle. Each IPA may be perfectly used as is but can also be approached by other IPAs. Nested IPAs may be either traditional NLU-based or Generative AI-based IPAs.
A provider of an IPA service, like
The IPA provider may be part of the IPA implementation as an IPA Provider or, alternatively, a subset of the original functionality, as described below, may be part of another IPA implementation.
The previous sections showed a more detailed view of the architectural building blocks. A general overview comprising these details is shown in the following figure. Note that NLU-based and Generative AI-based architectures are combined, and only those components are required that are needed for the envisioned IPA type.
Errors may occur anywhere in the processing chain of the IPA. The following gives an overview of how they are suggested to be handled.
Along the processing path errors may occur
As a consequence of the latter, components must be prepared to receive an error message, or a list thereof, instead of the actually expected data. Errors should only be forwarded in case there are no valid continuations that have a chance to provide a response to the IPA user. Subsequent components may be able to handle the error accordingly or convert it into a reply to the user.
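A minimal sketch of such a forwarded error message; all field names are chosen for illustration only and are not normative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IpaError:
    """Illustrative error message forwarded along the processing chain."""
    source_component: str           # e.g. "ASR", "NLU" or "Provider Selection Service"
    code: str                       # machine-readable identifier of the error
    message: str                    # human-readable description, e.g. for logging
    recoverable: bool = False       # whether a subsequent component may still continue
    details: Optional[dict] = None  # optional additional context

error = IpaError(
    source_component="Provider Selection Service",
    code="provider.unreachable",
    message="The flight search IPA Provider did not respond.",
    recoverable=True,
)
```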
In case multiple errors are received the component should try (in the following order) to
In case errors could be handled it is recommended to log the errors for debugging.
An error message should contain at least
It can optionally contain
This section needs to be updated to match the changes as introduced above.
This section expands on the use case above, filling in details according to the sample architecture.
A user would like to plan a trip to an international conference and she needs visa information and airline reservations.
The user starts by asking a general purpose assistant (IPA Client, on the left of the diagram) about what the visa requirements are for her situation. For a common situation, such as citizens of the EU traveling to the United States, the IPA is able to answer the question directly from one of its dialogs 1-n getting the information from a web service that it knows about via the corresponding Data Provider. However, for less common situations (for example, a citizen of South Africa traveling to Japan), the generic IPA will try to identify a visa expert assistant application from the dialog registry. If it finds one, it will connect the user with the visa expert, one of the IPA providers on the right side. The visa expert will then engage in a dialog with the user to find out the dates and purposes of travel and will inform the user of the visa process.
Once the user has found out about the visa, she tells the IPA that she wants to make airline reservations. If she wants to use a particular service, or use a particular airline, she would say something like "I want to book a flight on American". The IPA will then either connect the user with American's IPA or, if American doesn't have an IPA, will inform the user of that fact. On the other hand, if the user doesn't specify an airline, the IPA will find a general flight search IPA from its registry and connect the user with the IPA for that flight search service. The flight search IPA will then interact with the user to find appropriate flights.
A similar process would be repeated if the user wants to book a hotel, find a rental car, find out about local attractions in the destination city, etc. Booking a hotel could also involve interacting with the conference's IPA to find out about a designated conference hotel or special rates.
This section provides a detailed walkthrough for an NLU-based IPA that aligns the steps in the use case interaction with the architecture. It covers only the part of the example above in which the user asks for a flight with a specific airline. This very basic example assumes that this is the first request to the IPA and that there is a suitable dialog ready that matches the user's request. The flow may also vary, e.g., depending on the used Dialog Strategy and other optional items that may actually result in different flows. The walkthrough is split into two parts, one for the input path and one for the output path.
We begin with the case where the user's request can be handled by one of the internal Dialogs shown in the Dialog box of the figure. The input side is illustrated in the following figure. For the sake of completeness the complete architecture is shown, although the Generative AI component is not used. Therefore, this part is shown in a lighter color.
The output path begins where the local NLU and the IPA Providers are able to deliver their results. In both paths the best match for the intents and entities, based on the received data, has been identified. This path is illustrated in the following figure
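The two paths can be summarized in code as a single round trip through the layers; every function below is only a stand-in for the corresponding component in the figures, and none of the names or return values are normative.

```python
# Stubs standing in for the architecture components; a real system would call actual engines.
def asr(audio: bytes) -> str:
    return "i want to book a flight on american"

def nlu(text: str) -> dict:
    return {"intent": "book_flight", "entities": {"airline": "American"}}

def provider_selection(result: dict) -> list:
    return ["american_airlines_ipa"]

def dialog_manager(result: dict, candidates: list) -> str:
    return "Connecting you to the American Airlines assistant."

def tts(text: str) -> bytes:
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """Illustrative round trip for one interaction turn in an NLU-based IPA."""
    text = asr(audio_in)                                 # input path: Client Layer -> ASR
    nlu_result = nlu(text)                               # ASR -> NLU: intent and entities
    candidates = provider_selection(nlu_result)          # Provider Selection Service
    reply_text = dialog_manager(nlu_result, candidates)  # output path starts here
    return tts(reply_text)                               # TTS -> loudspeaker in the Client Layer

print(handle_turn(b"...raw audio..."))
```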
The general architecture of IPAs described in this document should be detailed in subsequent documents. Further work must be done to
The authors see the following situation at the time of writing
Component | Potentially related standards |
---|---|
IPA Client | |
IPA Service | none |
Dialog Manager | |
TTS | |
ASR | |
Core Dialog | |
Core Intent Set | none |
Dialog Registry | |
Provider Selection Service | none |
Accounts/Authentication | |
NLU | |
Knowledge Graph | |
Data Provider | none |
The table above is not meant to be exhaustive nor does it claim that the identified standards are suited for IPA implementations. They must be analyzed in more detail in subsequent work. The majority are starting points for further refinement. For instance, the authors consider it unlikely that VoiceXML will actually be used in IPA implementations.
Out of scope of a possible standardization is the implementation inside the IPA Providers and potential interoperability among them. However, standardization eases the integration of their exposed services or even allows services to be used across different providers. Actual IPA providers may make use of any upcoming standard to enhance their deployments as a marketplace of intelligent services.
This version of the document was written with the participation of members of the W3C Voice Interaction Community Group. The work of the following members has significantly facilitated the development of this document:
Abbreviation | Description |
---|---|
ASR | Automated Speech Recognition |
LLM | Large Language Model |
NLG | Natural Language Generation |
NLU | Natural Language Understanding |
TTS | Text to Speech |