Copyright © 2018-2019 the Contributors to JSON Representation of Semantic Information Report, Version 1.0, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document describes a JSON format for representing the results of semantic processing. This format is derived from concepts in the Extensible Multimodal Annotation (EMMA) specification for representing multimodal processing results, including results from natural language understanding. Since the original EMMA specification was written, JSON has become very popular for representing data and is used in a number of current natural language understanding toolkits. However, the JSON formats used in the different commercial toolkits vary significantly. For this reason, a common JSON representation that is capable of including all of the rich metadata that can be represented in EMMA is worth investigating. This would provide a common data interchange format across natural language understanding and other cognitive toolkits while taking advantage of the extensive existing JSON ecosystem.
This report was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction.
The XML EMMA specification contains features for representing extremely rich semantic information about utterances and other multimodal inputs. For example, information that can be represented in EMMA includes timing, alternatives, confidences, multimodal inputs, system outputs, human annotations, failures (no-input and uninterpreted inputs), language of input, and streaming results, among many other types of information. It would be extremely useful to be able to represent some of this information in JSON natural language results, especially if it were represented in a standard form that could be used across platforms.
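For instance, EMMA marks a timed-out input with the emma:no-input annotation. One possible JSON rendering of such a failure result, mirroring the EMMA annotation names, might look like the following (a minimal sketch; the exact property names are only illustrative and are not defined by this document):

{
  "emma": {
    "version": 2.0,
    "interpretation": {
      "id": "int1",
      "medium": "acoustic",
      "mode": "voice",
      "no-input": true
    }
  }
}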
In addition, while the original EMMA specification placed few constraints on the format of application-specific semantic data, current natural language toolkits have been more or less converging on common terminology such as "intents" and "entities". For this reason, it seems useful to include these concepts in a JSON-based natural language results format. This document proposes a format that includes both EMMA metadata and application-specific semantic information in the form of "intents" and "entities". The format presented here is simply a strawperson proposal that is intended to spark discussion.
The following example shows an EMMA XML document containing an N-best list (one-of) with two alternative speech recognition results as interpretations, one for "flights from boston to denver" and the other for a second possible recognition result, "flights from austin to denver". Metadata in the example includes the recognition result ("tokens"), confidences, the start and end times of the utterance, and the medium ("acoustic") and mode ("voice") of the input. The last two annotations make the format usable for representing multimodal inputs.
Example:
<emma:emma version="2.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
        http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
        emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The following example shows the same result in the proposed JSON format, with the addition of application-specific semantic information. The EMMA container elements are retained (one-of, group, sequence). New properties represent application-specific semantics (intent, intents, entities, entity, name, intentName, value, and type). The existing EMMA annotations confidence and tokens are extended for use in representing application-specific semantics. The intents array provides for multiple intents per utterance, even though this example utterance includes only one intent.
.{ "emma": { "version": 2.0, "one-of": { "id": "r1", "start": 1087995961542, "end": 1087995963542, "medium": "acoustic", "mode": "voice", "interpretations": [{ "interpretation": { "id": "int1", "confidence": 0.90, "tokens": "flights from boston to denver", "intents": [{ "intent": { "intentName": "flightSearch", "confidence": 0.99, "entities": [{ "entity": { "id": "e1", "confidence": 0.75, "tokens": "boston", "name": "origin", "value": "Boston", "type": "city" } }, { "entity": { "id": "e2", "confidence": 0.75, "tokens": "denver", "name": "destination", "value": "Denver", "type": "city" } } ] } }] } }, { "interpretation": { "id": "int2", "confidence": 0.75, "tokens": "flights from austin to denver", "intents": [{ "intent": { "intentName": "flightSearch", "confidence": 0.99, "entities": [{ "entity": { "id": "e1", "confidence": 0.75, "tokens": "austin", "name": "origin", "value": "Austin", "type": "city" } }, { "entity": { "id": "e2", "confidence": 0.75, "tokens": "Denver", "name": "destination", "value": "Denver", "type": "city" } } ] } }] } } ] } } }
An example of a possible representation for complex entities, where the "value" of an entity is itself a group of entities (shown with a heavy border in the graphic). This food order example contains two items, an entrée and a drink, each of which includes additional entities ("size", "topping", or "mainDish"). The example shows one interpretation from the "one-of" N-best list for the utterance "I want a large mushroom pizza and a small coke".
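A sketch of what such a nested representation might look like in JSON, with each item entity's "value" holding a group of entities (all entity names, types, and confidences here are illustrative, not part of the proposal):

{
  "interpretation": {
    "id": "int1",
    "confidence": 0.90,
    "tokens": "I want a large mushroom pizza and a small coke",
    "intents": [{
      "intent": {
        "intentName": "foodOrder",
        "confidence": 0.95,
        "entities": [
          { "entity": {
              "id": "e1",
              "name": "entree",
              "tokens": "a large mushroom pizza",
              "value": { "entities": [
                { "entity": { "name": "size", "value": "large" } },
                { "entity": { "name": "topping", "value": "mushroom" } },
                { "entity": { "name": "mainDish", "value": "pizza" } }
              ] }
          } },
          { "entity": {
              "id": "e2",
              "name": "drink",
              "tokens": "a small coke",
              "value": { "entities": [
                { "entity": { "name": "size", "value": "small" } },
                { "entity": { "name": "mainDish", "value": "coke" } }
              ] }
          } }
        ]
      }
    }]
  }
}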
Thanks to Michael Johnston for comments on an earlier draft.