Copyright © 2018-2019 the Contributors to JSON Representation of Semantic Information Report, Version 1.0, published by the Voice Interaction Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This document describes a JSON format for representing the results of semantic processing. This format is derived from concepts in the Extensible Multimodal Annotation (EMMA) specification for representing multimodal processing results, including results from natural language understanding. Since the original EMMA specification was written, JSON has become very popular for representing data and is used in a number of current natural language understanding toolkits. However, the JSON formats used in the different commercial toolkits vary significantly. For this reason, a common JSON representation that is capable of including all of the rich metadata that can be represented in EMMA is worth investigating. This would provide a common data interchange format across natural language understanding and other cognitive toolkits while taking advantage of the extensive existing JSON ecosystem.
This report was published by the Voice Interaction Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
Comments should be sent to the Voice Interaction Community Group public mailing list (public-voiceinteraction@w3.org), archived at https://lists.w3.org/Archives/Public/public-voiceinteraction.
The XML EMMA specification contains features for representing extremely rich semantic information about utterances and other multimodal inputs. For example, information that can be represented in EMMA includes timing, alternatives, confidences, multimodal inputs, system outputs, human annotations, failures (no-input and uninterpreted inputs), language of input, and streaming results, among many other types of information. It would be extremely useful to be able to represent some of this information in JSON natural language results, especially if it were represented in a standard form that could be used across platforms.
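For instance, EMMA marks a timed-out input with the emma:no-input annotation. One possible JSON rendering of such a failure result, mirroring the EMMA annotation names, might look like the following (a minimal sketch; the exact property names are only illustrative and are not defined by this document):

{
  "emma": {
    "version": 2.0,
    "interpretation": {
      "id": "int1",
      "medium": "acoustic",
      "mode": "voice",
      "no-input": true
    }
  }
}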
In addition, while the original EMMA specification placed few constraints on the format of application-specific semantic data, current natural language toolkits have been more or less converging on common terminology such as "intents" and "entities". For this reason, it seems useful to include these concepts in a JSON-based natural language results format. This document proposes a format that includes both EMMA metadata and application-specific semantic information in the form of "intents" and "entities". The format presented here is simply a strawperson proposal that is intended to spark discussion.
The following example shows an EMMA XML document containing an N-best list (one-of) with two alternative speech recognition results as interpretations, one for "flights from boston to denver" and the other for a second possible recognition result, "flights from austin to denver". Metadata in the example includes the recognition result ("tokens"), confidences, the start and end times of the utterance, and the medium ("acoustic") and mode ("voice") of the input. The last two annotations make the format usable for representing multimodal inputs.
Example:
<emma:emma version="2.0"
    xmlns:emma="http://www.w3.org/2003/04/emma"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2003/04/emma
        http://www.w3.org/TR/2009/REC-emma-20090210/emma.xsd"
    xmlns="http://www.example.com/example">
  <emma:one-of id="r1" emma:start="1087995961542" emma:end="1087995963542"
      emma:medium="acoustic" emma:mode="voice">
    <emma:interpretation id="int1" emma:confidence="0.75"
        emma:tokens="flights from boston to denver">
      <origin>Boston</origin>
      <destination>Denver</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.68"
        emma:tokens="flights from austin to denver">
      <origin>Austin</origin>
      <destination>Denver</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
The following example shows the same result in the proposed JSON format, with the addition of application-specific semantic information. The EMMA container elements are retained (one-of, group, sequence). New properties represent application-specific semantics (intent, intents, entities, entity, name, intentName, value, and type). The existing EMMA annotations confidence and tokens are extended for use in representing application-specific semantics. The intents array provides for multiple intents per utterance, even though this example utterance includes only one intent.
.{ "emma": { "version": 2.0, "one-of": { "id": "r1", "start": 1087995961542, "end": 1087995963542, "medium": "acoustic", "mode": "voice", "interpretations": [{ "interpretation": { "id": "int1", "confidence": 0.90, "tokens": "flights from boston to denver", "intents": [{ "intent": { "intentName": "flightSearch", "confidence": 0.99, "entities": [{ "entity": { "id": "e1", "confidence": 0.75, "tokens": "boston", "name": "origin", "value": "Boston", "type": "city" } }, { "entity": { "id": "e2", "confidence": 0.75, "tokens": "denver", "name": "destination", "value": "Denver", "type": "city" } } ] } }] } }, { "interpretation": { "id": "int2", "confidence": 0.75, "tokens": "flights from austin to denver", "intents": [{ "intent": { "intentName": "flightSearch", "confidence": 0.99, "entities": [{ "entity": { "id": "e1", "confidence": 0.75, "tokens": "austin", "name": "origin", "value": "Austin", "type": "city" } }, { "entity": { "id": "e2", "confidence": 0.75, "tokens": "Denver", "name": "destination", "value": "Denver", "type": "city" } } ] } }] } } ] } } }
An example of a possible representation for complex entities, where the "value" of an entity is itself a group of entities (shown with a heavy border in the graphic). This food order example contains two items, an entrée and a drink, each of which includes additional entities ("size", "topping", or "mainDish"). The example shows one interpretation from the "one-of" N-best list for the utterance "I want a large mushroom pizza and a small coke".
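A sketch of what such a nested representation might look like in JSON, with each item entity's "value" holding a group of entities (all entity names, types, and confidences here are illustrative, not part of the proposal):

{
  "interpretation": {
    "id": "int1",
    "confidence": 0.90,
    "tokens": "I want a large mushroom pizza and a small coke",
    "intents": [{
      "intent": {
        "intentName": "foodOrder",
        "confidence": 0.95,
        "entities": [
          { "entity": {
              "id": "e1",
              "name": "entree",
              "tokens": "a large mushroom pizza",
              "value": { "entities": [
                { "entity": { "name": "size", "value": "large" } },
                { "entity": { "name": "topping", "value": "mushroom" } },
                { "entity": { "name": "mainDish", "value": "pizza" } }
              ] }
          } },
          { "entity": {
              "id": "e2",
              "name": "drink",
              "tokens": "a small coke",
              "value": { "entities": [
                { "entity": { "name": "size", "value": "small" } },
                { "entity": { "name": "mainDish", "value": "coke" } }
              ] }
          } }
        ]
      }
    }]
  }
}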
Thanks to Michael Johnston for comments on an earlier draft.