This document describes use cases and requirements for a proposed API for a Next Generation Audio (NGA) interface in a web environment, enabling web applications to provide user interfaces to control NGA experiences.
In contrast to traditional audio codecs (e.g. AAC), NGA codecs employ object-based audio coding and in-band transmission of a variety of metadata that controls the playback-side rendering of audio content. The proposed API exposes information describing NGA scene metadata and allows interaction with the respective codec to individualize the user experience during audio playback.
The API is envisioned to be codec-agnostic, supporting various NGA implementations while maintaining compatibility with protected media content.
This document is a work in progress and serves as a base for discussion. Feedback is highly appreciated, particularly regarding the optimal placement of the API within the web platform architecture.
In contrast to traditional audio codecs (e.g. AAC), Next Generation Audio (NGA) codecs employ object-based audio coding and in-band transmission of a variety of metadata that controls the playback-side rendering of audio content. Object-based audio coding enables content providers to deliver individual audio components to the playback side in a flexible way: the components can be combined into a single bitstream or separated into multiple bitstreams.
NGA codecs provide a rich set of metadata and allow users to change certain aspects of an audio presentation, including:
An extensive set of standardized NGA metadata is crucial for delivering an immersive audio experience: it enables the system to adapt to the given listening environment and to user preferences, and it enables advanced user interaction. The same metadata that allows the user to interact with the content can be used to drive corresponding application interfaces.
Perhaps the most basic metadata in the context of NGA is the grouping of objects and settings into so-called [=preselections=]. [=Preselections=] are pre-authored combinations of object gain settings that can be activated or deactivated and that effectively set defaults for the parameters underneath. Preselections can also expose further customization options, e.g. changing the dialogue loudness within a metadata-defined range, and thus serve as the entry point for any further customization.
The envisioned API should expose information describing said metadata of the NGA scene and should allow interaction with the respective codec to individualize the user experience during audio playback. This document does not prejudge where the API should ideally be placed. For example, it could be added directly to {{HTMLMediaElement}} or to the corresponding {{AudioTrack}} object representing an NGA track.
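As a non-normative illustration, the exclusive-activation behaviour of preselections can be modeled with plain objects. All names below (`NgaAudioTrackModel`, `activate()`, `defaultGainsDb`, etc.) are hypothetical placeholders for discussion, not a proposed interface; a real API might hang such a collection off an {{AudioTrack}}-like object.

```typescript
// Hypothetical model of NGA preselections as they might be exposed on an
// AudioTrack-like object. All names are illustrative placeholders.
interface Preselection {
  id: string;
  label: string;
  active: boolean;
  // Per-object default gains (in dB) that this preselection applies.
  defaultGainsDb: Record<string, number>;
}

class NgaAudioTrackModel {
  constructor(public preselections: Preselection[]) {}

  // Exactly one preselection is active at a time: activating one
  // deactivates all others.
  activate(id: string): Preselection {
    const target = this.preselections.find(p => p.id === id);
    if (!target) throw new Error(`unknown preselection: ${id}`);
    for (const p of this.preselections) p.active = p === target;
    return target;
  }
}
```

For example, a track might carry a "Default mix" and a "Dialogue enhanced" preselection, and switching between them would reset the underlying object gains to the defaults authored for the chosen preselection.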
The following use cases are all audio-centric, but it should be noted that video-specific use cases exist as well. Over time, this document may pick up further use cases that are not audio-specific.
The next section describes these audio use cases textually; for a more immersive demonstration, the reader is directed to https://youtu.be/fxsvVcIOiJA.
One basic NGA feature is selecting a so-called audio preselection. An audio preselection is a set of predefined mix parameters for the included content components. Such components of an audio program, also known as audio objects, can for example be the dialogue object or the background audio object.
A user can choose from a variety of preselections on the playback side to enable a basic form of adaptation to their personal preference for rendering these components.
Examples include:
With an NGA codec, content creators can enable gain adjustment for certain content components. Users may set these gain values to their preference, within limits chosen by the creator.
One example application is changing the gain of a spoken language component for better intelligibility, e.g., the commentator, [=Audio Description=], or the main dialogue of a movie. To offer this functionality, a content creator could author the metadata during production to allow gain interactivity for the component within a defined range (for example, between -6 dB and +12 dB).
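A minimal sketch of how a user agent or application might honor such an authored range is shown below. The helper name and the range values are illustrative assumptions, not part of any codec specification.

```typescript
// Illustrative only: clamp a user-requested gain to the interactivity range
// authored into the NGA metadata (e.g. -6 dB to +12 dB for a dialogue object).
interface GainInteractivity {
  minDb: number; // lower bound authored by the content creator
  maxDb: number; // upper bound authored by the content creator
}

function clampGainDb(requestedDb: number, range: GainInteractivity): number {
  return Math.min(range.maxDb, Math.max(range.minDb, requestedDb));
}
```

With the example range from the text, a request for +20 dB would be limited to +12 dB, and a request for -10 dB to -6 dB.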
With NGA, a content creator can allow position interactivity for individual content components. This feature can be used to further enhance [=Audio Description=] intelligibility. Users may use this functionality for better spatial separation between the main dialogue and the Audio Description.
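As a hypothetical example of such spatial separation, an application could offset the Audio Description object's azimuth away from the main dialogue position. The function below is an illustrative sketch; the wrapping convention and degree range are assumptions, not taken from any NGA specification.

```typescript
// Illustrative only: place the Audio Description object at an azimuth offset
// from the main dialogue for better spatial separation, wrapping the result
// into the [-180, 180) degree range.
function separatedAzimuth(dialogueDeg: number, offsetDeg: number): number {
  const a = dialogueDeg + offsetDeg;
  return ((a + 180) % 360 + 360) % 360 - 180; // wrap into [-180, 180)
}
```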
Examples include:
NGA codecs enable selection between multiple content components. This is typically a choice between different content alternatives, of which exactly one can be active at a time.
Examples include:
One enhanced NGA concept for better intelligibility of dialogue components in a TV program is called narrative importance. It groups audio content components into hierarchies based on their importance for understanding the scene.
For example, the highest hierarchy level contains elements that are essential for the narrative (dialogue, semantically rich effects, [=Audio Description=]), while lower hierarchy levels contain background music and non-essential effects. Multiple layers of importance can be created, with all necessary gain modifications applied simultaneously through a single control.
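The single-control behaviour can be sketched as follows. The layer numbering, the mapping from the control value to per-layer gains, and the -60 dB attenuation are all hypothetical choices for illustration; real codecs define their own mappings.

```typescript
// Illustrative only: one "narrative importance" control drives the gains of
// several importance layers at once. Thresholds and the attenuation value
// are hypothetical, not taken from any codec specification.
interface ImportanceLayer {
  name: string;
  importance: number; // 1 = essential for the narrative, higher = less essential
}

// control in [0, 1]: 1 keeps every layer at full level, 0 keeps only the
// most important layer audible.
function layerGainsDb(layers: ImportanceLayer[], control: number): Record<string, number> {
  const maxImportance = Math.max(...layers.map(l => l.importance));
  const gains: Record<string, number> = {};
  for (const l of layers) {
    // Normalize importance to [0, 1]; layers above the control get attenuated.
    const normalized = (l.importance - 1) / Math.max(1, maxImportance - 1);
    gains[l.name] = normalized <= control ? 0 : -60; // -60 dB is effectively muted
  }
  return gains;
}
```

The point of the sketch is that a listener moves one slider, and all necessary gain modifications across the layers are derived from it.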
In the examples above, a number of requirements on the API are implicitly assumed:
The API MUST be codec-agnostic. There are many NGA codecs from competing companies, each offering all or part of the personalization use cases outlined above. The API should work for all of them, so that an implementation has the highest value for browser implementers as well as application developers.
The API MUST NOT preclude the use of protected media. Most NGA codecs are used for commercial content, much of which is rights protected and therefore DRM protected. In technical terms, this likely means that the API has to work when media is played back using EME.
The API MUST work when multiple media streams are being rendered/decoded. In most media playback scenarios, there is at least one video and one audio codec being used. Since personalization options are typically tied to a specific media stream, the API needs to be specific to individual media streams.
The API MUST be usable while media is playing and MUST be operable asynchronously to the media. When a personalization change is made, the user would expect the change to be effective immediately, not only once the already-buffered media is eventually rendered.
The API MUST NOT block on hardware/codec access. The personalization features often require setting parameters on either the decoder or the rendering pipeline, which may run in different threads or on different hardware. Calls to set or get personalization features SHOULD return immediately, likely with a promise.
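The non-blocking requirement can be illustrated with a small mock: the setter records the request, returns a promise immediately, and the (simulated) decoder thread applies the change later. The class and method names are hypothetical placeholders, not a proposed interface.

```typescript
// Illustrative only: a personalization setter that returns immediately with a
// promise, while the change is applied asynchronously on the decoder or
// rendering pipeline rather than blocking the calling thread.
class PersonalizationProxy {
  private pendingDb: number | null = null;
  appliedDb = 0; // last value actually applied by the (simulated) decoder

  setDialogueGainDb(db: number): Promise<number> {
    this.pendingDb = db;
    return new Promise(resolve => {
      // setTimeout stands in for a cross-thread hand-off to the decoder.
      setTimeout(() => {
        this.appliedDb = this.pendingDb!;
        this.pendingDb = null;
        resolve(this.appliedDb);
      }, 0);
    });
  }
}
```

The caller gets control back at once and can await the promise to learn when (and with which value) the change took effect.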
This section explores whether existing APIs can be used to decode and personalize NGA content.
The most obvious alternative would be to use a [[WEBCODECS]] {{AudioDecoder}} to decode NGA streams and [[WEBAUDIO]] {{AudioNode}}s to mix the resulting components. However, this approach has several critical limitations that make it unsuitable for real-world NGA applications.
Beyond these concerns, there are further issues, as detailed below.
The industry has adopted an integrated stream delivery model in which audio components and metadata are delivered together in a single stream. Content delivered this way would need to be decoded into several streams of audio data and metadata that could then be mixed using Web Audio nodes. This would require WebCodecs to be able to output more than one stream, one of which is time-aligned metadata used for the mixing process.
Delivering audio components as separate streams might seem like a solution but presents significant drawbacks:
A JavaScript or WASM-based decoder with output passed to the Web Audio API would face the same metadata bridging problems as the WebCodecs approach, plus additional drawbacks:
Current web platform APIs such as WebCodecs [[WEBCODECS]] {{AudioEncoder}} and [[MEDIASTREAM-RECORDING]] {{MediaRecorder}} allow websites to record and encode audio. We do not propose to add support for Next Generation Audio codecs to such APIs at this time.
The editors would like to thank all contributors to this draft and the members of the W3C Media Entertainment Interest Group for their valuable feedback and contributions.