MediaStreamTrack Insertable Media Processing using Streams

Unofficial Proposal Draft,

This version:
https://w3c.github.io/mediacapture-insertable-streams/
Feedback:
public-webrtc@w3.org with subject line “[mediacapture-insertable-streams] … message topic …” (archives)
Issue Tracking:
GitHub
Editors:
(Google)
(Google)

Abstract

This document defines an API surface for manipulating the bits on MediaStreamTracks carrying raw data. NOT AN ADOPTED WORKING GROUP DOCUMENT.

Status of this document

1. Introduction

The [WEBRTC-NV-USE-CASES] document describes several functions that can only be achieved by access to media (requirements N20-N22).

These use cases further require that processing can be done in worker threads (requirements N23-N24).

This specification gives an interface inspired by [WEBCODECS] to provide access to such functionality.

This specification provides access to raw media, which is the output of a media source such as a camera, microphone, screen capture, or the decoder part of a codec, and which is the input to the encoder part of a codec. The processed media can be consumed by any destination that can take a MediaStreamTrack, including HTML <video> and <audio> elements, RTCPeerConnection, canvas, or MediaRecorder.

2. Terminology

3. Specification

This specification shows the IDL extensions for [MEDIACAPTURE-STREAMS]. It defines some new objects that inherit the MediaStreamTrack interface, and can be constructed from a MediaStreamTrack.

The API consists of two elements. One is a track sink that is capable of exposing the unencoded frames from the track to a ReadableStream, and exposes a control channel for signals going in the opposite direction. The other is the inverse of that: a track source that takes media frames as input, and emits control signals that result from subsequent processing.

3.1. MediaStreamTrackProcessor interface

interface MediaStreamTrackProcessor {
    constructor(MediaStreamTrackProcessorInit init);
    attribute ReadableStream readable;  // VideoFrame or AudioFrame
    attribute WritableStream writableControl;  // MediaStreamTrackSignal
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
};

3.1.1. Internal slots

[track],
Track whose raw data is to be exposed by the MediaStreamTrackProcessor.

3.1.2. Constructor

MediaStreamTrackProcessor(init)
  1. If init.track is not a valid MediaStreamTrack, throw a TypeError.

  2. Let p be a new MediaStreamTrackProcessor object.

  3. Assign init.track to the [track] internal slot of p.

  4. Return p.

3.1.3. Attributes

readable, of type ReadableStream
Allows reading the frames flowing through the MediaStreamTrack stored in the [track] internal slot. If [track] is a video track, chunks read from readable will be VideoFrame objects. If [track] is an audio track, chunks read from readable will be AudioFrame objects.
writableControl, of type WritableStream
Allows sending control signals to [track]. Control signals are objects of type MediaStreamTrackSignal.
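The typical consumption pattern for readable is a stream reader loop that processes each frame and closes it promptly. Since MediaStreamTrackProcessor exists only in browsers, the sketch below demonstrates the same loop against a plain ReadableStream; the frame objects (with their close() method) are hypothetical stand-ins for VideoFrame.

```javascript
// Sketch of the reader loop an application would run on
// processor.readable. A plain ReadableStream stands in for the
// browser-only MediaStreamTrackProcessor, and each object here is a
// hypothetical stand-in for a VideoFrame with a close() method.
const frames = [1, 2, 3].map(id => ({
  id,
  closed: false,
  close() { this.closed = true; }
}));

const readable = new ReadableStream({
  start(controller) {
    for (const f of frames) controller.enqueue(f);
    controller.close();
  }
});

async function consumeFrames(stream, onFrame) {
  const reader = stream.getReader();
  while (true) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    onFrame(frame);
    // Frames hold scarce resources (e.g., GPU buffers), so close
    // each one as soon as it has been processed.
    frame.close();
  }
}

const seen = [];
await consumeFrames(readable, f => seen.push(f.id));
```

The same loop works unchanged on a real processor.readable; only the source of the stream differs.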

3.2. MediaStreamTrackGenerator interface

interface MediaStreamTrackGenerator : MediaStreamTrack {
    constructor(MediaStreamTrackGeneratorInit init);
    attribute WritableStream writable;  // VideoFrame or AudioFrame
    attribute ReadableStream readableControl;  // MediaStreamTrackSignal
};

dictionary MediaStreamTrackGeneratorInit {
  // At least one of the two fields must be set.
  // If both are provided and signalTarget.kind and kind do not match, the
  // MediaStreamTrackGenerator’s constructor will raise an exception.
  MediaStreamTrack signalTarget;
  MediaStreamTrackGeneratorKind kind;
};

3.2.1. Internal slots

[signalTarget],
(Optional) track to which the MediaStreamTrackGenerator will automatically forward control signals.
[kind],
Kind of media data for this MediaStreamTrackGenerator. Must be a valid MediaStreamTrackKind (i.e., "audio" or "video").

3.2.2. Constructor

MediaStreamTrackGenerator(init)
  1. If init.signalTarget is present and is not a valid MediaStreamTrack, or if init.kind is present and is not "audio" or "video", throw a TypeError.

  2. If both init.signalTarget and init.kind are present, and init.signalTarget.kind does not match init.kind, throw a TypeError.

  3. Let g be a new MediaStreamTrackGenerator object.

  4. If init.signalTarget is present, assign init.signalTarget to the [signalTarget] internal slot of g and initialize the kind field of g (inherited from MediaStreamTrack) with init.signalTarget.kind. Otherwise, initialize the kind field of g with init.kind.

  5. Return g.
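The validation in steps 1-2 (together with the "at least one field" requirement stated in the MediaStreamTrackGeneratorInit comments) can be sketched as a plain function. The function name and the isValidTrack predicate are illustrative assumptions, not part of the spec.

```javascript
// Sketch of the MediaStreamTrackGenerator constructor's validation
// (steps 1-2) and its kind resolution (step 4). `isValidTrack` is a
// hypothetical stand-in for the user agent's check that an object is
// a usable MediaStreamTrack.
function resolveGeneratorKind(init, isValidTrack) {
  const hasTarget = init.signalTarget != null;
  const hasKind = init.kind != null;
  // Step 1: type-check whichever fields are present.
  if (hasTarget && !isValidTrack(init.signalTarget)) {
    throw new TypeError('signalTarget is not a valid MediaStreamTrack');
  }
  if (hasKind && init.kind !== 'audio' && init.kind !== 'video') {
    throw new TypeError('kind must be "audio" or "video"');
  }
  // Per the init dictionary, at least one of the two fields must be set.
  if (!hasTarget && !hasKind) {
    throw new TypeError('either signalTarget or kind is required');
  }
  // Step 2: when both are given, their kinds must agree.
  if (hasTarget && hasKind && init.signalTarget.kind !== init.kind) {
    throw new TypeError('signalTarget.kind does not match kind');
  }
  // Step 4: the generator's kind comes from signalTarget when present.
  return hasTarget ? init.signalTarget.kind : init.kind;
}
```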

3.2.3. Attributes

writable,
Allows writing media frames to the MediaStreamTrackGenerator, which is itself a MediaStreamTrack. If the [kind] attribute is "audio", the stream will accept AudioFrame objects and fail with any other type. If [kind] is "video", the stream will accept VideoFrame objects and fail with any other type. When a frame is written to writable, the frame’s close() method is automatically invoked, so that its internal resources are no longer accessible from JavaScript.
readableControl,
Allows reading control signals sent from any sinks connected to the MediaStreamTrackGenerator. Control signals are objects of type MediaStreamTrackSignal.

3.3. Stream control

dictionary MediaStreamTrackSignal {
  required MediaStreamTrackSignalType signalType;
  double frameRate;
};

enum MediaStreamTrackSignalType {
  "request-frame",
  "set-min-frame-rate"
};

In the MediaStream model, apart from media, which flows from sources to sinks, there are also control signals that flow in the opposite direction (i.e., from sinks to sources via the track). A MediaStreamTrackProcessor is a sink and it allows sending control signals to its track and source via its writableControl field. A MediaStreamTrackGenerator is a track for which a custom source can be implemented by writing media frames to its writable field. Such a source can receive control signals sent by sinks via its readableControl field. Note that control signals are just hints that a sink can send to its track (or the source backing it). There is no obligation for a source or track to react to them.
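The signal-sending side of this model is an ordinary stream writer. The sketch below shows the pattern an application would use on processor.writableControl, with a plain WritableStream standing in for the browser-only object; the received array merely records what the underlying sink would see.

```javascript
// Sketch of sending control signals, as an app would do with
// processor.writableControl. A plain WritableStream stands in for the
// browser-only control channel; `received` records what the
// underlying track/source would observe.
const received = [];
const writableControl = new WritableStream({
  write(signal) { received.push(signal); }
});

const writer = writableControl.getWriter();
// Hint that the source should deliver at least 15 frames per second…
await writer.write({ signalType: 'set-min-frame-rate', frameRate: 15 });
// …and explicitly request one new frame.
await writer.write({ signalType: 'request-frame' });
await writer.close();
```

Because signals are hints, a conforming application should not assume the source reacts to either write.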

Control signals are represented as MediaStreamTrackSignal objects. The signalType field identifies the signal being sent. The frameRate field is used only with the "set-min-frame-rate" signal, and indicates the minimum frame rate requested of the source.

This set of control signals is intended to be extensible, so new signal types and parameters may be added in the future. Note also that this set of control signals is not intended to cover all possible signaling that may occur between platform sinks and tracks/sources. A user agent implementation is free to implement any internal signaling between specific types of sinks and specific types of sources, and it would not make sense to expose all such specific signaling to generic Web platform objects like track generators and processors, as they can be considered implementation details that may differ significantly across browsers. It is, however, a requirement of this specification that it is possible to operate a MediaStreamTrackGenerator connected to a MediaStreamTrackProcessor using only the Web Platform-exposed signals and without connecting any implicit signaling.

3.3.1. Implicit signaling

A common use case for processors and generators is to connect the media flow from a pre-existing platform track (e.g., camera, screen capture) to pre-existing platform sinks (e.g., peer connection, media element) with a transformation in between, in a chain like this:

Platform Track -> Processor -> Transform -> Generator -> Platform Sinks

Absent the Breakout Box elements, the platform sinks would normally send the signals directly to the platform track (and its source). Arguably, in many cases the source of the platform track can still be considered the source for the platform sinks, and it would be desirable to keep the original platform signaling even with the Processor -> Transform -> Generator chain between them. This can be achieved by assigning the platform track to the signalTarget field in MediaStreamTrackGeneratorInit. Note that such a connection between the platform track and the generator includes all possible internal signaling coming from the platform sinks, not just the generic signals exposed as MediaStreamTrackSignal objects via the readableControl and writableControl fields. Just connecting explicit signals from a generator to a processor (e.g., with a call like generator.readableControl.pipeTo(processor.writableControl)) forwards only the Web Platform-exposed signals, but ignores any internal custom signals, as there is no way for the platform sinks to know what the upstream source is.
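The explicit-forwarding path described above is an ordinary stream pipe. The sketch below shows that pattern with plain streams standing in for the browser-only control channels: only the Web Platform-exposed signal objects flow through such a pipe.

```javascript
// Sketch of explicit signal forwarding: a ReadableStream stands in
// for generator.readableControl and a WritableStream for
// processor.writableControl. Only Web-exposed MediaStreamTrackSignal
// objects travel through this pipe; internal platform signaling does
// not.
const signals = [
  { signalType: 'request-frame' },
  { signalType: 'set-min-frame-rate', frameRate: 10 }
];
const readableControl = new ReadableStream({
  start(controller) {
    for (const s of signals) controller.enqueue(s);
    controller.close();
  }
});

const forwarded = [];
const writableControlSink = new WritableStream({
  write(signal) { forwarded.push(signal); }
});

// Equivalent to generator.readableControl.pipeTo(processor.writableControl).
await readableControl.pipeTo(writableControlSink);
```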

4. Examples

Consider a face recognition function detectFace(videoFrame) that returns a face position (in some format), and a manipulation function blurBackground(videoFrame, facePosition) that returns a new VideoFrame similar to the given videoFrame, but with the non-face parts blurred.
let stream = await navigator.mediaDevices.getUserMedia({video: true});
let videoTrack = stream.getVideoTracks()[0];
let trackProcessor = new MediaStreamTrackProcessor({track: videoTrack});
let trackGenerator = new MediaStreamTrackGenerator({kind: 'video'});
let transformer = new TransformStream({
   async transform(videoFrame, controller) {
      let facePosition = detectFace(videoFrame);
      let newFrame = blurBackground(videoFrame, facePosition);
      videoFrame.close();
      controller.enqueue(newFrame);
  }
});

// After this, trackGenerator can be assigned to any sink such as a
// peer connection, or media element.
trackProcessor.readable
    .pipeThrough(transformer)
    .pipeTo(trackGenerator.writable);

// Forward Web-exposed signals to the original videoTrack.
trackGenerator.readableControl.pipeTo(trackProcessor.writableControl);

5. Security and Privacy considerations

The security of this API relies on existing mechanisms in the Web platform. As data is exposed using the VideoFrame and AudioFrame interfaces, the rules of those interfaces to deal with origin-tainted data apply. For example, data from cross-origin resources cannot be accessed due to existing restrictions to access such resources (e.g., it is not possible to access the pixels of a cross-origin image or video element). In addition to this, access to media data from cameras, microphones or the screen is subject to user authorization as specified in [MEDIACAPTURE-STREAMS] and [MEDIACAPTURE-SCREEN-SHARE].

The media data this API exposes is already available through other APIs (e.g., media elements + canvas + canvas capture). In addition to the media data, this API exposes some control signals such as requests for new frames. These signals are intended as hints and do not pose a significant security risk.

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Conformant Algorithms

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps can be implemented in any manner, so long as the end result is equivalent. In particular, the algorithms defined in this specification are intended to be easy to understand and are not intended to be performant. Implementers are encouraged to optimize.

Index

Terms defined by this specification

Terms defined by reference

References

Normative References

[MEDIACAPTURE-STREAMS]
Cullen Jennings; et al. Media Capture and Streams. 21 January 2021. CR. URL: https://www.w3.org/TR/mediacapture-streams/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[STREAMS]
Adam Rice; Domenic Denicola; 吉野剛史 (Takeshi Yoshino). Streams Standard. Living Standard. URL: https://streams.spec.whatwg.org/
[WebIDL]
Boris Zbarsky. Web IDL. 15 December 2016. ED. URL: https://heycam.github.io/webidl/

Informative References

[MEDIACAPTURE-SCREEN-SHARE]
Screen Capture. URL: https://w3c.github.io/mediacapture-screen-share/
[WEBCODECS]
WebCodecs. URL: https://wicg.github.io/web-codecs/
[WEBRTC-NV-USE-CASES]
Bernard Aboba. WebRTC Next Version Use Cases. 30 November 2020. WD. URL: https://www.w3.org/TR/webrtc-nv-use-cases/

IDL Index

interface MediaStreamTrackProcessor {
    constructor(MediaStreamTrackProcessorInit init);
    attribute ReadableStream readable;  // VideoFrame or AudioFrame
    attribute WritableStream writableControl;  // MediaStreamTrackSignal
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
};

interface MediaStreamTrackGenerator : MediaStreamTrack {
    constructor(MediaStreamTrackGeneratorInit init);
    attribute WritableStream writable;  // VideoFrame or AudioFrame
    attribute ReadableStream readableControl;  // MediaStreamTrackSignal
};

dictionary MediaStreamTrackGeneratorInit {
  // At least one of the two fields must be set.
  // If both are provided and signalTarget.kind and kind do not match, the
  // MediaStreamTrackGenerator’s constructor will raise an exception.
  MediaStreamTrack signalTarget;
  MediaStreamTrackGeneratorKind kind;
};

dictionary MediaStreamTrackSignal {
  required MediaStreamTrackSignalType signalType;
  double frameRate;
};

enum MediaStreamTrackSignalType {
  "request-frame",
  "set-min-frame-rate"
};