MediaStreamTrack Insertable Media Processing using Streams

1. Introduction

The [WEBRTC-NV-USE-CASES] document describes several functions that can only be achieved by access to media (requirements N20-N22), including, but not limited to:

Funny Hats
Machine Learning
Virtual Reality Gaming

These use cases further require that processing can be done in worker threads (requirements N23-N24).

This specification gives an interface based on [WEBCODECS] and [STREAMS] to provide access to such functionality.

This specification provides access to raw media, which is the output of a media source such as a camera, microphone, screen capture, or the decoder part of a codec, and the input to the encoder part of a codec. The processed media can be consumed by any destination that can take a MediaStreamTrack, including HTML <video> tags, RTCPeerConnection, canvas or MediaRecorder.

This specification explicitly aims to support the following use cases:

Video processing: This is the "Funny Hats" use case, where the input is a single video track and the output is a transformed video track.
Custom video sink: In this use case, the purpose is not producing a processed MediaStreamTrack, but to consume the media in a different way. For example, an application could use [WEBCODECS] and [WEBTRANSPORT] to create an RTCPeerConnection-like sink, but using different codec configuration and networking protocols.
Multi-source processing: In this use case, two or more tracks are combined into one. For example, a presentation containing a live weather map and a camera track with the speaker can be combined to produce a weather report application.

Note: There is no WG consensus on whether or not audio use cases should be supported.

Note: The WG expects that the Streams spec will adopt the solutions outlined in the relevant explainer, to solve some issues with the current Streams specification.

2. Specification

This specification shows the IDL extensions for [MEDIACAPTURE-STREAMS]. It defines some new objects that inherit the MediaStreamTrack interface, and can be constructed from a MediaStreamTrack.

The API consists of two elements. One is a track sink that is capable of exposing the unencoded media frames from the track to a ReadableStream. The other one is the inverse of that: it provides a track source that takes media frames as input.

2.1. MediaStreamTrackProcessor

A MediaStreamTrackProcessor allows the creation of a ReadableStream that can expose the media flowing through a given MediaStreamTrack. If the MediaStreamTrack is a video track, the chunks exposed by the stream will be VideoFrame objects.

This makes MediaStreamTrackProcessor effectively a sink in the MediaStream model.

A MediaStreamTrackProcessor internally contains a circular queue that allows buffering incoming media frames delivered by the track it is connected to. This buffering allows the MediaStreamTrackProcessor to temporarily hold frames waiting to be read from its associated ReadableStream. The application can influence the maximum size of the queue via a parameter provided in the MediaStreamTrackProcessor constructor. However, the maximum size of the queue is decided by the UA and can change dynamically, but it will not exceed the size requested by the application. If the application does not provide a maximum size parameter, the UA is free to decide the maximum size of the queue.

When a new frame arrives at the MediaStreamTrackProcessor, if the queue has reached its maximum size, the oldest frame will be removed from the queue, and the new frame will be added to the queue. This means that for the particular case of a queue with a maximum size of 1, if there is a queued frame, it will always be the most recent one.

The UA is also free to remove any frames from the queue at any time. The UA may remove frames in order to save resources or to improve performance in specific situations. In all cases, frames that are not dropped must be made available to the ReadableStream in the order in which they arrive at the MediaStreamTrackProcessor.

A MediaStreamTrackProcessor makes frames available to its associated ReadableStream only when a read request has been issued on the stream. The idea is to avoid the stream’s internal buffering, which does not give the UA enough flexibility to choose the buffering policy.
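The drop-oldest policy described above can be illustrated with a small model. This is only a sketch: FrameQueueModel and its members are hypothetical names, not part of this API, and a real UA may additionally drop frames at its own discretion.

```javascript
// Hypothetical model of the processor's internal queue (not a real API).
class FrameQueueModel {
  constructor(maxBufferSize = 1) {
    this.maxBufferSize = Math.max(1, maxBufferSize);
    this.frames = [];   // buffered frames, oldest first
    this.dropped = [];  // frames evicted because the queue was full
  }

  // Called when the track delivers a frame.
  push(frame) {
    if (this.frames.length >= this.maxBufferSize) {
      // Queue is full: evict the oldest frame to make room.
      this.dropped.push(this.frames.shift());
    }
    this.frames.push(frame);
  }

  // Called when a read request is served; returns the oldest frame.
  shift() {
    return this.frames.shift();
  }
}

// With a maximum size of 1, the queued frame is always the most recent one.
const q = new FrameQueueModel(1);
q.push('frame-1');
q.push('frame-2');
q.push('frame-3');
// q.frames is now ['frame-3'] and q.dropped is ['frame-1', 'frame-2'].
```

A larger maximum size trades latency for fewer dropped frames when the reader is briefly slower than the source.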

2.1.1. Interface definition

[Exposed=DedicatedWorker]
interface MediaStreamTrackProcessor {
    constructor(MediaStreamTrackProcessorInit init);
    readonly attribute ReadableStream readable;
};

dictionary MediaStreamTrackProcessorInit {
  required MediaStreamTrack track;
  [EnforceRange] unsigned short maxBufferSize;
};

Note: There is WG consensus that the interface should be exposed on DedicatedWorker. There is no WG consensus on whether or not the interface should be exposed on Window.

Note: There is consensus in the WG that creating a MediaStreamTrackProcessor from a MediaStreamTrack of kind "video" should exist. There is no WG consensus on whether or not creating a MediaStreamTrackProcessor from a MediaStreamTrack of kind "audio" should be supported.

2.1.2. Internal slots

[[track]]: Track whose raw data is to be exposed by the MediaStreamTrackProcessor.
[[maxBufferSize]]: The maximum number of media frames to be buffered by the MediaStreamTrackProcessor as specified by the application. It may have no value if the application does not provide it. Its minimum valid value is 1.
[[queue]]: A queue used to buffer media frames not yet read by the application.
[[numPendingReads]]: An integer whose value represents the number of read requests issued by the application that have not yet been handled.
[[isClosed]]: A boolean whose value indicates if the MediaStreamTrackProcessor is closed.

2.1.3. Constructor

MediaStreamTrackProcessor(init)

If init.track is not a valid MediaStreamTrack, throw a TypeError.
Let maxBufferSize be 1.
If init.maxBufferSize has an integer value greater than 1, run the following substeps:
1. Set maxBufferSize to init.maxBufferSize.
2. The user agent MAY decide to clamp maxBufferSize to a lower value, but no lower than 1.
  
  Clamping maxBufferSize can be useful for some sources like cameras, for instance in case they can only use a limited number of VideoFrames at any given time.
Let processor be a new MediaStreamTrackProcessor object.
Set processor.[[track]] to init.track.
Set processor.[[maxBufferSize]] to maxBufferSize.
Set processor.[[queue]] to an empty queue.
Set processor.[[numPendingReads]] to 0.
Set processor.[[isClosed]] to false.
Return processor.
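The maxBufferSize steps of the constructor can be sketched as a plain function. The helper name resolveMaxBufferSize and the uaLimit parameter are assumptions made for illustration; the specification leaves the clamp value entirely to the user agent.

```javascript
// Hypothetical helper mirroring the maxBufferSize steps of the constructor.
// `uaLimit` stands in for whatever upper bound a user agent might choose.
function resolveMaxBufferSize(requested, uaLimit = Infinity) {
  let maxBufferSize = 1; // default when nothing usable is requested
  if (Number.isInteger(requested) && requested > 1) {
    maxBufferSize = requested;
    // The UA may clamp to a lower value, but never below 1.
    maxBufferSize = Math.max(1, Math.min(maxBufferSize, uaLimit));
  }
  return maxBufferSize;
}
```

For example, under these assumptions resolveMaxBufferSize(5, 3) yields 3, while an omitted or zero request yields 1.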

2.1.4. Attributes

readable, of type ReadableStream, readonly

Allows reading the frames delivered by the MediaStreamTrack stored in the [[track]] internal slot. This attribute is created the first time it is invoked according to the following steps:

Initialize this.readable to be a new ReadableStream.
Set up this.readable with its pullAlgorithm set to processorPull with this as parameter, cancelAlgorithm set to processorCancel with this as parameter, and highWaterMark set to 0.

The processorPull algorithm is given a processor as input. It is defined by the following steps:

Increment the value of the processor.[[numPendingReads]] by 1.
Queue a task to run the maybeReadFrame algorithm with processor as parameter.
Return a promise resolved with undefined.

The maybeReadFrame algorithm is given a processor as input. It is defined by the following steps:

If processor.[[queue]] is empty, abort these steps.
If processor.[[numPendingReads]] equals zero, abort these steps.
Let frame be the result of dequeueing a frame from processor.[[queue]].
Enqueue frame in processor.readable.
Decrement processor.[[numPendingReads]] by 1.
Go to step 1.
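The interplay between processorPull and maybeReadFrame can be sketched with a synchronous model. This is an illustration under simplifying assumptions: ProcessorModel, frameArrived and delivered are hypothetical names, tasks run immediately rather than being queued on the event loop, and frame arrival is reduced to a plain push with no size limit.

```javascript
// Hypothetical, synchronous model of the pull-driven delivery logic.
class ProcessorModel {
  constructor() {
    this.queue = [];          // models [[queue]]
    this.numPendingReads = 0; // models [[numPendingReads]]
    this.delivered = [];      // stands in for chunks enqueued in `readable`
  }

  // Models processorPull: one more read request is outstanding.
  pull() {
    this.numPendingReads += 1;
    this.maybeReadFrame();
  }

  // Simplified stand-in for a frame arriving from the track.
  frameArrived(frame) {
    this.queue.push(frame);
    this.maybeReadFrame();
  }

  // Models maybeReadFrame: deliver while both a frame and a read are pending.
  maybeReadFrame() {
    while (this.queue.length > 0 && this.numPendingReads > 0) {
      this.delivered.push(this.queue.shift());
      this.numPendingReads -= 1;
    }
  }
}
```

In this model a frame that arrives with no outstanding read stays in the queue, which is the behavior that keeps buffering under the processor's control rather than the stream's.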

The processorCancel algorithm is given a processor as input. It is defined by running the following steps:

Run the processorClose algorithm with processor as parameter.
Return a promise resolved with undefined.

The processorClose algorithm is given a processor as input. It is defined by running the following steps:

If processor.[[isClosed]] is true, abort these steps.
Disconnect processor from processor.[[track]]. The mechanism to do this is UA specific and the result is that processor is no longer a sink of processor.[[track]].
Close processor.readable.[[controller]].
Empty processor.[[queue]].
Set processor.[[isClosed]] to true.

2.1.5. Handling interaction with the track

When the [[track]] of a MediaStreamTrackProcessor processor delivers a frame to processor, the UA MUST execute the handleNewFrame algorithm with processor as parameter.

The handleNewFrame algorithm is given a processor as input. It is defined by running the following steps:

If processor.[[queue]] has processor.[[maxBufferSize]] elements, run the following steps:
1. Let droppedFrame be the result of dequeueing processor.[[queue]].
2. Run the Close VideoFrame algorithm with droppedFrame.
Enqueue the new frame media data in processor.[[queue]].
Queue a task to run the maybeReadFrame algorithm with processor as parameter.

At any time, the UA MAY remove any frame from processor.[[queue]]. The UA may decide to remove frames from processor.[[queue]], for example, to prevent resource exhaustion or to improve performance in certain situations.

The application may detect that frames have been dropped by noticing that there is a gap in the timestamps of the frames.
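One way an application might estimate dropped frames from timestamp gaps is sketched below. The helper countMissingFrames is hypothetical, and it assumes a constant nominal frame interval, which real sources with variable frame rates do not guarantee; treat the result as a heuristic. Timestamps are taken to be in microseconds, as for VideoFrame.timestamp.

```javascript
// Hypothetical heuristic: estimate how many frames are missing from a
// sequence of timestamps (microseconds), given a nominal frame interval.
function countMissingFrames(timestampsUs, frameIntervalUs) {
  let missing = 0;
  for (let i = 1; i < timestampsUs.length; i++) {
    const delta = timestampsUs[i] - timestampsUs[i - 1];
    // A delta close to N intervals suggests N - 1 frames were skipped.
    missing += Math.max(0, Math.round(delta / frameIntervalUs) - 1);
  }
  return missing;
}

// Roughly 30 fps with a gap of three intervals before the last frame.
const missing = countMissingFrames([0, 33333, 66666, 166665], 33333);
// missing is 2
```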

When the [[track]] of a MediaStreamTrackProcessor processor ends, the processorClose algorithm must be executed with processor as parameter.

2.2. VideoTrackGenerator

A VideoTrackGenerator allows the creation of a video source for a MediaStreamTrack in the MediaStream model that generates its frames from a Stream of VideoFrame objects. It has two readonly attributes: a writable WritableStream and a track MediaStreamTrack.

The VideoTrackGenerator is the underlying sink of its writable attribute. The track attribute is the output. Further tracks connected to the same VideoTrackGenerator can be created using the clone method on the track attribute.

The WritableStream accepts VideoFrame objects. When a VideoFrame is written to writable, the frame’s close() method is automatically invoked, so that its internal resources are no longer accessible from JavaScript.

Note: There is consensus in the WG that a source capable of generating a MediaStreamTrack of kind "video" should exist. There is no WG consensus on whether or not a source capable of generating a MediaStreamTrack of kind "audio" should exist.

2.2.1. Interface definition

[Exposed=DedicatedWorker]
interface VideoTrackGenerator {
  constructor();
  readonly attribute WritableStream writable;
  attribute boolean muted;
  readonly attribute MediaStreamTrack track;
};

Note: There is WG consensus that this interface should be exposed on DedicatedWorker. There is no WG consensus on whether or not it should be exposed on Window.

2.2.2. Internal slots

[[track]]: The MediaStreamTrack output of this source
[[isMuted]]: A boolean whose value indicates whether this source, and all the MediaStreamTracks it sources, are currently muted.

2.2.3. Constructor

VideoTrackGenerator()

Let generator be a new VideoTrackGenerator object.
Let track be a newly created MediaStreamTrack with source set to generator and tieSourceToContext set to false.
Initialize generator.track to track.
Return generator.

2.2.4. Attributes

writable, of type WritableStream, readonly

Allows writing video frames to the VideoTrackGenerator. When this attribute is accessed for the first time, it MUST be initialized with the following steps:

Initialize this.writable to be a new WritableStream.
Set up this.writable, with its writeAlgorithm set to writeFrame with this as parameter, with closeAlgorithm set to closeWritable with this as parameter and abortAlgorithm set to closeWritable with this as parameter.

The writeFrame algorithm is given a generator and a frame as input. It is defined by running the following steps:

If frame is not a VideoFrame object, return a promise rejected with a TypeError.
If the value of frame’s [[Detached]] internal slot is true, return a promise rejected with a TypeError.
If generator.[[isMuted]] is false, for each live track sourced from generator, named track, run the following steps:
1. Let clone be the result of running the Clone VideoFrame algorithm with frame.
2. Send clone to track.
Run the Close VideoFrame algorithm with frame.
Return a promise resolved with undefined.

When the media data is sent to a track, the UA may apply processing (e.g., cropping and downscaling) to ensure that the media data sent to the track satisfies the track’s constraints. Each track may receive a different version of the media data depending on its constraints.
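The writeFrame steps can be modeled without any browser API. In this sketch FakeFrame is a stand-in for VideoFrame that has only the members the algorithm touches, generators and tracks are plain objects, and errors are thrown synchronously instead of being returned as rejected promises; all of these names are assumptions for illustration. Per-track constraint processing is not modeled.

```javascript
// Hypothetical stand-in for VideoFrame: only clone(), close() and a flag.
class FakeFrame {
  constructor(label) {
    this.label = label;
    this.detached = false; // models the [[Detached]] internal slot
  }
  clone() {
    if (this.detached) throw new TypeError('frame is closed');
    return new FakeFrame(this.label);
  }
  close() {
    this.detached = true;
  }
}

// Model of writeFrame. `generator` is {isMuted, tracks}, and each track is
// {live, received}. Throws where the spec returns a rejected promise.
function writeFrame(generator, frame) {
  if (!(frame instanceof FakeFrame)) throw new TypeError('not a frame');
  if (frame.detached) throw new TypeError('frame is closed');
  if (!generator.isMuted) {
    for (const track of generator.tracks) {
      // Each live track gets its own clone of the frame.
      if (track.live) track.received.push(frame.clone());
    }
  }
  // The written frame is always closed, even while muted.
  frame.close();
}
```

The model makes one consequence visible: after a write, the application no longer owns a usable frame, so anything it still needs from the frame must be cloned beforehand.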

The closeWritable algorithm is given a generator as input. It is defined by running the following steps.

For each track t sourced from generator, end t.
Return a promise resolved with undefined.

muted, of type boolean

Mutes the VideoTrackGenerator. The getter steps are to return this.[[isMuted]]. The setter steps, given a value newValue, are as follows:

If newValue is equal to this.[[isMuted]], abort these steps.
Set this.[[isMuted]] to newValue.
Unless one has been queued already this run of the event loop, queue a task to run the following steps:
1. Let settledValue be this.[[isMuted]].
2. For each live track sourced by this, queue a task to set track’s muted state to settledValue.
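The coalescing behavior of the muted setter can be sketched with a manually driven task list. MutedModel, its tasks array and runTasks are hypothetical: the array stands in for the event loop's task queue, and the updateQueued flag approximates "one has been queued already this run of the event loop".

```javascript
// Hypothetical model of the muted setter's task coalescing.
class MutedModel {
  constructor(tracks) {
    this.isMuted = false;      // models [[isMuted]]
    this.tracks = tracks;      // plain objects: {live, muted}
    this.tasks = [];           // stand-in for the event loop's task queue
    this.updateQueued = false; // whether an update task is already queued
  }

  get muted() {
    return this.isMuted;
  }

  set muted(newValue) {
    if (newValue === this.isMuted) return;
    this.isMuted = newValue;
    if (this.updateQueued) return; // coalesce with the task already queued
    this.updateQueued = true;
    this.tasks.push(() => {
      this.updateQueued = false;
      const settledValue = this.isMuted; // value at the time the task runs
      for (const track of this.tracks) {
        if (track.live) this.tasks.push(() => { track.muted = settledValue; });
      }
    });
  }

  // Drains the task list in order, including tasks queued while draining.
  runTasks() {
    while (this.tasks.length > 0) this.tasks.shift()();
  }
}
```

In this model, toggling muted several times before the queued task runs results in a single update per live track, carrying only the final value.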

track, of type MediaStreamTrack, readonly

The MediaStreamTrack output. The getter steps are to return this.[[track]].

2.2.5. Specialization of MediaStreamTrack behavior

A VideoTrackGenerator acts as the source for one or more MediaStreamTracks. This section adds clarifications on how a MediaStreamTrack sourced from a VideoTrackGenerator behaves.

2.2.5.1. stop

The stop method stops the track. When the last track sourced from a VideoTrackGenerator ends, that VideoTrackGenerator's writable is closed.

2.2.5.2. Constrainable properties

The following constrainable properties are defined for any MediaStreamTracks sourced from a VideoTrackGenerator:

Property Name	Values	Notes
width	`ConstrainULong`	As a setting, this is the width, in pixels, of the latest frame received by the track. As a capability, `max` MUST reflect the largest width a `VideoFrame` may have, and `min` MUST reflect the smallest width a `VideoFrame` may have.
height	`ConstrainULong`	As a setting, this is the height, in pixels, of the latest frame received by the track. As a capability, `max` MUST reflect the largest height a `VideoFrame` may have, and `min` MUST reflect the smallest height a `VideoFrame` may have.
frameRate	`ConstrainDouble`	As a setting, this is an estimate of the frame rate based on frames recently received by the track. As a capability `min` MUST be zero and `max` MUST be the maximum frame rate supported by the system.
aspectRatio	`ConstrainDouble`	As a setting, this is the aspect ratio of the latest frame delivered by the track; this is the width in pixels divided by height in pixels as a double rounded to the tenth decimal place. As a capability, `min` MUST be the smallest aspect ratio supported by a `VideoFrame`, and `max` MUST be the largest aspect ratio supported by a `VideoFrame`.
resizeMode	`ConstrainDOMString`	As a setting, this string should be one of the members of `VideoResizeModeEnum`. The value "`none`" means that the frames output by the MediaStreamTrack are unmodified versions of the frames written to the `writable` backing the track, regardless of any constraints. The value "`crop-and-scale`" means that the frames output by the MediaStreamTrack may be cropped and/or downscaled versions of the source frames, based on the values of the width, height and aspectRatio constraints of the track. As a capability, the values "`none`" and "`crop-and-scale`" both MUST be present.

The applyConstraints method applied to a video MediaStreamTrack sourced from a VideoTrackGenerator supports the properties defined above. It can be used, for example, to resize frames or adjust the frame rate of the track. Note that these constraints have no effect on the VideoFrame objects written to the writable of a VideoTrackGenerator, just on the output of the track on which the constraints have been applied. Note also that, since a VideoTrackGenerator can in principle produce media data with any setting for the supported constrainable properties, an applyConstraints call on a track backed by a VideoTrackGenerator will generally not fail with OverconstrainedError unless the given constraints are outside the system-supported range, as reported by getCapabilities.
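As a worked example of the aspectRatio setting in the table above, the rounding to the tenth decimal place can be written out as follows. The helper name aspectRatioSetting is hypothetical, and the rounding method shown is one plausible reading of the table entry rather than a normative algorithm.

```javascript
// Hypothetical helper: width divided by height, rounded to ten decimals.
function aspectRatioSetting(width, height) {
  return Math.round((width / height) * 1e10) / 1e10;
}

const ratio = aspectRatioSetting(1920, 1080);
// ratio is 1.7777777778
```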

2.2.5.3. Events and attributes

Events and attributes work the same as for any MediaStreamTrack. It is relevant to note that if the writable stream of a VideoTrackGenerator is closed, all the live tracks connected to it are ended and the ended event is fired on them.

3. Examples

3.1. Video Processing

Consider a face recognition function detectFace(videoFrame) that returns a face position (in some format), and a manipulation function blurBackground(videoFrame, facePosition) that returns a new VideoFrame similar to the given videoFrame, but with the non-face parts blurred. The example also shows the video before and after effects on video elements.

// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const videoBefore = document.getElementById('video-before');
const videoAfter = document.getElementById('video-after');
videoBefore.srcObject = stream.clone();

const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
videoAfter.srcObject = new MediaStream([data.track]);

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackGenerator();
  self.postMessage({track: source.track}, [source.track]);

  const {readable} = new MediaStreamTrackProcessor({track});
  const transformer = new TransformStream({
    async transform(frame, controller) {
      const facePosition = await detectFace(frame);
      const newFrame = blurBackground(frame, facePosition);
      frame.close();
      controller.enqueue(newFrame);
    }
  });
  await readable.pipeThrough(transformer).pipeTo(source.writable);
};

3.2. Multi-consumer post-processing with constraints

A common use case is to remove the background from live camera video fed into a video conference, with a live self-view showing the result. It’s desirable for the self-view to have a high frame rate even if the frame rate used for actual sending may dip lower due to back pressure from bandwidth constraints. This can be achieved by applying constraints to a track clone, avoiding having to process twice.

// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
const selfView = document.getElementById('video-self');
selfView.srcObject = new MediaStream([data.track.clone()]); // 60 fps

await data.track.applyConstraints({width: 320, height: 200, frameRate: 30});
const pc = new RTCPeerConnection(config);
pc.addTrack(data.track); // 30 fps

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackGenerator();
  self.postMessage({track: source.track}, [source.track]);

  const {readable} = new MediaStreamTrackProcessor({track});
  const transformer = new TransformStream({transform: myRemoveBackgroundFromVideo});
  await readable.pipeThrough(transformer).pipeTo(source.writable);
};

3.3. Multi-consumer post-processing with constraints in a worker

Being able to show a higher frame-rate self-view is also relevant when sending video frames over WebTransport in a worker. The same technique above may be used here, except constraints are applied to a track clone in the worker.

// main.js

const stream = await navigator.mediaDevices.getUserMedia({video:true});
const [track] = stream.getVideoTracks();
const worker = new Worker('worker.js');
worker.postMessage({track}, [track]);

const {data} = await new Promise(r => worker.onmessage = r);
const selfView = document.getElementById('video-self');
selfView.srcObject = new MediaStream([data.track]); // 60 fps

// worker.js

self.onmessage = async ({data: {track}}) => {
  const source = new VideoTrackGenerator();
  const sendTrack = source.track.clone();
  self.postMessage({track: source.track}, [source.track]);

  await sendTrack.applyConstraints({width: 320, height: 200, frameRate: 30});

  const wt = new WebTransport("https://webtransport.org:8080/up");

  const {readable} = new MediaStreamTrackProcessor({track});
  const transformer = new TransformStream({transform: myRemoveBackgroundFromVideo});
  // A MediaStreamTrack has no readable attribute; read sendTrack via a processor.
  const sendReadable = new MediaStreamTrackProcessor({track: sendTrack}).readable;
  await readable.pipeThrough(transformer)
    .pipeThrough({writable: source.writable, readable: sendReadable})
    .pipeThrough(createMyEncodeVideoStream({
      codec: "vp8",
      width: 640,
      height: 480,
      bitrate: 1000000,
    }))
    .pipeThrough(new TransformStream({transform: mySerializer}))
    .pipeTo(await wt.createUnidirectionalStream()); // 30 fps
};

The above example avoids using the tee() function to serve multiple consumers, due to its issues with real-time streams.

For brevity, the example also over-simplifies using a WebCodecs wrapper to encode and send video frames over a single WebTransport stream (incurring head-of-line blocking).

4. Implementation advice

This section is informative.

4.1. Use with multiple consumers

There are use cases where the programmer may desire that a single stream of frames is consumed by multiple consumers.

Examples include the case where the result of a background blurring function should be both displayed in a self-view and encoded using a VideoEncoder.

For cases where both consumers are consuming unprocessed frames, and synchronization is not desired, instantiating multiple MediaStreamTrackProcessor objects is a robust solution.

For cases where both consumers intend to convert the result of a processing step into a MediaStreamTrack using a VideoTrackGenerator, for example when feeding a processed stream to both a <video> tag and an RTCPeerConnection, attaching the resulting MediaStreamTrack to multiple sinks may be the most appropriate mechanism.

For cases where the downstream processing takes frames, not streams, the frames can be cloned as needed and sent off to the downstream processing; "clone" is a cheap operation.

When the stream is the output of some processing, and both branches need a Stream object to do further processing, one needs a function that produces two streams from one stream.

However, the standard tee() operation is problematic in this context:

It defeats the backpressure mechanism that guards against excessive queueing.
It creates multiple links to the same buffers, meaning that the question of which consumer gets to close() the buffer is a difficult one to address.

Therefore, the use of tee() with Streams containing media should only be done when fully understanding the implications. Instead, custom elements for splitting streams more appropriate to the use case should be used.

If both branches require the ability to dispose of the frames, clone() the frame and enqueue distinct copies in both queues. This corresponds to the function ReadableStreamTee(stream, cloneForBranch2=true). Then choose one of the alternatives below.
If one branch requires all frames, and the other branch tolerates dropped frames, enqueue buffers in the all-frames-required stream and use the backpressure signal from that stream to stop reading from the source. If backpressure signal from the other stream indicates room, enqueue the same frame in that queue too.
If neither stream tolerates dropped frames, use the combined backpressure signal to stop reading from the source. In this case, frames will be processed in lockstep if the buffer sizes are both 1.
If it is OK for the incoming stream to be stalled only when the underlying buffer pool allocated to the process is exhausted, standard tee() may be used.
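The alternative where one branch requires all frames and the other tolerates drops can be sketched as follows. Everything here is hypothetical: makeBranch models a consumer queue with a desiredSize signal similar to a stream controller's, makeFrame is a stand-in for a frame with clone(), and splitFrame is called once per source frame. A real implementation would sit between actual streams and would also close() frames it discards.

```javascript
// Hypothetical consumer branch with a bounded queue and a backpressure signal.
function makeBranch(capacity) {
  const queue = [];
  return {
    queue,
    get desiredSize() { return capacity - queue.length; },
    enqueue(frame) { queue.push(frame); },
  };
}

// Hypothetical stand-in for a frame that supports clone().
function makeFrame(id) {
  return {id, clone() { return makeFrame(id); }};
}

// Offers one frame to both branches. Returns false when the all-frames
// branch signals backpressure, telling the caller to stop reading from
// the source; the frame is not enqueued anywhere in that case.
function splitFrame(frame, allFrames, lossy) {
  if (allFrames.desiredSize <= 0) return false;
  // The lossy branch only gets a (distinct) clone when it has room.
  if (lossy.desiredSize > 0) lossy.enqueue(frame.clone());
  allFrames.enqueue(frame);
  return true;
}
```

Because each branch receives its own frame object, either consumer can dispose of its frame without affecting the other.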

Note: There are issues filed on the Streams spec where the resolution might affect this section: https://github.com/whatwg/streams/issues/1157, https://github.com/whatwg/streams/issues/1156, https://github.com/whatwg/streams/issues/401, https://github.com/whatwg/streams/issues/1186

5. Security and Privacy considerations

This API defines a MediaStreamTrack source and a MediaStreamTrack sink. The security and privacy of the source (VideoTrackGenerator) relies on the same-origin policy. That is, the data VideoTrackGenerator can make available in the form of a MediaStreamTrack must be visible to the document before a VideoFrame object can be constructed and pushed into the VideoTrackGenerator. Any attempt to create VideoFrame objects using cross-origin data will fail. Therefore, VideoTrackGenerator does not introduce any new fingerprinting surface.

The MediaStreamTrack sink introduced by this API (MediaStreamTrackProcessor) exposes the same data that is exposed by other MediaStreamTrack sinks such as WebRTC peer connections and media elements. The security and privacy of MediaStreamTrackProcessor relies on the security and privacy of the MediaStreamTrack sources of the tracks to which MediaStreamTrackProcessor is connected. For example, camera, microphone and screen-capture tracks rely on explicit use authorization via permission dialogs (see [MEDIACAPTURE-STREAMS] and [SCREEN-CAPTURE]), while element capture and VideoTrackGenerator rely on the same-origin policy.

A potential issue with MediaStreamTrackProcessor is resource exhaustion. For example, a site might hold on to too many open VideoFrame objects and deplete a system-wide pool of GPU-memory-backed frames. UAs can mitigate this risk by limiting the number of pool-backed frames a site can hold. This can be achieved by reducing the maximum number of buffered frames and by refusing to deliver more frames to readable once the budget limit is reached. Accidental exhaustion is also mitigated by automatic closing of VideoFrame objects once they are written to a VideoTrackGenerator.

6. Backwards compatibility with earlier proposals

This section is informative.

Previous proposals for this interface had an API like this:

[Exposed=Window,DedicatedWorker]
interface MediaStreamTrackGenerator : MediaStreamTrack {
    constructor(MediaStreamTrackGeneratorInit init);
    attribute WritableStream writable;  // VideoFrame or AudioData
};

dictionary MediaStreamTrackGeneratorInit {
  required DOMString kind;
};

This interface had the generator for the MediaStreamTrack being an instance of a MediaStreamTrack rather than containing one.

The VideoTrackGenerator can be shimmed on top of MediaStreamTrackGenerator like this:

// Not tested, unlikely to work as written!
class VideoTrackGenerator {
  constructor() {
     this.innerGenerator = new MediaStreamTrackGenerator({kind: 'video'});
     this.writable = this.innerGenerator.writable;
     this.track = this.innerGenerator.clone();
  }
  // Missing: shim for setting of the "muted" attribute.
};

Further description of the previous proposals, including considerations involving processing of audio, can be found in earlier versions of this document.

Note: A link will be placed here pointing to the chrome-96 branch when we have finished moving repos about.
