This specification defines new VideoMonitor and VideoProcessor interfaces, and extends MediaStreamTrack and ImageBitmap interfaces to enable processing of video frames using script in a performant manner.

Work on this document has been discontinued and it should not be referenced or used as a basis for implementation.

Introduction

The HTML specification [[!HTML]] defines the ImageBitmap interface that represents a bitmap image. The Media Capture and Streams specification [[!GETUSERMEDIA]] allows a web page to access video streams sourced from cameras that are exposed to the underlying platform, and defines the MediaStreamTrack interface that represents a media source. This specification extends these interfaces to define a model for processing video frames sourced from media stream tracks by a script in a performant manner. This enables use cases such as video editing, digital image processing, and object recognition, among other advanced media use cases on the Web Platform.

In this model, a video monitor or video processor is associated with an input media stream track. Events containing input frame data from the input media stream track are dispatched at the video monitor or video processor. A Web developer is able to monitor and process the input frame and produce an output frame using script. Finally, the provided output frame is fed into the output media stream track on the main thread. The output media stream track can then be used to construct a media stream and, for example, provided to media sinks such as <video> for rendering.
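
For illustration, the following non-normative sketch wires this pipeline together on the main thread; mediaStream is assumed to have been obtained earlier (for example via getUserMedia()), and the <video> element lookup is an assumption of the sketch:

          var processor = new VideoProcessor();
          var inputTrack = mediaStream.getVideoTracks()[0];
          var outputTrack = inputTrack.addVideoProcessor(processor);
          // Wrap the output track in a MediaStream and hand it to a media sink.
          document.querySelector("video").srcObject = new MediaStream([outputTrack]);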

The design principle of this push-like mechanism for video processing is depicted in the following figure:

[Figure: The relationship between Worker and MediaStreamTrack]

Use cases and requirements

This specification attempts to address the Use Cases and Requirements for expanding the Web Platform to support image processing and computer vision usages.

Conformance

This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]], as this specification uses that specification and terminology.

The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [[!RFC2119]]

Dependencies

The MediaStreamTrack and MediaStream interfaces this specification extends and the source concept are defined in [[!GETUSERMEDIA]].

The ImageBitmap interface and ImageBitmapFactories interface this specification extends are defined in [[!HTML51]].

The BufferSource and ArrayBuffer types are defined in [[!WEBIDL]].

The Promise object is defined in [[!ECMASCRIPT]].

The EventHandler type and the concepts of event handlers and event handler event types are defined in [[!HTML]].

Terminology

Video monitor
Captures an input frame from an input media stream track. A video monitor is said to be monitoring if it is associated with an input media stream track. Video monitor events are dispatched at a video monitor.
Video processor
Similarly captures an input frame from an input media stream track and, in addition, processes the provided output frame for use as the source for the output media stream track. A video processor is said to be processing if it is associated with both an input media stream track and an output media stream track. Video processor events are dispatched at a video processor.

An input frame is an ImageBitmap object representing a frame sourced from the input media stream track associated with a video monitor or video processor.

An output frame is an ImageBitmap object assigned to the outputImageBitmap attribute of the VideoProcessorEvent dispatched at a video processor.

The input media stream track is the MediaStreamTrack object on which the addVideoMonitor() or addVideoProcessor() method is invoked.

The output media stream track is a MediaStreamTrack object returned by the invocation of the addVideoProcessor() method.

There are two kinds of video events:

Video monitor events
These events are represented by VideoMonitorEvent objects that are dispatched at a VideoMonitor and provide read access to input frame data.
Video processor events
These events are represented by VideoProcessorEvent objects that are dispatched at a VideoProcessor and provide read access to input frame data and write access to output frame data.

VideoMonitor and VideoProcessor interfaces

The VideoMonitor interface is used to analyze video data, and the VideoProcessor interface, in addition, to generate and process video data directly using JavaScript. Video monitor events are dispatched at VideoMonitor objects, and video processor events are dispatched at VideoProcessor objects.

          [Constructor]
          interface VideoMonitor : EventTarget {
              attribute EventHandler onvideomonitorchange;
          };
        
          [Constructor]
          interface VideoProcessor : VideoMonitor {
              attribute EventHandler onvideoprocessorchange;
          };
        

The following are the event handlers (and their corresponding event handler event types) that must be supported, as event handler IDL attributes, by objects implementing the VideoMonitor interface:

Event handler            Event handler event type
onvideomonitorchange     videomonitorchange

The following are the event handlers (and their corresponding event handler event types) that must be supported, as event handler IDL attributes, by objects implementing the VideoProcessor interface:

Event handler            Event handler event type
onvideoprocessorchange   videoprocessorchange

Video event firing

To fire a video event named e, the user agent must run the following steps:

  1. If e is videomonitorchange, create a VideoMonitorEvent and initialize it to have the given name e, to not bubble, and to not be cancelable.
  2. If e is videoprocessorchange, create a VideoProcessorEvent and initialize it to have the given name e, to not bubble, and to not be cancelable.
  3. Initialize the trackId attribute to the value of the id attribute of the input media stream track.
  4. Initialize the playbackTime attribute to the value of the currentTime attribute of the MediaStream that contains the input media stream track.
  5. Initialize the inputImageBitmap attribute to the bitmap data of the input media stream track's video frame at the current stream position playbackTime.
  6. If e is videomonitorchange, dispatch the newly created VideoMonitorEvent object at each video monitor that is monitoring.
  7. If e is videoprocessorchange, initialize the outputImageBitmap attribute to null, and dispatch the newly created VideoProcessorEvent object at each video processor that is processing.

VideoMonitorEvent interface

The video monitor event contains an input frame and its metadata originating from the input media stream track. It uses the VideoMonitorEvent interface for its videomonitorchange events:

          [Constructor(DOMString type, optional VideoMonitorEventInit videoMonitorEventInitDict)]
          interface VideoMonitorEvent : Event {
              readonly    attribute DOMString   trackId;
              readonly    attribute double      playbackTime;
              readonly    attribute ImageBitmap? inputImageBitmap;
          };
          dictionary VideoMonitorEventInit : EventInit {
              required DOMString    trackId;
              required double       playbackTime;
              required ImageBitmap? inputImageBitmap;
          };
        

The trackId attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to the empty string. It represents the identifier of the input media stream track associated with the video monitor.

The playbackTime attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to zero. It represents the current stream position, in seconds.

The inputImageBitmap attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to null. It represents the ImageBitmap object whose bitmap data is provided by the input media stream track.

When a user agent is required to fire a video monitor event, it must fire a video event named videomonitorchange.
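
For example, a page could observe frame metadata with a handler like the following non-normative sketch, where monitor is assumed to be a VideoMonitor that is monitoring:

          monitor.onvideomonitorchange = function(event) {
            // Read-only access to the input frame and its metadata.
            console.log("track " + event.trackId + " at " + event.playbackTime + "s: " +
                        event.inputImageBitmap.width + "x" + event.inputImageBitmap.height);
          };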

VideoProcessorEvent interface

The video processor event inherits from the video monitor event, and in addition provides means to programmatically construct an output frame for the output media stream track. It uses the VideoProcessorEvent interface for its videoprocessorchange events:

          [Constructor(DOMString type, optional VideoProcessorEventInit videoProcessorEventInitDict)]
          interface VideoProcessorEvent : VideoMonitorEvent {
                          attribute ImageBitmap? outputImageBitmap;
          };

          dictionary VideoProcessorEventInit : VideoMonitorEventInit {
              required ImageBitmap? outputImageBitmap;
          };
        

The outputImageBitmap attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to null. It represents the output frame. On setting, the user agent must run the following steps:

  1. Let output bitmap be the ImageBitmap object assigned to the outputImageBitmap attribute.
  2. Let output media stream track be the MediaStreamTrack returned by the addVideoProcessor() method.
  3. Set output bitmap as the source of the output media stream track.
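
For example, a trivial pass-through processor can forward each input frame unmodified; in this non-normative sketch, processor is assumed to be a VideoProcessor that is processing:

          processor.onvideoprocessorchange = function(event) {
            // Assigning outputImageBitmap makes the bitmap the source of the
            // output media stream track.
            event.outputImageBitmap = event.inputImageBitmap;
          };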

When a user agent is required to fire a video processor event, it must fire a video event named videoprocessorchange.

Ideally, the MediaStreamTrack should dispatch each video frame through a VideoProcessorEvent, but the worker thread might not always be able to process a frame in time. In that case, the implementation may skip frames to avoid a high memory footprint, so not every frame of a real-time MediaStream is guaranteed to be processed.

MediaStreamTrack extensions

MediaStreamTrack interface

          partial interface MediaStreamTrack {
              void              addVideoMonitor(VideoMonitor monitor);
              void              removeVideoMonitor(VideoMonitor monitor);
              MediaStreamTrack  addVideoProcessor(VideoProcessor processor);
              void              removeVideoProcessor();
          };
        

The addVideoMonitor() method, when invoked, must run these steps:

  1. Let monitor be the first method argument.
  2. Let track be the MediaStreamTrack object on which the method was invoked. (This is the input media stream track.)
  3. Associate monitor with input media stream track track.

The removeVideoMonitor() method, when invoked, must run these steps:

  1. Let monitor be the first method argument.
  2. Let track be the MediaStreamTrack object on which the method was invoked.
  3. If there exists an association between monitor and track, break that association.

The addVideoProcessor() method, when invoked, must run these steps:

  1. Let processor be the first method argument.
  2. Let track be the MediaStreamTrack object on which the method was invoked. (This is the input media stream track.)
  3. Associate processor with input media stream track track.
  4. Let new track be a newly created MediaStreamTrack object. (This is the output media stream track.)
  5. Associate new track as the output media stream track for processor.
  6. Return new track.

The removeVideoProcessor() method, when invoked, must run these steps:

  1. Let processor be the first method argument.
  2. Let track be the MediaStreamTrack object on which the method was invoked.
  3. If there exists an association between processor and track, break that association.
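
The following non-normative sketch exercises the four methods; track is assumed to be a video MediaStreamTrack:

          var monitor = new VideoMonitor();
          track.addVideoMonitor(monitor);     // monitor is now monitoring track
          track.removeVideoMonitor(monitor);  // association broken; no more events

          var processor = new VideoProcessor();
          var outputTrack = track.addVideoProcessor(processor);  // processor is now processing
          track.removeVideoProcessor();       // break the processor association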

ImageBitmap extensions

The ImageBitmap interface was originally designed as a pure opaque handle to an image data buffer inside a browser, so that how the browser stores the buffer is unknown to users and can be optimized for each platform. In this specification, we choose ImageBitmap (instead of ImageData) as the container of video frames because the decoded video frame data might exist in either CPU or GPU memory, which perfectly matches the nature of ImageBitmap as an opaque handle.

Considering how developers would process video frames, there are two possible approaches: via pure JavaScript(/asm.js) code or via WebGL.

  1. If developers use JavaScript(/asm.js) to process the frames, then the ImageBitmap interface needs to be extended with APIs for developers to access its underlying data and there should also be a way for developers to create an ImageBitmap from the processed data.
  2. If developers use WebGL, then WebGL needs to be extended so that developers can pass an ImageBitmap into the WebGL context and the browser will handle how to upload the raw image data into the GPU memory. Possibly, the data is already in the GPU memory so that the operation could be very efficient.

In this specification, the original ImageBitmap interface is extended with three methods that let developers read data from an ImageBitmap object into a given BufferSource in a set of supported ImageFormats. How the accessed image data is arranged in memory is described by the proposed ImagePixelLayout and the ChannelPixelLayout dictionary. Also, the ImageBitmapFactories interface is extended to let developers create an ImageBitmap object from a given BufferSource.

Image format

An image or a video frame is conceptually a two-dimensional array of data, and each element in the array is called a pixel. The pixels are usually stored in a one-dimensional array and could be arranged in a variety of image formats. Developers need to know how the pixels are formatted so that they are able to process them.

The image format describes how pixels in an image are arranged. A single pixel has at least one, but usually multiple pixel values. The range of a pixel value varies, which means different image formats use different data types to store a single pixel value.

The most frequently used data type is the 8-bit unsigned integer, whose range is from 0 to 255; others could be 16-bit integers or 32-bit floating point numbers and so forth. The number of pixel values of a single pixel is called the number of channels of the image format. The multiple pixel values of a pixel are used together to describe the captured property, which could be color or depth information. For example, if the data is a color image in the RGB color space, then it is a three-channel image format and a pixel is described by three pixel values, R, G, and B, each with a range from 0 to 255. As another example, if the data is a gray image, then it is a single-channel image format with an 8-bit unsigned integer data type, and the pixel value describes the gray scale. Depth data also uses a single-channel image format, but the data type is a 16-bit unsigned integer and the pixel value is the depth level. A rough, non-normative sizing sketch for a few of these formats follows.
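
          // Illustrative, non-normative arithmetic for per-frame buffer sizes
          // of a 640x480 image in a few formats (row padding ignored).
          var w = 640, h = 480;
          var rgbaBytes   = w * h * 4;                      // RGBA32: 4 uint8 values per pixel = 1228800
          var grayBytes   = w * h;                          // GRAY8: 1 uint8 value per pixel = 307200
          var depthBytes  = w * h * 2;                      // DEPTH: 1 uint16 value per pixel = 614400
          var yuv420Bytes = w * h + 2 * (w / 2) * (h / 2);  // YUV420P: full Y plus quarter U and V = 460800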

For those image formats whose pixels contain multiple pixel values, the pixel values might be arranged in one of the following ways:

  1. Planar pixel layout: each channel has its pixel values stored consecutively in a separate buffer (a.k.a. plane), and all channel buffers are then stored consecutively in memory. (Ex: RRRRRR......GGGGGG......BBBBBB......)
  2. Interleaving pixel layout: each pixel has its pixel values from all channels stored together, interleaving all the channels. (Ex: RGBRGBRGBRGBRGB......)

Image formats that belong to the same color space might have different pixel layouts.

ImageFormat enumeration

The ImageFormat enumeration is used to select the image format for the ImagePixelLayout. The ImageBitmap extensions defined in this specification use this enumeration to negotiate the image format while accessing the underlying data of ImageBitmap and creating a new ImageBitmap.

We need to elaborate this list for standardization.

            enum ImageFormat {
                "RGBA32",
                "BGRA32",
                "RGB24",
                "BGR24",
                "GRAY8",
                "YUV444P",
                "YUV422P",
                "YUV420P",
                "YUV420SP_NV12",
                "YUV420SP_NV21",
                "HSV",
                "Lab",
                "DEPTH",
                /* empty string */ ""
            };
          
ImageFormat     Channel order   Channel size                          Pixel layout                                 Data type
RGBA32          R, G, B, A      full rgba-channels                    interleaving rgba-channels                   8-bit unsigned integer
BGRA32          B, G, R, A      full bgra-channels                    interleaving bgra-channels                   8-bit unsigned integer
RGB24           R, G, B         full rgb-channels                     interleaving rgb-channels                    8-bit unsigned integer
BGR24           B, G, R         full bgr-channels                     interleaving bgr-channels                    8-bit unsigned integer
GRAY8           GRAY            full gray-channel                     planar gray-channel                          8-bit unsigned integer
YUV444P         Y, U, V         full yuv-channels                     planar yuv-channels                          8-bit unsigned integer
YUV422P         Y, U, V         full y-channel, half uv-channels      planar yuv-channels                          8-bit unsigned integer
YUV420P         Y, U, V         full y-channel, quarter uv-channels   planar yuv-channels                          8-bit unsigned integer
YUV420SP_NV12   Y, U, V         full y-channel, quarter uv-channels   planar y-channel, interleaving uv-channels   8-bit unsigned integer
YUV420SP_NV21   Y, V, U         full y-channel, quarter uv-channels   planar y-channel, interleaving vu-channels   8-bit unsigned integer
HSV             H, S, V         full hsv-channels                     interleaving hsv-channels                    32-bit IEEE floating point number
Lab             l, a, b         full lab-channels                     interleaving lab-channels                    32-bit IEEE floating point number
DEPTH           DEPTH           full depth-channel                    planar depth-channel                         16-bit unsigned integer
"" (empty)      N/A             N/A                                   N/A                                          N/A

ChannelPixelLayoutDataType enumeration

The ChannelPixelLayoutDataType enumeration is used to select the channel data type that is used to store a single pixel value.

            enum ChannelPixelLayoutDataType {
                "uint8",
                "int8",
                "uint16",
                "int16",
                "uint32",
                "int32",
                "float32",
                "float64"
            };
          
DataType   Description
uint8      8-bit unsigned integer.
int8       8-bit signed integer.
uint16     16-bit unsigned integer.
int16      16-bit signed integer.
uint32     32-bit unsigned integer.
int32      32-bit signed integer.
float32    32-bit IEEE floating point number.
float64    64-bit IEEE floating point number.

Pixel layout

To generalize the variety of pixel layouts among image formats, we propose here the ChannelPixelLayout dictionary and ImagePixelLayout, which is a sequence of ChannelPixelLayout.

The ImagePixelLayout represents the pixel layout of a certain image format. Since an image format is composed of at least one channel, an ImagePixelLayout object contains at least one ChannelPixelLayout object.

Although an image or a video frame is a two-dimensional structure, its data is usually stored in a one-dimensional array in row-major order, and each channel describes how its pixel values are arranged within that one-dimensional buffer.

A channel has an associated offset that denotes the beginning position of the channel's data relative to the given BufferSource parameter of the mapDataInto() method.

A channel has an associated width and height that denote the width and height of the channel respectively. Each channel in an image format may have different height and width.

A channel has an associated data type used to store one single pixel value.

A channel has an associated stride that is the number of bytes between the beginnings of two consecutive rows in memory. (The total bytes of each row plus the padding bytes of each row.)

A channel has an associated skip value. The value is zero for the planar pixel layout, and a positive integer for the interleaving pixel layout. (It describes how many bytes there are between two adjacent pixel values in this channel.)
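
Under these definitions, the byte position of any pixel value can be computed from the channel's attributes. The helper below is a non-normative sketch; channel is assumed to be a ChannelPixelLayout-like object and bytesPerValue the size of its data type in bytes:

          // Byte index of the pixel value at (row, col) within one channel.
          function pixelValueIndex(channel, row, col, bytesPerValue) {
            // Planar layout: skip = 0, so adjacent values are bytesPerValue apart.
            // Interleaving layout: skip extra bytes sit between adjacent values.
            return channel.offset + channel.stride * row + (bytesPerValue + channel.skip) * col;
          }

For instance, with the RGBA32 layout of Example1 below, the G value of the second pixel in the first row lands at byte 1 + (1 + 3) × 1 = 5.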

          Example1: RGBA image, width = 620, height = 480, stride = 2560

          channel_r: offset = 0, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
          channel_g: offset = 1, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
          channel_b: offset = 2, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
          channel_a: offset = 3, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3

                  <---------------------------- stride ---------------------------->
                  <---------------------- width x 4 ---------------------->
          [index] 01234   8   12  16  20  24  28                           2479    2559
                  |||||---|---|---|---|---|---|----------------------------|-------|
          [data]  RGBARGBARGBARGBARGBAR___R___R...                         A%%%%%%%%
          [data]  RGBARGBARGBARGBARGBAR___R___R...                         A%%%%%%%%
          [data]  RGBARGBARGBARGBARGBAR___R___R...                         A%%%%%%%%
                       ^^^
                       r-skip
        
          Example2: YUV420P image, width = 620, height = 480, stride = 640

          channel_y: offset = 0, width = 620, height = 480, data type = uint8, stride = 640, skip = 0
          channel_u: offset = 307200, width = 310, height = 240, data type = uint8, stride = 320, skip = 0
          channel_v: offset = 384000, width = 310, height = 240, data type = uint8, stride = 320, skip = 0

                  <--------------------------- y-stride --------------------------->
                  <----------------------- y-width ----------------------->
          [index] 012345                                                  619      639
                  ||||||--------------------------------------------------|--------|
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                        Y%%%%%%%%%
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                        Y%%%%%%%%%
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                        Y%%%%%%%%%
          [data]  ......
                  <-------- u-stride ---------->
                  <----- u-width ----->
          [index] 307200              307509   307519
                  |-------------------|--------|
          [data]  UUUUUUUUUU...       U%%%%%%%%%
          [data]  UUUUUUUUUU...       U%%%%%%%%%
          [data]  UUUUUUUUUU...       U%%%%%%%%%
          [data]  ......
                  <-------- v-stride ---------->
                  <----- v-width ----->
          [index] 384000              384309   384319
                  |-------------------|--------|
          [data]  VVVVVVVVVV...       V%%%%%%%%%
          [data]  VVVVVVVVVV...       V%%%%%%%%%
          [data]  VVVVVVVVVV...       V%%%%%%%%%
          [data]  ......
        
          Example3: YUV420SP_NV12 image, width = 620, height = 480, stride = 640

          channel_y: offset = 0, width = 620, height = 480, data type = uint8, stride = 640, skip = 0
          channel_u: offset = 307200, width = 310, height = 240, data type = uint8, stride = 640, skip = 1
          channel_v: offset = 307201, width = 310, height = 240, data type = uint8, stride = 640, skip = 1

                  <--------------------------- y-stride -------------------------->
                  <----------------------- y-width ---------------------->
          [index] 012345                                                 619      639
                  ||||||-------------------------------------------------|--------|
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                       Y%%%%%%%%%
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                       Y%%%%%%%%%
          [data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                       Y%%%%%%%%%
          [data]  ......
                  <--------------------- u-stride / v-stride -------------------->
                  <------------------ u-width + v-width ----------------->
          [index] 307200(u-offset)                                       307819  307839
                  |------------------------------------------------------|-------|
          [index] |307201(v-offset)                                      |307820 |
                  ||-----------------------------------------------------||------|
          [data]  UVUVUVUVUVUVUVUVUVUVUVUVUVUVUV...                      UV%%%%%%%
          [data]  UVUVUVUVUVUVUVUVUVUVUVUVUVUVUV...                      UV%%%%%%%
          [data]  UVUVUVUVUVUVUVUVUVUVUVUVUVUVUV...                      UV%%%%%%%
                   ^            ^
                  u-skip        v-skip
        
          Example4: DEPTH image, width = 640, height = 480, stride = 1280

          channel_d: offset = 0, width = 640, height = 480, data type = uint16, stride = 1280, skip = 0

                  <----------------------- d-stride ---------------------->
                  <----------------------- d-width ----------------------->
          [index] 012345                                                  1280
                  ||||||--------------------------------------------------|
          [data]  DDDDDDDDDDDDDDDDDDDDDDDDDDDDD...                        D
          [data]  DDDDDDDDDDDDDDDDDDDDDDDDDDDDD...                        D
          [data]  DDDDDDDDDDDDDDDDDDDDDDDDDDDDD...                        D
          [data]  ......
        

ChannelPixelLayout dictionary

Each channel is represented by a ChannelPixelLayout object.

            dictionary ChannelPixelLayout {
                required unsigned long offset;
                required unsigned long width;
                required unsigned long height;
                required ChannelPixelLayoutDataType dataType;
                required unsigned long stride;
                required unsigned long skip;
            };
          

The offset attribute represents the channel's offset.

The width attribute represents the width of the channel. (Channels in an image format may have different widths.)

The height attribute represents the height of the channel. (Channels in an image format may have different heights.)

The dataType attribute must return the data type of the channel, one of the channel data types.

The stride attribute represents the stride of the channel.

The skip attribute represents the skip value for the channel.

ImagePixelLayout definition

            typedef sequence<ChannelPixelLayout> ImagePixelLayout;
          

ImageBitmap interface

          [Exposed=(Window,Worker)]
          partial interface ImageBitmap {
              ImageFormat                     findOptimalFormat (optional sequence<ImageFormat> possibleFormats);
              long                            mappedDataLength (ImageFormat format);
              Promise<ImagePixelLayout>       mapDataInto (ImageFormat format, BufferSource buffer, long offset, long length);
          };
        

The findOptimalFormat(possibleFormats) method must run the following steps:

  1. Let image bitmap be the object on which the method was invoked.
  2. Let possible formats be the first argument.
  3. If possible formats is empty, return the ImageFormat that is the most suitable image format for image bitmap, and terminate these steps.
    It is up to the implementation to decide how to choose the most suitable image format from a list of image formats.
  4. If none of the image formats in possible formats is an image format that the user agent knows it can render, return the empty string and terminate these steps.
  5. Otherwise, return the ImageFormat that is the most suitable image format out of possible formats for image bitmap, and terminate these steps.

The mappedDataLength(format) method must run the following steps:

  1. Let image bitmap be the object on which the method was invoked.
  2. Let format be the first argument.
  3. If the user agent cannot render image bitmap represented in format, throw a NotSupportedError and terminate these steps.
  4. Otherwise, return the length, in bytes, of image bitmap's data when represented in the format image format.

The mapDataInto(format, buffer, offset, length) method must run the following steps:

  1. Let promise be a new promise.
  2. Run these substeps in parallel:
    1. Let image bitmap be the object on which the method was invoked.
    2. If image bitmap was cropped to the source rectangle so that it contains any transparent black pixels (cropping area is outside of the source image), then reject promise with IndexSizeError and abort these steps.
    3. Let format, buffer, offset, and length be the similarly named method arguments.
    4. If the user agent cannot render image bitmap represented in format, reject promise with NotSupportedError and terminate these steps.
    5. Make a copy of the underlying image data of image bitmap in the given format format into BufferSource buffer at offset offset, filling at most length bytes.
    6. Let pixel layout be a new ImagePixelLayout object that represents the image data in the previous step.
    7. Resolve promise with pixel layout.
  3. Return promise.
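
Putting the three methods together, a typical read-out sequence looks like the following non-normative sketch, where bitmap is assumed to be an ImageBitmap:

          var format = bitmap.findOptimalFormat();
          var length = bitmap.mappedDataLength(format);
          var buffer = new ArrayBuffer(length);
          bitmap.mapDataInto(format, buffer, 0, length).then(function(layout) {
            // layout is an ImagePixelLayout: one ChannelPixelLayout per channel.
            console.log("first channel: " + layout[0].width + "x" + layout[0].height);
          });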

ImageBitmapFactories interface

          [NoInterfaceObject, Exposed=(Window,Worker)]
          partial interface ImageBitmapFactories {
              Promise<ImageBitmap> createImageBitmap (BufferSource buffer, long offset, long length, ImageFormat format, ImagePixelLayout layout);
          };
        

The createImageBitmap(buffer, offset, length, format, layout) method must run the following steps:

  1. Let buffer be the container for the raw image data.
  2. Let offset be the offset (beginning position) of the raw image data in buffer.
  3. Let length be the length, in bytes, of the space in buffer in which the raw image data is placed.
  4. Let format be the image format of the raw image data placed in buffer.
  5. Let layout be the pixel layout of the raw image data, which describes how the data is arranged in buffer using format.
  6. Let promise be a new promise.
  7. If buffer has been neutered, reject promise with an InvalidStateError, return promise, and terminate these steps.
  8. Return promise, but continue running these steps in parallel.
  9. Let image bitmap be a newly created ImageBitmap object.
  10. Set image bitmap's bitmap data to the image data given by the BufferSource buffer, interpreted using format and layout.
  11. Resolve promise with image bitmap as the value.
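
For example, a raw 8-bit gray buffer can be wrapped into an ImageBitmap as follows (a non-normative sketch):

          var width = 320, height = 240;
          var grayBuffer = new ArrayBuffer(width * height);
          // A single-channel GRAY8 layout with no row padding and no interleaving.
          var grayLayout = [{offset: 0, width: width, height: height,
                             dataType: "uint8", stride: width, skip: 0}];
          createImageBitmap(grayBuffer, 0, grayBuffer.byteLength, "GRAY8", grayLayout)
            .then(function(grayBitmap) {
              // grayBitmap can now be used like any other ImageBitmap.
            });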

Examples

This example demonstrates how to hook a worker to a MediaStreamTrack as a video processor and use the ImageBitmap extensions from within a worker script.

A non-worker example should be added too for completeness.
          <script>
          // mediaStream and newMediaStream are assumed to exist already, for
          // example obtained via getUserMedia() and created as a new MediaStream.
          var processor = new VideoProcessor();
          var inputTrack = mediaStream.getVideoTracks()[0];
          var outputTrack = inputTrack.addVideoProcessor(processor);
          newMediaStream.addTrack(outputTrack);
          var worker = new Worker("processing.js");
          worker.postMessage({aCommand: 'pass_processor', aProcessor: processor}, [processor]);
          </script>
        

The worker executes the following processing.js script:

          self.onmessage = function(msg) {
            switch (msg.data.aCommand) {
              case 'pass_processor':
                bindProcessor(msg.data.aProcessor);
                break;
              default:
                throw 'unknown aCommand on incoming message to Worker';
            }
          };

          function bindProcessor(processor) {
            processor.onvideoprocessorchange = function(event) {
              // Check if the browser supports YUV format.
              var bitmap = event.inputImageBitmap;
              var yuvFormats = ["YUV444P", "YUV422P", "YUV420P", "YUV420SP_NV12", "YUV420SP_NV21"];
              var bitmapFormat = bitmap.findOptimalFormat(yuvFormats);
              if (bitmapFormat == "") {
                console.log("The browser does not support YUV formats.");
                return;
              }
              // Get the needed buffer size to read the image data in YUV format.
              var bitmapBufferLength = bitmap.mappedDataLength(bitmapFormat);

              // Create the buffer for mapping data out.
              var bitmapBuffer = new ArrayBuffer(bitmapBufferLength);
              var bitmapBufferView = new Uint8ClampedArray(bitmapBuffer, 0, bitmapBufferLength);

              // Map the bitmap's data into the buffer created in the previous step.
              var promise = bitmap.mapDataInto(bitmapFormat, bitmapBuffer, 0, bitmapBufferLength);
              promise.then(function(bitmapPixelLayout) {
                // Read out the y-channel properties. (ImagePixelLayout is a
                // sequence of ChannelPixelLayout, so index it directly.)
                var ywidth  = bitmapPixelLayout[0].width;
                var yheight = bitmapPixelLayout[0].height;
                var yoffset = bitmapPixelLayout[0].offset;
                var ystride = bitmapPixelLayout[0].stride;
                var yskip   = bitmapPixelLayout[0].skip;   // This should be 0.

                // Initialize the buffer for the result gray image.
                var rgbaBufferLength = ywidth * yheight * 4;
                var rgbaBuffer = new ArrayBuffer(rgbaBufferLength);
                var rgbaBufferView = new Uint8ClampedArray(rgbaBuffer, 0, rgbaBufferLength);

                // Convert YUV to Gray.
                for (var i = 0; i < yheight; ++i) {
                  for (var j = 0; j < ywidth; ++j) {
                    var srcIndex = ystride * i + j;       // input index; rows include padding
                    var dstIndex = (ywidth * i + j) * 4;  // output index; tightly packed rows
                    var y = bitmapBufferView[yoffset + srcIndex];
                    rgbaBufferView[dstIndex + 0] = y;
                    rgbaBufferView[dstIndex + 1] = y;
                    rgbaBufferView[dstIndex + 2] = y;
                    rgbaBufferView[dstIndex + 3] = 255;
                  }
                }

                // Create a new ImageBitmap from the processed rgbaBuffer and assign to the
                // event.outputImageBitmap.
                var channelR = {offset: 0, width: ywidth, height: yheight, dataType: "uint8", stride: ywidth * 4, skip: 3};
                var channelG = {offset: 1, width: ywidth, height: yheight, dataType: "uint8", stride: ywidth * 4, skip: 3};
                var channelB = {offset: 2, width: ywidth, height: yheight, dataType: "uint8", stride: ywidth * 4, skip: 3};
                var channelA = {offset: 3, width: ywidth, height: yheight, dataType: "uint8", stride: ywidth * 4, skip: 3};
                var layout = [channelR, channelG, channelB, channelA];
                var p = createImageBitmap(rgbaBuffer, 0, rgbaBufferLength, "RGBA32", layout);
                p.then(function(bitmap) {
                  event.outputImageBitmap = bitmap;
                }).catch(function(ex) {
                  console.log("Call createImageBitmap() failed. Error: " + ex);
                });
              },
              function(ex) {
                console.log("Call mapDataInto() failed. Error: " + ex);
              });
            };
          }
        

Acknowledgements

Thanks to Robert O'Callahan for his idea of this design.