This specification defines new VideoMonitor and VideoProcessor interfaces, and extends the MediaStreamTrack and ImageBitmap interfaces, to enable processing of video frames using script in a performant manner.
The HTML specification [[!HTML]] defines the ImageBitmap interface that represents a bitmap image. The Media Capture and Streams specification [[!GETUSERMEDIA]] allows a web page to access video streams sourced from cameras that are exposed to the underlying platform, and defines the MediaStreamTrack interface that represents a media source. This specification extends these interfaces to define a model for processing video frames sourced from media stream tracks by a script in a performant manner. This enables use cases such as video editing, digital image processing, and object recognition, among other advanced media usages on the Web Platform.
In this model, a video monitor or video processor is associated with an input media stream track. Events containing input frame data from the input media stream track are dispatched at the video monitor or video processor. A Web developer is able to monitor and process the input frame and produce an output frame using script. Finally, the provided output frame is fed into the output media stream track on the main thread. The output media stream track can then be used to construct a media stream and, for example, provided to media sinks such as <video> for rendering.
The design principle of this push-like mechanism for video processing is depicted in the following figure:

[Figure: an input media stream track pushes input frames to a video monitor or video processor, which produces output frames that are fed into the output media stream track.]
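As a non-normative illustration of this flow, the sketch below hooks a VideoProcessor (defined later in this specification) to a camera track and renders the output; the `mediaStream` variable is assumed to have been obtained earlier, e.g. via getUserMedia().

```
// Minimal sketch of the processing model. `mediaStream` is assumed to have
// been obtained beforehand, e.g. via navigator.mediaDevices.getUserMedia().
var processor = new VideoProcessor();
var inputTrack = mediaStream.getVideoTracks()[0];

// Hooking the processor to the input track returns the output track.
var outputTrack = inputTrack.addVideoProcessor(processor);

// Wrap the output track in a MediaStream and hand it to a <video> sink.
var outputStream = new MediaStream([outputTrack]);
document.querySelector("video").srcObject = outputStream;
```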
This specification attempts to address the Use Cases and Requirements for expanding the Web Platform to support image processing and computer vision usages.
This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.
Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]], as this specification uses that specification and terminology.
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be interpreted as described in RFC2119. For readability, these words do not appear in all uppercase letters in this specification. [[!RFC2119]]
The MediaStreamTrack and MediaStream interfaces this specification extends, and the source concept, are defined in [[!GETUSERMEDIA]].
The ImageBitmap interface and ImageBitmapFactories interface this specification extends are defined in [[!HTML51]].
The BufferSource and ArrayBuffer types are defined in [[!WEBIDL]].
The Promise object is defined in [[!ECMASCRIPT]].
Other concepts and interfaces used by this specification, such as event handlers and event handler event types, are defined in [[!HTML]].
An input frame is an ImageBitmap object representing a frame sourced from the input media stream track associated with the video monitor or video processor.
An output frame is an ImageBitmap object assigned to the outputImageBitmap attribute of the VideoProcessorEvent dispatched at the video processor.
The input media stream track is the MediaStreamTrack object on which the addVideoMonitor() or addVideoProcessor() method is invoked.
The output media stream track is a MediaStreamTrack object returned by the invocation of the addVideoProcessor() method.
There are two kinds of video events: the video monitor event and the video processor event.
The VideoMonitor interface is used to analyze video data directly using JavaScript; the VideoProcessor interface, in addition, allows generating and processing video data. Video monitor events are dispatched at VideoMonitor objects, and video processor events are dispatched at VideoProcessor objects.
```
[Constructor]
interface VideoMonitor : EventTarget {
  attribute EventHandler onvideomonitorchange;
};
```
```
[Constructor]
interface VideoProcessor : VideoMonitor {
  attribute EventHandler onvideoprocessorchange;
};
```
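As a non-normative sketch, a script constructs these objects and registers handlers for their event handler event types as follows:

```
// Monitoring only: analyze frames without producing output.
var monitor = new VideoMonitor();
monitor.onvideomonitorchange = function(event) {
  // Analyze event.inputImageBitmap here.
};

// Processing: analyze frames and produce output frames.
var processor = new VideoProcessor();
processor.onvideoprocessorchange = function(event) {
  // Process event.inputImageBitmap and assign event.outputImageBitmap.
};
```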
The following are the event handlers (and their corresponding event handler event types) that must be supported, as event handler IDL attributes, by objects implementing the VideoMonitor interface:
Event handlers | Event handler event type |
---|---|
onvideomonitorchange | videomonitorchange |
The following are the event handlers (and their corresponding event handler event types) that must be supported, as event handler IDL attributes, by objects implementing the VideoProcessor interface:
Event handlers | Event handler event type |
---|---|
onvideoprocessorchange | videoprocessorchange |
To fire a video event named e, the user agent must run the following steps:

1. Create an event that uses the VideoMonitorEvent interface (for video monitor events) or the VideoProcessorEvent interface (for video processor events), with the event type set to e.
2. Initialize the event's trackId attribute to the value of the id attribute of the input media stream track.
3. Initialize the event's playbackTime attribute to the value of the currentTime attribute of the MediaStream that contains the input media stream track.
4. Initialize the event's inputImageBitmap attribute to the input frame.
5. Dispatch the event at the video monitor or video processor.
VideoMonitorEvent interface
The video monitor event contains an input frame and its metadata originating from the input media stream track. It uses the VideoMonitorEvent interface for its videomonitorchange events:
```
[Constructor(DOMString type, optional VideoMonitorEventInit videoMonitorEventInitDict)]
interface VideoMonitorEvent : Event {
  readonly attribute DOMString    trackId;
  readonly attribute double       playbackTime;
  readonly attribute ImageBitmap? inputImageBitmap;
};

dictionary VideoMonitorEventInit : EventInit {
  required DOMString    trackId;
  required double       playbackTime;
  required ImageBitmap? inputImageBitmap;
};
```
The trackId attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to the empty string. It represents the identifier shared by the video monitor and its input media stream track.
The playbackTime attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to zero. It represents the current stream position, in seconds.
The inputImageBitmap attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to null. It represents the ImageBitmap object whose bitmap data is provided by the input media stream track.
When a user agent is required to fire a video monitor event, it must fire a video event named videomonitorchange.
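For instance, a videomonitorchange handler might read the event's attributes like this (non-normative sketch; `monitor` is a VideoMonitor already added to a track):

```
monitor.onvideomonitorchange = function(event) {
  console.log("track id:      " + event.trackId);      // id of the input track
  console.log("playback time: " + event.playbackTime); // stream position, in seconds
  var frame = event.inputImageBitmap;                  // the input frame, or null
  if (frame) {
    console.log("frame size:    " + frame.width + "x" + frame.height);
  }
};
```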
VideoProcessorEvent interface
The video processor event inherits from the video monitor event, and in addition provides means to programmatically construct an output frame for the output media stream track. It uses the VideoProcessorEvent interface for its videoprocessorchange events:
```
[Constructor(DOMString type, optional VideoProcessorEventInit videoProcessorEventInitDict)]
interface VideoProcessorEvent : VideoMonitorEvent {
  attribute ImageBitmap? outputImageBitmap;
};

dictionary VideoProcessorEventInit : VideoMonitorEventInit {
  required ImageBitmap? outputImageBitmap;
};
```
The outputImageBitmap attribute must return the value it was initialized to. When the object is created, this attribute must be initialized to null. It represents the ImageBitmap object that provides the output frame; on setting, the user agent must feed the assigned ImageBitmap as an output frame into the output media stream track returned by the invocation of the addVideoProcessor() method.
When a user agent is required to fire a video processor event, it must fire a video event named videoprocessorchange.
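A trivial non-normative sketch of a pass-through processor, which feeds every input frame unchanged into the output media stream track:

```
processor.onvideoprocessorchange = function(event) {
  // Pass-through: assigning to outputImageBitmap feeds the frame into the
  // output media stream track.
  event.outputImageBitmap = event.inputImageBitmap;
};
```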
Ideally, the MediaStreamTrack should dispatch a VideoProcessorEvent for each video frame. However, the worker thread may not always be able to process a frame in time, in which case the implementation may skip frames to avoid an excessive memory footprint. Consequently, not every frame of a real-time MediaStream is guaranteed to be processed.
MediaStreamTrack interface
```
partial interface MediaStreamTrack {
  void             addVideoMonitor(VideoMonitor monitor);
  void             removeVideoMonitor(VideoMonitor monitor);
  MediaStreamTrack addVideoProcessor(VideoProcessor processor);
  void             removeVideoProcessor();
};
```
The addVideoMonitor(monitor) method, when invoked, must associate the given monitor with this MediaStreamTrack as its input media stream track, so that video monitor events carrying input frames sourced from this track are dispatched at monitor.

The removeVideoMonitor(monitor) method, when invoked, must dissociate the given monitor from this MediaStreamTrack, so that no further video monitor events for this track are dispatched at it.

The addVideoProcessor(processor) method, when invoked, must associate the given processor with this MediaStreamTrack as its input media stream track, create the output media stream track, and return it; output frames provided through processor are fed into the returned track.

The removeVideoProcessor() method, when invoked, must dissociate the previously added video processor, if any, from this MediaStreamTrack and stop feeding the output media stream track.
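The following non-normative sketch shows the four methods in use, assuming a `mediaStream` obtained earlier:

```
var track = mediaStream.getVideoTracks()[0];

// Monitoring: no output track is produced.
var monitor = new VideoMonitor();
track.addVideoMonitor(monitor);
// ...later:
track.removeVideoMonitor(monitor);

// Processing: an output media stream track is returned.
var processor = new VideoProcessor();
var outputTrack = track.addVideoProcessor(processor);
// ...later:
track.removeVideoProcessor();
```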
The ImageBitmap interface was originally designed as a pure opaque handle to an image data buffer inside a browser, so that how the browser stores the buffer is unknown to users and can be optimized per platform. In this specification, we choose ImageBitmap (instead of ImageData) as the container of video frames because the decoded video frame data might exist in either CPU or GPU memory, which perfectly matches the nature of ImageBitmap as an opaque handle.
Considering how developers would process video frames, there are two possible approaches: via pure JavaScript(/asm.js) code or via WebGL.
In this specification, the original ImageBitmap interface is extended with three methods that let developers read data from an ImageBitmap object into a given BufferSource in one of the supported ImageFormats. How the accessed image data is arranged in memory is described by the proposed ImagePixelLayout definition and ChannelPixelLayout dictionary. Also, the ImageBitmapFactories interface is extended to let developers create an ImageBitmap object from a given BufferSource.
An image or a video frame is conceptually a two-dimensional array of data, and each element in the array is called a pixel. The pixels are usually stored in a one-dimensional array and could be arranged in a variety of image formats. Developers need to know how the pixels are formatted so that they are able to process them.
The image format describes how pixels in an image are arranged. A single pixel has at least one, but usually multiple pixel values. The range of a pixel value varies, which means different image formats use different data types to store a single pixel value.
The most frequently used data type is the 8-bit unsigned integer, whose range is from 0 to 255; others could be 16-bit integers or 32-bit floating point numbers, and so forth. The number of pixel values of a single pixel is called the number of channels of the image format. Multiple pixel values of a pixel are used together to describe the captured property, which could be color or depth information. For example, if the data is a color image in the RGB color space, then it is a three-channel image format and a pixel is described by three pixel values, R, G and B, each ranging from 0 to 255. As another example, if the data is a gray image, then it is a single-channel image format with 8-bit unsigned integer data type, and the pixel value describes the gray scale. Depth data is a single-channel image format too, but the data type is 16-bit unsigned integer and the pixel value is the depth level.
For those image formats whose pixels contain multiple pixel values, the pixel values might be arranged in one of the following ways:

- Planar pixel layout: all pixel values of each channel are stored contiguously in a separate plane (e.g. RRR...GGG...BBB...).
- Interleaving pixel layout: the pixel values of different channels are interleaved pixel by pixel (e.g. RGBRGBRGB...).

Image formats that belong to the same color space might have different pixel layouts.
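The distinction matters when indexing into the underlying buffer. A non-normative sketch, assuming tightly packed 8-bit data with no per-row padding; the helper names are illustrative only:

```
// Interleaving layout (e.g. RGBRGBRGB...): the k-th channel value of the
// pixel at (row, col), with `channels` values per pixel.
function interleavedValue(data, width, row, col, k, channels) {
  return data[(row * width + col) * channels + k];
}

// Planar layout (e.g. RRR...GGG...BBB...): each channel occupies its own
// contiguous plane of width * height values; plane k starts at k * width * height.
function planarValue(data, width, height, row, col, k) {
  return data[k * width * height + row * width + col];
}
```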
ImageFormat enumeration
The ImageFormat enumeration is used to select the image format for the ImagePixelLayout. The ImageBitmap extensions defined in this specification use this enumeration to negotiate the image format while accessing the underlying data of ImageBitmap and creating a new ImageBitmap.
We need to elaborate this list for standardization.
```
enum ImageFormat {
  "RGBA32",
  "BGRA32",
  "RGB24",
  "BGR24",
  "GRAY8",
  "YUV444P",
  "YUV422P",
  "YUV420P",
  "YUV420SP_NV12",
  "YUV420SP_NV21",
  "HSV",
  "Lab",
  "DEPTH",
  /* empty string */ ""
};
```
ImageFormat | Channel order | Channel size | Pixel layout | Data type |
---|---|---|---|---|
RGBA32 | R, G, B, A | full rgba-channels | interleaving rgba-channels | 8-bit unsigned integer |
BGRA32 | B, G, R, A | full bgra-channels | interleaving bgra-channels | 8-bit unsigned integer |
RGB24 | R, G, B | full rgb-channels | interleaving rgb-channels | 8-bit unsigned integer |
BGR24 | B, G, R | full bgr-channels | interleaving bgr-channels | 8-bit unsigned integer |
GRAY8 | GRAY | full gray-channel | planar gray-channel | 8-bit unsigned integer |
YUV444P | Y, U, V | full yuv-channels | planar yuv-channels | 8-bit unsigned integer |
YUV422P | Y, U, V | full y-channel, half uv-channels | planar yuv-channels | 8-bit unsigned integer |
YUV420P | Y, U, V | full y-channel, quarter uv-channels | planar yuv-channels | 8-bit unsigned integer |
YUV420SP_NV12 | Y, U, V | full y-channel, quarter uv-channels | planar y-channel, interleaving uv-channels | 8-bit unsigned integer |
YUV420SP_NV21 | Y, V, U | full y-channel, quarter uv-channels | planar y-channel, interleaving vu-channels | 8-bit unsigned integer |
HSV | H, S, V | full hsv-channels | interleaving hsv-channels | 32-bit IEEE floating point number |
Lab | l, a, b | full lab-channels | interleaving lab-channels | 32-bit IEEE floating point number |
DEPTH | DEPTH | full depth-channel | planar depth-channel | 16-bit unsigned integer |
"" (the empty string) | N/A | N/A | N/A | N/A |
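As a worked, non-normative example, the buffer size needed for a YUV420P image follows directly from the table: a full-resolution y-channel plus two quarter-resolution chroma channels (assuming 8-bit samples and no per-row padding, i.e. stride equals width):

```
// Bytes needed for a YUV420P frame, assuming stride == width (no padding).
function yuv420pLength(width, height) {
  var ySize  = width * height;                               // full y-channel
  var uvSize = Math.ceil(width / 2) * Math.ceil(height / 2); // one quarter-size chroma channel
  return ySize + 2 * uvSize;
}

yuv420pLength(620, 480); // 297600 + 2 * 74400 = 446400 bytes
```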
ChannelPixelLayoutDataType enumeration
The ChannelPixelLayoutDataType enumeration is used to select the channel data type that is used to store a single pixel value.
```
enum ChannelPixelLayoutDataType {
  "uint8",
  "int8",
  "uint16",
  "int16",
  "uint32",
  "int32",
  "float32",
  "float64"
};
```
DataType | Description |
---|---|
uint8 | 8-bit unsigned integer. |
int8 | 8-bit signed integer. |
uint16 | 16-bit unsigned integer. |
int16 | 16-bit signed integer. |
uint32 | 32-bit unsigned integer. |
int32 | 32-bit signed integer. |
float32 | 32-bit IEEE floating point number. |
float64 | 64-bit IEEE floating point number. |
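Each channel data type maps naturally onto a JavaScript typed-array view; the following non-normative sketch shows one such (hypothetical) mapping a script might use when reading mapped channel data:

```
// Hypothetical mapping from ChannelPixelLayoutDataType values to the
// typed-array constructors suitable for viewing such channel data.
var dataTypeToView = {
  "uint8":   Uint8Array,
  "int8":    Int8Array,
  "uint16":  Uint16Array,
  "int16":   Int16Array,
  "uint32":  Uint32Array,
  "int32":   Int32Array,
  "float32": Float32Array,
  "float64": Float64Array
};
```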
To generalize the variety of pixel layouts among image formats, this specification proposes the ChannelPixelLayout dictionary and the ImagePixelLayout definition, which is a sequence of ChannelPixelLayout.
The ImagePixelLayout represents the pixel layout of a certain image format. Since an image format is composed of at least one channel, an ImagePixelLayout object contains at least one ChannelPixelLayout object.
Although an image or a video frame is a two-dimensional structure, its data is usually stored in a one-dimensional array in row-major order, and each channel describes how its pixel values are arranged in that one-dimensional array buffer.
A channel has an associated offset that denotes the beginning position of the channel's data relative to the given BufferSource parameter of the mapDataInto() method.
A channel has an associated width and height that denote the width and height of the channel respectively. Each channel in an image format may have different height and width.
A channel has an associated data type used to store one single pixel value.
A channel has an associated stride that is the number of bytes between the beginnings of two consecutive rows in memory. (The total bytes of each row plus the padding bytes of each row.)
A channel has an associated skip value. The value is zero for the planar pixel layout, and a positive integer for the interleaving pixel layout. (Describes how many bytes there are between two adjacent pixel values in this channel.)
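Putting offset, stride, data type, and skip together, the byte position of a single pixel value can be computed as in the non-normative sketch below (consistent with the examples that follow; `valueSize` is the byte size of the channel's data type, e.g. 1 for uint8, 2 for uint16):

```
// Byte offset, relative to the mapped BufferSource, of the pixel value at
// (row, col) within a channel. Consecutive values in a row lie
// (valueSize + skip) bytes apart: skip = 0 for planar layouts and a
// positive integer for interleaving layouts.
function pixelValueOffset(channel, row, col, valueSize) {
  return channel.offset + row * channel.stride + col * (valueSize + channel.skip);
}
```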
```
Example 1: RGBA32 image, width = 620, height = 480, stride = 2560

channel_r: offset = 0, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
channel_g: offset = 1, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
channel_b: offset = 2, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3
channel_a: offset = 3, width = 620, height = 480, data type = uint8, stride = 2560, skip = 3

        <---------------------------- stride ---------------------------->
        <---------------------- width x 4 ---------------------->
[index] 01234   8   12   16   20   24   28                2479      2559
        |||||---|---|---|---|---|---|---------------------|---------|
[data]  RGBARGBARGBARGBARGBAR___R___R...                  A%%%%%%%%
[data]  RGBARGBARGBARGBARGBAR___R___R...                  A%%%%%%%%
[data]  RGBARGBARGBARGBARGBAR___R___R...                  A%%%%%%%%
         ^^^
         r-skip
```
```
Example 2: YUV420P image, width = 620, height = 480, stride = 640

channel_y: offset = 0,      width = 620, height = 480, data type = uint8, stride = 640, skip = 0
channel_u: offset = 307200, width = 310, height = 240, data type = uint8, stride = 320, skip = 0
channel_v: offset = 384000, width = 310, height = 240, data type = uint8, stride = 320, skip = 0

        <--------------------------- y-stride --------------------------->
        <----------------------- y-width ----------------------->
[index] 012345                                           619      639
        ||||||--------------------------------------------|--------|
[data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                  Y%%%%%%%%%
[data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                  Y%%%%%%%%%
[data]  ......

        <-------- u-stride ---------->
        <----- u-width ----->
[index] 307200              307509   307519
        |-------------------|--------|
[data]  UUUUUUUUUU...       U%%%%%%%%%
[data]  ......

        <-------- v-stride ---------->
        <----- v-width ----->
[index] 384000              384309   384319
        |-------------------|--------|
[data]  VVVVVVVVVV...       V%%%%%%%%%
[data]  ......
```
```
Example 3: YUV420SP_NV12 image, width = 620, height = 480, stride = 640

channel_y: offset = 0,      width = 620, height = 480, data type = uint8, stride = 640, skip = 0
channel_u: offset = 307200, width = 310, height = 240, data type = uint8, stride = 640, skip = 1
channel_v: offset = 307201, width = 310, height = 240, data type = uint8, stride = 640, skip = 1

        <--------------------------- y-stride -------------------------->
        <----------------------- y-width ---------------------->
[index] 012345                                          619     639
        ||||||-------------------------------------------|-------|
[data]  YYYYYYYYYYYYYYYYYYYYYYYYYYYYY...                 Y%%%%%%%%%
[data]  ......

        <--------------------- u-stride / v-stride --------------------->
        <------------------ u-width + v-width ------------------>
[index] 307200 (u-offset)                               307819  307839
        |307201 (v-offset)                               |307820
        ||-----------------------------------------------||------|
[data]  UVUVUVUVUVUVUVUVUVUVUVUVUVUVUV...                UV%%%%%%%
[data]  UVUVUVUVUVUVUVUVUVUVUVUVUVUVUV...                UV%%%%%%%
        ^ ^
        u-skip / v-skip
```
```
Example 4: DEPTH image, width = 640, height = 480, stride = 1280

channel_d: offset = 0, width = 640, height = 480, data type = uint16, stride = 1280, skip = 0

        <----------------------- d-stride ----------------------->
        <----------------------- d-width ------------------------>
[index] 012345                                               1280
        ||||||-----------------------------------------------|
[data]  DDDDDDDDDDDDDDDDDDDDDDDDDDDDD...                     D
[data]  DDDDDDDDDDDDDDDDDDDDDDDDDDDDD...                     D
[data]  ......
```
ChannelPixelLayout dictionary
Each channel is represented by a ChannelPixelLayout object.
```
dictionary ChannelPixelLayout {
  required unsigned long              offset;
  required unsigned long              width;
  required unsigned long              height;
  required ChannelPixelLayoutDataType dataType;
  required unsigned long              stride;
  required unsigned long              skip;
};
```
The offset attribute represents the channel's offset.
The width attribute represents the width of the channel. (Channels in an image format may have different widths.)
The height attribute represents the height of the channel. (Channels in an image format may have different heights.)
The dataType attribute represents the data type of the channel, one of the channel data types.
The stride attribute represents the stride of the channel.
The skip attribute represents the skip value for the channel.
ImagePixelLayout definition
typedef sequence<ChannelPixelLayout> ImagePixelLayout;
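For instance, an ImagePixelLayout for a tightly packed RGBA32 image can be built in script as follows (non-normative sketch; the helper name is illustrative):

```
// ImagePixelLayout for a tightly packed RGBA32 image (stride = width * 4).
// The four interleaved channels differ only in their byte offset.
function makeRGBA32Layout(width, height) {
  var stride = width * 4;
  return [0, 1, 2, 3].map(function(offset) {
    return { offset: offset, width: width, height: height,
             dataType: "uint8", stride: stride, skip: 3 };
  });
}
```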
ImageBitmap interface
```
[Exposed=(Window,Worker)]
partial interface ImageBitmap {
  ImageFormat               findOptimalFormat(optional sequence<ImageFormat> possibleFormats);
  long                      mappedDataLength(ImageFormat format);
  Promise<ImagePixelLayout> mapDataInto(ImageFormat format, BufferSource buffer,
                                        long offset, long length);
};
```
The findOptimalFormat(possibleFormats) method, when invoked, must return the ImageFormat, among the given possibleFormats (or among all formats the user agent supports, if possibleFormats is not provided), into which the ImageBitmap's underlying data can be mapped most efficiently, or the empty string if none of the candidate formats is supported.

The mappedDataLength(format) method must run the following steps:

1. If format is not supported, throw a NotSupportedError.
2. Return the number of bytes needed to store the ImageBitmap's underlying data in the given format.

The mapDataInto(format, buffer, offset, length) method must run the following steps:

1. If the range of buffer indicated by offset and length is out of the bounds of buffer, throw an IndexSizeError and abort these steps.
2. If format is not supported, throw a NotSupportedError and terminate these steps.
3. Return a promise, copy the ImageBitmap's underlying data, converted into the given format, into buffer starting at offset, and resolve the promise with an ImagePixelLayout object describing how the copied data is arranged.
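Taken together, the three methods are typically called in sequence, as in this non-normative sketch (see also the full worker example below):

```
// Typical call sequence for reading an ImageBitmap's underlying data.
var format = bitmap.findOptimalFormat(["RGBA32", "YUV420P"]);
if (format !== "") {
  var length = bitmap.mappedDataLength(format);
  var buffer = new ArrayBuffer(length);
  bitmap.mapDataInto(format, buffer, 0, length).then(function(layout) {
    // `layout` describes how the data now in `buffer` is arranged,
    // one ChannelPixelLayout per channel.
  });
}
```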
ImageBitmapFactories interface
```
[NoInterfaceObject, Exposed=(Window,Worker)]
partial interface ImageBitmapFactories {
  Promise<ImageBitmap> createImageBitmap(BufferSource buffer, long offset, long length,
                                         ImageFormat format, ImagePixelLayout layout);
};
```
The createImageBitmap(buffer, offset, length, format, layout) method must run the following steps:

1. If format is not supported, or if the data in buffer cannot be interpreted according to the given format and layout, throw an InvalidStateError and terminate these steps.
2. Return a promise, create a new ImageBitmap object whose underlying data is copied from the given BufferSource buffer, starting at offset and spanning length bytes, interpreted according to format and layout, and resolve the promise with the new ImageBitmap object.
This example demonstrates how to hook a worker to a MediaStreamTrack as a video processor and use the ImageBitmap extensions from within a worker script.
```
<script>
  var processor = new VideoProcessor();
  var inputTrack = mediaStream.getVideoTracks()[0];
  var outputTrack = inputTrack.addVideoProcessor(processor);
  newMediaStream.addTrack(outputTrack);

  var worker = new Worker("processing.js");
  worker.postMessage({aCommand: 'pass_processor', aProcessor: processor}, [processor]);
</script>
```
The worker executes the following processing.js script:
```
self.onmessage = function(msg) {
  switch (msg.data.aCommand) {
    case 'pass_processor':
      bindProcessor(msg.data.aProcessor);
      break;
    default:
      throw 'no aTopic on incoming message to Worker';
  }
};

function bindProcessor(processor) {
  processor.onvideoprocessorchange = function(event) {
    // Check if the browser supports a YUV format.
    var bitmap = event.inputImageBitmap;
    var yuvFormats = ["YUV444P", "YUV422P", "YUV420P",
                      "YUV420SP_NV12", "YUV420SP_NV21"];
    var bitmapFormat = bitmap.findOptimalFormat(yuvFormats);
    if (bitmapFormat == "") {
      console.log("The browser does not support YUV formats.");
      return;
    }

    // Get the needed buffer size to read the image data in the YUV format.
    var bitmapBufferLength = bitmap.mappedDataLength(bitmapFormat);

    // Create the buffer for mapping data out.
    var bitmapBuffer = new ArrayBuffer(bitmapBufferLength);
    var bitmapBufferView = new Uint8ClampedArray(bitmapBuffer, 0, bitmapBufferLength);

    // Map the bitmap's data into the buffer created in the previous step.
    var promise = bitmap.mapDataInto(bitmapFormat, bitmapBuffer, 0, bitmapBufferLength);
    promise.then(function(bitmapPixelLayout) {
      // Read out the y-channel properties. (An ImagePixelLayout is a
      // sequence of ChannelPixelLayouts; the y-channel comes first.)
      var ywidth  = bitmapPixelLayout[0].width;
      var yheight = bitmapPixelLayout[0].height;
      var yoffset = bitmapPixelLayout[0].offset;
      var ystride = bitmapPixelLayout[0].stride;
      var yskip   = bitmapPixelLayout[0].skip; // This should be 0.

      // Initialize the buffer for the resulting gray image.
      var rgbaBufferLength = ywidth * yheight * 4;
      var rgbaBuffer = new ArrayBuffer(rgbaBufferLength);
      var rgbaBufferView = new Uint8ClampedArray(rgbaBuffer, 0, rgbaBufferLength);

      // Convert YUV to gray by copying the y-channel into R, G and B.
      for (var i = 0; i < yheight; ++i) {
        for (var j = 0; j < ywidth; ++j) {
          var y = bitmapBufferView[yoffset + ystride * i + j];
          var dst = (ywidth * i + j) * 4;
          rgbaBufferView[dst + 0] = y;
          rgbaBufferView[dst + 1] = y;
          rgbaBufferView[dst + 2] = y;
          rgbaBufferView[dst + 3] = 255;
        }
      }

      // Create a new ImageBitmap from the processed rgbaBuffer and assign it
      // to event.outputImageBitmap.
      var channelR = {offset: 0, width: ywidth, height: yheight,
                      dataType: "uint8", stride: ywidth * 4, skip: 3};
      var channelG = {offset: 1, width: ywidth, height: yheight,
                      dataType: "uint8", stride: ywidth * 4, skip: 3};
      var channelB = {offset: 2, width: ywidth, height: yheight,
                      dataType: "uint8", stride: ywidth * 4, skip: 3};
      var channelA = {offset: 3, width: ywidth, height: yheight,
                      dataType: "uint8", stride: ywidth * 4, skip: 3};
      var layout = [channelR, channelG, channelB, channelA];
      var p = createImageBitmap(rgbaBuffer, 0, rgbaBufferLength, "RGBA32", layout);
      p.then(function(bitmap) {
        event.outputImageBitmap = bitmap;
      }).catch(function(ex) {
        console.log("Call createImageBitmap() failed. Error: " + ex);
      });
    }, function(ex) {
      console.log("Call mapDataInto() failed. Error: " + ex);
    });
  };
}
```
Thanks to Robert O'Callahan for his idea of this design.