W3C Workshop on Web and Machine Learning

Media processing hooks for the Web - by François Daoust (W3C)


W3C workshop on
Web and Machine Learning

Media processing hooks for the Web

François Daoust – @tidoust
Summer 2020

Hello.

I'm François Daoust.

I work for W3C where I keep track of media related activities.

The goal of this presentation is to take a peek at mechanisms that allow, or will allow, media processing on the Web.

I will barely touch on machine learning per se; it is just one possible approach that people might want to apply to process media.

Main media scenarios

① Progressive media playback
Playback of media files
<audio>, <video>, HTMLMediaElement in HTML
② Adaptive media playback
Professional/Commercial playback
Media Source Extensions (MSE)
③ Real-time conversations
Audio/Video conferencing
WebRTC
④ Synthesized audio playback
Music generation, Sound effects
Web Audio API

In the past few years, I would say that standardization has enabled 4 different media scenarios on the Web.

First one is support for basic media playback, also known as progressive media playback, in other words the ability to play a simple media file.

This was enabled by the audio and video tags in HTML and the underlying HTMLMediaElement interface.

Second scenario is support for adaptive media playback.

Here, the goal is to adjust video parameters (typically, the resolution) in real time to the user's current environment.

This scenario was enabled by Media Source Extensions or MSE for short.

It brought professional and commercial media content to the Web.

MSE is, by far, the main mechanism used to stream on-demand media content on the Web today.

Third media scenario is different.

It is about real-time audio/video conversations.

This was enabled by WebRTC.

And WebRTC is also used to stream media in scenarios that require latency to be minimal, for instance cloud gaming.

Fourth media scenario is also completely different.

It is about synthesized audio playback: to generate music, or sound effects in games for instance.

This was enabled by the Web Audio API.

Media content

① Progressive media playback
Media container file (e.g. MP4, OGG, WebM)
Multiplex of encoded audio/video tracks (e.g. H.264, AV1, AAC)
② Adaptive media playback
Media stream (e.g. ISOBMFF, MPEG2-TS, WebM)
Data segments assembled on the fly
③ Real-time conversations
Individual encoded/decoded Media stream tracks
Coupling between encoding/decoding and transport
④ Synthesized audio playback
Short encoded/decoded audio samples

The actual media content depends on the scenario.

We're talking about files for progressive media playback, streams for adaptive media playback, individual media stream tracks for WebRTC and short audio samples for Web Audio.

Media encoding/decoding pipeline

In a typical media pipeline, media content first gets recorded from a camera or microphone. This creates a raw stream of audio/video frames. These raw frames are then encoded to save memory and bandwidth, and multiplexed (also known as muxed) to mix related audio/video/text tracks before the result may be sent over the network, either directly to a receiving peer or to a server. The decoding side is roughly symmetric. Media content needs to be fetched, demuxed, decoded to produce raw audio and video frames, and rendered to the final display or speakers.

In a typical media pipeline, media content gets recorded from a camera or microphone, encoded to save memory and bandwidth.

Then different media tracks are muxed together and the result is sent to the network.

The decoding side is roughly symmetric to the encoding side.

Media content needs to be fetched, demuxed, decoded to produce raw audio and video frames, and rendered to the final display or speakers.

The mux and demux operations are only needed in progressive and adaptive media playback scenarios.

WebRTC deals with individual media stream tracks directly so no mux and demux operations per se there.

In the interest of time, I will mostly focus on the decoding side, but some considerations apply to the encoding side as well.
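To make the encoding side concrete, here is a minimal sketch of the record and encode steps using getUserMedia and MediaRecorder; the mimeType string and the sendToServer function are placeholders for this example:

    // Inside an async function (or a JavaScript module).
    // The browser records, encodes and muxes; the application only ever
    // sees encoded, containerized chunks.
    const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
    const recorder = new MediaRecorder(stream, { mimeType: 'video/webm' });
    recorder.ondataavailable = (event) => {
      // event.data is an encoded, muxed chunk ready to be sent over the network.
      sendToServer(event.data); // hypothetical upload function
    };
    recorder.start(1000); // emit a chunk roughly every second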

Media processing scenarios

Media stream analysis
  • Barcode reading
  • Face recognition
  • Gesture/Presence tracking
  • Emotion analysis
  • Speech recognition
  • Depth data streams processing
Media stream manipulation
  • Funny hats
  • Voice effects
  • Background removal or blurring
  • In-browser composition
  • Augmented reality
  • Non-linear video editing

So why would people want to process media streams?

Well, they may want to analyze frames to detect objects, faces, or gestures.

Or they may want to actually modify the streams in real time, either to add overlays in real time (for instance to add funny hats), remove the background, or simply because the client application is a media authoring tool.

Processing hooks needed!

Most media processing scenarios need processing hooks either on the encoder side between the record and encode operations, or on the decoder side between the decode and render operations.

In most cases, these scenarios need to process individual decoded audio or video frames.

So, if we look back at our typical media pipeline, ideally, we'd like to hook processing between the record and the encode operations on the encoding side, and between the decode and the render operations on the decoding side.

Existing hooks for…
① progressive media playback

HTMLMediaElement takes the URL of a media file and renders audio and video frames. The browser does all the work (fetch, demux, decode, render). In essence, the media element does not expose any hook to process encoded or decoded frames.

No hooks for progressive media playback…

OK, let's get back to our four media scenarios.

For progressive media playback, the HTMLMediaElement takes the URL of a media file and renders the result.

The browser does all the work.

And that's too bad, because that means there is no way to hook any processing step there!

Existing hooks for…
① progressive media playback

One can still process rendered frames by drawing them repeatedly onto a canvas to access the actual pixels, processing the contents of the canvas, and rendering the result onto a final canvas.

No hooks for progressive media playback…
… but you can create one for video frames with <canvas>.

But you can cheat!

You create your own processing hook.

For video, you can copy rendered frames repeatedly to a canvas element, process the pixels extracted from the canvas, and render the result to a final canvas.

The solution is not fantastic because it is not very efficient, but at least it exists.
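A minimal sketch of that canvas-based hook, assuming a <video> element that is already playing and a processPixels function standing in for the application's actual per-frame processing:

    const video = document.querySelector('video');
    const canvas = document.querySelector('canvas');
    const ctx = canvas.getContext('2d');

    function processFrame() {
      // Copy the currently rendered video frame onto the canvas...
      ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
      // ...read the raw pixels back (the costly part)...
      const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
      processPixels(frame.data); // placeholder for the actual processing
      // ...and render the processed pixels.
      ctx.putImageData(frame, 0, 0);
      requestAnimationFrame(processFrame);
    }
    requestAnimationFrame(processFrame);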

Existing hooks for…
② adaptive media playback

MSE allows applications to take control of the fetch and demux operations. The rest of the pipeline remains handled by HTMLMediaElement.

No hooks for adaptive media playback…
… but you can also use the <canvas> workaround.

MSE is essentially a demuxing API.

In other words, it allows applications to take control of the demux operation and, by extension, of the fetch operation.

MSE does not expose decoded frames, so no processing hook there either, but the same canvas “hack” can be used.
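A minimal MSE sketch, where the application takes over fetching media segments and appends them to a source buffer; the codec string and segment URL are placeholders:

    const video = document.querySelector('video');
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', async () => {
      // The codec string must match the actual segments; this one is an example.
      const sourceBuffer = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
      const segment = await fetch('/segments/init.mp4').then((r) => r.arrayBuffer());
      sourceBuffer.appendBuffer(segment);
      // Decoding and rendering still happen inside the HTMLMediaElement:
      // decoded frames are never exposed to the application.
    });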

Existing hooks for…
③ real-time conversations

WebRTC takes care of the fetch and decode operations (no demux in WebRTC scenarios). However, decoded frames remain opaque from an application perspective and can only be attached to an HTMLMediaElement in practice.

No hooks for WebRTC either…
… same <canvas> workaround possible.

For WebRTC, first remember that there is no demux operation.

WebRTC takes care of the fetch and decode operations.

In theory, that's fantastic news.

We should get decoded frames out of WebRTC.

However, in practice, a “decoded” stream in WebRTC is represented as an abstract and opaque MediaStreamTrack object and the actual contents of the decoded audio and video frames are not exposed to applications.

So, back to the same canvas hack again...
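A sketch of what WebRTC exposes in practice, assuming an already negotiated RTCPeerConnection named pc (signaling omitted):

    pc.ontrack = (event) => {
      // The decoded stream is exposed as opaque MediaStream / MediaStreamTrack
      // objects: they can be attached to a <video> element, but their pixels
      // are not accessible from the application.
      const video = document.querySelector('video');
      video.srcObject = event.streams[0];
      // To process frames, the application is back to drawing the <video>
      // onto a <canvas> and reading the pixels from there.
    };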

Existing hooks for…
④ synthesized audio playback

  • The Web Audio API is a processing API at its heart
  • Custom processing code can be injected through audio worklets
  • No support for streaming and no indication of progress
  • Good fit for short audio samples, not long streams of audio

The Web Audio API is a processing API at its heart.

However, the whole API is meant to operate on an entire file.

There is no support for streaming and no indication of progress when processing a large file.

This works well for short audio samples, but is really not a good fit for streaming scenarios.
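A minimal sketch of injecting custom processing code through an audio worklet; the 'halve-gain' processor name, the file name and the sourceNode variable are made up for this example:

    // halve-gain-processor.js — runs on the audio rendering thread.
    class HalveGainProcessor extends AudioWorkletProcessor {
      process(inputs, outputs) {
        const input = inputs[0];
        const output = outputs[0];
        for (let channel = 0; channel < input.length; channel++) {
          for (let i = 0; i < input[channel].length; i++) {
            output[channel][i] = input[channel][i] * 0.5; // arbitrary processing
          }
        }
        return true; // keep the processor alive
      }
    }
    registerProcessor('halve-gain', HalveGainProcessor);

    // Main thread (inside an async function).
    const audioContext = new AudioContext();
    await audioContext.audioWorklet.addModule('halve-gain-processor.js');
    const processingNode = new AudioWorkletNode(audioContext, 'halve-gain');
    sourceNode.connect(processingNode).connect(audioContext.destination);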

Existing hooks…
summary

  • Dedicated API for audio but not geared towards streaming scenarios
  • No existing processing hook at the right step otherwise
  • Rendered video frames may be copied to a <canvas> and processed
    (but that is not very efficient ☹)

In other words, today, the only way to process audio is through the Web Audio API, and that is only really useful for short audio samples.

And the only way to process video is by capturing rendered frames onto a canvas, which is not very efficient.

Media pipeline in JavaScript / WebAssembly

The entire media pipeline may be implemented in JavaScript / WebAssembly, rendering the result to a canvas
  • Applications may also implement the entire media pipeline in JS/WASM
  • Obviously allows processing hooks to be added wherever needed
  • Increased bandwidth, bad performance, reduced power efficiency ☹

I lied.

There is another way!

A Web application may also choose to implement the entire media pipeline on its own, using JavaScript or WebAssembly, rendering the result to a canvas.

This is suboptimal, especially on constrained devices, but know that some media authoring applications actually do that.
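A rough sketch of that approach, assuming a hypothetical WebAssembly demuxer/decoder whose API (wasmDecoder.open, wasmDecoder.nextFrame) is invented for the example:

    // Inside an async function: fetch, demux/decode in WASM, render to a canvas.
    const ctx = document.querySelector('canvas').getContext('2d');
    const response = await fetch('/media/movie.mp4');
    const bytes = new Uint8Array(await response.arrayBuffer());

    // wasmDecoder stands for a WebAssembly build of a demuxer/decoder
    // (e.g. compiled from an existing media library); its API is made up here.
    wasmDecoder.open(bytes);
    let frame;
    while ((frame = wasmDecoder.nextFrame())) {
      // frame.rgba is assumed to be a Uint8ClampedArray of raw RGBA pixels.
      ctx.putImageData(new ImageData(frame.rgba, frame.width, frame.height), 0, 0);
      await pacing(); // hypothetical helper to keep playback at real-time speed
    }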

Main requirements for efficient media processing

Raw data processing needs:
1 video frame in an HD video ≈ 1920x1080 x 3 components x 4 bytes ≈ 24MB
1 second of HD video ≈ 25 * 24MB ≈ 600MB

  • Avoid copies in/across memory (e.g. CPU, GPU, Worker, WebAssembly)
  • Expose data as raw as possible (e.g. YUV or ARGB for decoded video)
  • Leverage processing power (e.g. workers, WebAssembly, GPU, ML)
  • Make it hard to create inefficient processing pipelines
  • Control over processing speed (real-time or faster than real-time)
  • Stream processing (as opposed to processing of entire files)
  • Processing should not de-synchronize media streams

If existing solutions are not efficient enough, what would make media processing efficient?

It essentially boils down to sizes.

To give you a rough idea, a single video frame in a full HD video weighs about 24MB, so processing a full HD video in real time means processing about 600MB of raw data per second.

Any efficient solution should avoid having to manipulate or copy bytes around, while making it possible to leverage the CPU, the GPU, WebAssembly and machine learning hardware, while also giving some knobs to control the processing speed and, last but not least, while keeping related tracks synchronized!

That's a lot of requirements and the list is not even exhaustive.

That typically explains why decoded media has remained opaque until now.

It is hard to expose decoded media on the Web!
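To illustrate the “avoid copies” requirement, here is a small sketch of handing a frame's pixel buffer to a worker as a transferable, so the buffer (roughly 24MB per full HD frame, following the figures above) is moved rather than copied; the worker script name is an assumption:

    const worker = new Worker('frame-processor.js'); // hypothetical worker script

    function sendFrameToWorker(pixels /* ArrayBuffer holding one raw frame */) {
      // Listing the buffer in the transfer list moves its memory to the worker
      // instead of copying it; the sender loses access to the buffer.
      worker.postMessage({ pixels }, [pixels]);
    }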

WebCodecs

  • Efficient access to built-in (software and hardware) media encoders/decoders
  • Exposes needed media processing hooks
WebCodecs allows applications to take control of the decode operation in the decoding pipeline and to feed the result into a canvas. As opposed to WebRTC, applications would be able to access decoded frames for processing.

It is hard, but it is not impossible.

The WebCodecs API, initially proposed by Google, aims at providing access to built-in media encoders and decoders.

As a by-product, the API exposes decoded frames, allowing applications to hook into the output of the API to process these frames.

Note that the API does not take care of the mux/demux operations, which will have to be done by the application through some JavaScript or WebAssembly code.

This shouldn't be a problem in practice, as these operations are not really heavy on the CPU.
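The API was still being shaped when this talk was recorded, so the sketch below is only indicative of the general VideoDecoder design rather than a stable interface; the codec string and the encodedChunks source are placeholders:

    const ctx = document.querySelector('canvas').getContext('2d');

    const decoder = new VideoDecoder({
      output: (frame) => {
        // Decoded frames are exposed to the application: process or draw them...
        ctx.drawImage(frame, 0, 0);
        frame.close(); // ...and release the underlying memory promptly.
      },
      error: (e) => console.error(e),
    });
    decoder.configure({ codec: 'vp8' }); // example codec string

    // Encoded chunks come from application-level demuxing (JS/WASM), not shown here.
    for (const chunk of encodedChunks) {
      decoder.decode(new EncodedVideoChunk({
        type: chunk.isKeyFrame ? 'key' : 'delta',
        timestamp: chunk.timestamp,
        data: chunk.data,
      }));
    }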

WebCodecs — status

  • Interfaces being shaped as this talk is recorded
  • Lots of open questions, e.g. around integration with other specs such as WebAssembly, MSE, WebRTC and WebXR
  • Not based on WHATWG Streams
  • Incubation in WICG, potential deliverable for Media WG

The API is very much in flux, so it's hard to assess anything about WebCodecs for now.

There are a number of open questions, starting with “how can WebCodecs be integrated with other specs?”: WebAssembly, MSE, WebRTC, WebXR for rendering video in immersive contexts.

Well, and of course, the integration question also exists for machine learning APIs.

It's worth noting that the WebRTC Working Group has started to work on WebRTC Insertable Media using Streams, which should eventually build on top of WebCodecs.

By the way, it seems worth noting that WebCodecs is not based on WHATWG Streams.

That was the initial goal, but it turned out to be very difficult.

So having a way to process WHATWG Streams does not necessarily mean that it will be de facto possible to process media streams.

Incubation takes place in the Web Platform Incubator Community Group.

The Media Working Group may adopt and standardize the proposal once it's ready.

Other media features and specs that may impact processing

  • Codec switching in MSE
  • HTMLVideoElement.requestVideoFrameCallback()
  • Immersive videos: 360°, volumetric, 3D
  • Media Timed Events
  • Encrypted media

See the Overview of Media Technologies for the Web document for details

Before I conclude, I'd like to raise a few other media technologies and proposals that may or may not affect the way processing gets done in the future.

For instance, the Media Working Group is standardizing a codec switching feature for MSE, to make it possible to insert an ad, which may be in a lower resolution, in the middle of a 4K video.

How would such a switch affect processing?

In some “processing” scenarios, all that may be required is rendering an overlay on top of a video without touching the video per se.

That is precisely what the HTMLVideoElement.requestVideoFrameCallback proposal is offering.

How does that affect everything that I presented so far?
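A sketch of that overlay-only approach with requestVideoFrameCallback, drawing on top of each presented frame without touching the video itself; drawOverlayFor is a placeholder for the application's own logic:

    const video = document.querySelector('video');
    const overlay = document.querySelector('canvas'); // positioned above the video
    const ctx = overlay.getContext('2d');

    function onFrame(now, metadata) {
      ctx.clearRect(0, 0, overlay.width, overlay.height);
      // metadata.mediaTime identifies the frame being presented.
      drawOverlayFor(metadata.mediaTime, ctx);
      video.requestVideoFrameCallback(onFrame);
    }
    video.requestVideoFrameCallback(onFrame);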

Flat videos are no longer the only videos in town.

Other types of videos get rendered for VR/AR and more immersive experiences.

This includes 360 videos, volumetric videos, and more generally 3D playback experiences.

How does processing apply to these cases?

Also, I have only focused on audio and video tracks, but media content also contains event tracks that are increasingly used to control the playback experience.

Media industries are putting a lot of effort these days into standardizing solutions to expose media timed events to applications.

The generic idea is to rely on the user agent to do the work.

If WebCodecs requires applications to demux media content themselves, then what happens to event tracks?

One final note on encrypted media.

In general, the goal there is to hide the bytes from the applications, so encrypted media is typically not a good candidate for application-controlled processing operations.

There may still be scenarios where the ability to process encrypted media in a separate sandbox could be useful.

So, something that could be worth keeping in mind.

All these features and others are mentioned in the Overview of Media Technologies for the Web, which I invite you to take a peek at.

Conclusion

  • Exposing decoded frames to applications is easier said than done
  • No satisfactory solution for media processing today
  • WebCodecs to the rescue!
  • Coordination needed for integrating WebCodecs in other technologies

Thank you!

So, what I have tried to convey here is that exposing decoded media frames to applications is doable but not easy, or not as easy as it might sound.

There is no satisfactory solution today.

WebCodecs seems very promising but there are lots of open questions that still need to be discussed.

In order to be successful, the work on WebCodecs would greatly benefit from coordination and engagement from people involved in other technologies.

And that includes people looking at exposing machine learning capabilities to Web applications!

OK, I'm done, thank you, bye!

