W3C workshop on
Web and Machine Learning
Media processing hooks for the Web
François Daoust – @tidoust
Summer 2020
Hello.
I'm François Daoust.
I work for W3C where I keep track of media related activities.
The goal of this presentation is to take a peek at mechanisms that allow, or will allow, media processing on the Web.
I will hardly touch on machine learning itself, except as one possible approach that people may want to apply to process media.
In the past few years, I would say that standardization has enabled 4 different media scenarios on the Web.
First one is support for basic media playback, also known as progressive media playback, in other words the ability to play a simple media file.
This was enabled by the audio and video tags in HTML and the underlying HTMLMediaElement interface.
Second scenario is support for adaptive media playback.
Here, the goal is to adjust video parameters (typically, the resolution) in real time to the user's current environment.
This scenario was enabled by Media Source Extensions or MSE for short.
It brought professional and commercial media content to the Web.
MSE is, by far, the main mechanism used to stream on-demand media content on the Web today.
Third media scenario is different.
It is about real-time audio/video conversations.
This was enabled by WebRTC.
And WebRTC is also used to stream media in scenarios that require latency to be minimal, for instance cloud gaming.
Fourth media scenario is also completely different.
It is about synthesized audio playback: to generate music, or sound effects in games for instance.
This was enabled by the Web Audio API.
Media content
- ① Progressive media playback
- Media container file (e.g. MP4, OGG, WebM)
- Multiplex of encoded audio/video tracks (e.g. H.264, AV1, AAC)
- ② Adaptive media playback
- Media stream (e.g. ISOBMFF, MPEG2-TS, WebM)
- Data segments assembled on the fly
- ③ Real-time conversations
- Individual encoded/decoded media stream tracks
- Coupling between encoding/decoding and transport
- ④ Synthesized audio playback
- Short encoded/decoded audio samples
The actual media content depends on the scenario.
We're talking about files for progressive media playback, streams for adaptive media playback, individual media stream tracks for WebRTC and short audio samples for Web Audio.
Media encoding/decoding pipeline
In a typical media pipeline, media content gets recorded from a camera or microphone, then encoded to save memory and bandwidth.
Then different media tracks are muxed together and the result is sent to the network.
The decoding side is roughly symmetric to the encoding side.
Media content needs to be fetched, demuxed, decoded to produce raw audio and video frames, and rendered to the final display or speakers.
The mux and demux operations are only needed in progressive and adaptive media playback scenarios.
WebRTC deals with individual media stream tracks directly, so there are no mux and demux operations per se there.
In the interest of time, I will mostly focus on the decoding side, but some considerations apply to the encoding side as well.
Media processing scenarios
- Media stream analysis
- Barcode reading
- Face recognition
- Gesture/Presence tracking
- Emotion analysis
- Speech recognition
- Depth data streams processing
- Media stream manipulation
- Funny hats
- Voice effects
- Background removal or blurring
- In-browser composition
- Augmented reality
- Non-linear video edition
So why would people want to process media streams?
Well, they may want to analyze frames to detect objects, faces, or gestures.
Or they may want to modify the streams in real time: to add overlays (funny hats, for instance), to remove the background, or simply because the client application is a media authoring tool.
Processing hooks needed!
In most cases, these scenarios need to process individual decoded audio or video frames.
So, if we look back at our typical media pipeline, ideally, we'd like to hook processing between the record and the encode operations on the encoding side, and between the decode and the render operations on the decoding side.
Existing hooks for…
① progressive media playback
No hooks for progressive media playback…
OK, let's get back to our four media scenarios.
For progressive media playback, the HTMLMediaElement takes the URL of a media file and renders the result.
The browser does all the work.
And that's too bad, because that means there is no way to hook any processing step there!
Existing hooks for…
① progressive media playback
No hooks for progressive media playback…
… but you can create one for video frames with <canvas>.
But you can cheat!
You create your own processing hook.
For video, you can repeatedly copy rendered frames to a canvas element, process the pixels extracted from the canvas, and render the result back to the canvas.
The solution is not fantastic because it is not very efficient, but at least it exists.
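To make that workaround concrete, here is a minimal sketch; the element selectors and the grayscale step are just placeholders for whatever processing the application actually wants to do:

```js
// Minimal sketch of the <canvas> workaround for video processing.
const video = document.querySelector('video');
const canvas = document.querySelector('canvas');
const ctx = canvas.getContext('2d');

function processFrame() {
  // Copy the currently rendered video frame into the canvas
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
  // Extract raw pixels (this is where the copies pile up)
  const frame = ctx.getImageData(0, 0, canvas.width, canvas.height);
  // Placeholder processing step: naive grayscale conversion
  for (let i = 0; i < frame.data.length; i += 4) {
    const gray = (frame.data[i] + frame.data[i + 1] + frame.data[i + 2]) / 3;
    frame.data[i] = frame.data[i + 1] = frame.data[i + 2] = gray;
  }
  // Render the processed pixels back to the canvas
  ctx.putImageData(frame, 0, 0);
  requestAnimationFrame(processFrame);
}
video.addEventListener('play', () => requestAnimationFrame(processFrame));
```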
Existing hooks for…
② adaptive media playback
No hooks for adaptive media playback…
… but you can also use the <canvas> workaround.
MSE is essentially a demuxing API.
In other words, it allows applications to take control of the demux operation and, by extension, of the fetch operation.
MSE does not expose decoded frames, so no processing hook either but the same canvas “hack” can be used.
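For illustration, here is a minimal MSE sketch; the segment URLs and MIME type are placeholders:

```js
// Minimal sketch of MSE: the application fetches segments itself and
// hands the bytes to the browser; decoded frames never surface.
const video = document.querySelector('video');
const mediaSource = new MediaSource();
video.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.42E01E"');
  for (const url of ['/video/init.mp4', '/video/segment-001.m4s']) {
    const bytes = await (await fetch(url)).arrayBuffer();
    sb.appendBuffer(bytes);
    // Wait until the SourceBuffer has processed the appended bytes
    await new Promise((resolve) =>
      sb.addEventListener('updateend', resolve, { once: true }));
  }
});
```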
Existing hooks for…
③ real-time conversations
No hooks for WebRTC either…
… same <canvas> workaround possible.
For WebRTC, first remember that there is no demux operation.
WebRTC takes care of the fetch and decode operations.
In theory, that's fantastic news.
We should get decoded frames out of WebRTC.
However, in practice, a “decoded” stream in WebRTC is represented as an abstract and opaque MediaStreamTrack object and the actual contents of the decoded audio and video frames are not exposed to applications.
So, back to the same canvas hack again...
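Here is a rough sketch of that hack applied to WebRTC; peerConnection is assumed to exist already, and the actual pixel processing is left out:

```js
// Render the incoming MediaStreamTrack in a hidden <video>, copy frames
// to a <canvas>, and turn the processed canvas back into a track.
const hiddenVideo = document.createElement('video');
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');

peerConnection.ontrack = (event) => {
  hiddenVideo.srcObject = new MediaStream([event.track]);
  hiddenVideo.play();
};

function processFrame() {
  ctx.drawImage(hiddenVideo, 0, 0, canvas.width, canvas.height);
  // ... process pixels here, e.g. draw a funny hat overlay ...
  requestAnimationFrame(processFrame);
}
requestAnimationFrame(processFrame);

// captureStream() exposes the processed result as a new MediaStreamTrack
const processedTrack = canvas.captureStream(25).getVideoTracks()[0];
```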
Existing hooks for…
④ synthesized audio playback
- The Web Audio API is a processing API at its heart
- Custom processing code can be injected through audio worklets
- No support for streaming and no indication of progress
- Good fit for short audio samples, not long streams of audio
The Web Audio API is a processing API at its heart.
However, the whole API is meant to operate on an entire file.
There is no support for streaming and no indication of progress when processing a large file.
This works well for short audio samples, but is really not a good fit for streaming scenarios.
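As an illustration, here is a minimal audio worklet sketch; the pass-through processor is a placeholder for actual processing code, and note how the whole sample gets decoded before playback starts:

```js
// processor.js — custom processing code injected through an audio worklet
class PassThroughProcessor extends AudioWorkletProcessor {
  process(inputs, outputs) {
    const input = inputs[0];
    const output = outputs[0];
    for (let channel = 0; channel < input.length; channel++) {
      output[channel].set(input[channel]); // replace with real processing
    }
    return true; // keep the processor alive
  }
}
registerProcessor('pass-through', PassThroughProcessor);

// main.js — hook the worklet into a Web Audio graph
async function playProcessed(encodedBytes) {
  const audioContext = new AudioContext();
  await audioContext.audioWorklet.addModule('processor.js');
  const processingNode = new AudioWorkletNode(audioContext, 'pass-through');
  // The entire sample is decoded at once: no streaming, no progress
  const audioBuffer = await audioContext.decodeAudioData(encodedBytes);
  const source = new AudioBufferSourceNode(audioContext, { buffer: audioBuffer });
  source.connect(processingNode).connect(audioContext.destination);
  source.start();
}
```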
Existing hooks…
summary
- Dedicated API for audio but not geared towards streaming scenarios
- No existing processing hook at the right step otherwise
- Rendered video frames may be copied to a <canvas> and processed (but that is not very efficient ☹)
In other words, today, the only way to process audio is through the Web Audio API; and that is only really useful for short audio samples.
And the only way to process video is by capturing rendered frames onto a canvas, which is not very efficient.
Media pipeline in JavaScript / WebAssembly
- Applications may also implement the entire media pipeline in JS/WASM
- Obviously allows processing hooks to be added wherever needed
- Increased bandwidth, bad performance, reduced power efficiency ☹
I lied.
There is another way!
A Web application may also choose to implement the entire media pipeline on its own, using JavaScript or WebAssembly, rendering the result to a canvas.
This is suboptimal, especially on constrained devices, but know that some media authoring applications actually do that.
Main requirements for efficient media processing
Raw data processing needs:
1 video frame in an HD video ≈ 1920x1080 x 3 components x 4 bytes ≈ 24MB
1 second of HD video ≈ 25 * 24MB ≈ 600MB
- Avoid copies in/across memory (e.g. CPU, GPU, Worker, WebAssembly)
- Expose data in as raw a form as possible (e.g. YUV or ARGB for decoded video)
- Leverage processing power (e.g. workers, WebAssembly, GPU, ML)
- Make it difficult to create inefficient processing pipelines
- Control over processing speed (real-time or faster than real-time)
- Stream processing (as opposed to processing of entire files)
- Processing should not de-synchronize media streams
If existing solutions are not efficient enough, what would make media processing efficient?
It essentially boils down to sizes.
To give you a rough idea, a single video frame in a full HD video weighs about 24MB, so processing a full HD video in real time means processing about 600MB of raw data per second.
Any efficient solution should avoid having to manipulate or copy bytes around, while allowing applications to leverage the CPU, the GPU, WebAssembly and machine learning hardware, while also giving some knobs to control the processing speed and, last but not least, while keeping related tracks synchronized!
That's a lot of requirements and the list is not even exhaustive.
That typically explains why decoded media has remained opaque until now.
It is hard to expose decoded media on the Web!
WebCodecs
- Efficient access to built-in (software and hardware) media encoders/decoders
- Exposes needed media processing hooks
It is hard, but it is not impossible.
The WebCodecs API, initially proposed by Google, aims at providing access to built-in media encoders and decoders.
As a by-product, the API exposes decoded frames, allowing applications to hook into the output of the API to process these frames.
Note that the API does not take care of the mux/demux operations, which will have to be done by the application through some JavaScript or WebAssembly code.
This shouldn't be a problem in practice, as these operations are not really heavy on the CPU.
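Since the API is still being shaped, the following is only a hypothetical sketch of what decoding with WebCodecs could look like; processDecodedFrame and demuxedChunks stand for application code, and names may differ in the final spec:

```js
// Hypothetical sketch of WebCodecs-based decoding
const decoder = new VideoDecoder({
  // The decoder hands decoded frames back to the application:
  // this is the processing hook we were looking for.
  output: (frame) => {
    processDecodedFrame(frame); // application-defined processing
    frame.close();              // release the frame's memory promptly
  },
  error: (e) => console.error(e),
});
decoder.configure({ codec: 'vp8' });

// The application demuxes the container itself (in JS or WebAssembly)
// and feeds encoded chunks to the decoder.
for (const { data, timestamp, isKeyFrame } of demuxedChunks) {
  decoder.decode(new EncodedVideoChunk({
    type: isKeyFrame ? 'key' : 'delta',
    timestamp,
    data,
  }));
}
```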
WebCodecs — status
- Interfaces being shaped as this talk is recorded
- Lots of open questions, e.g. around integration with other specs such as WebAssembly, MSE, WebRTC, WebXR and machine learning APIs
- Not based on WHATWG Streams
- Incubation in WICG, potential deliverable for Media WG
The API is very much in flux, so it's hard to assess anything on WebCodecs for now.
There are a number of open questions, starting with “how can WebCodecs be integrated with other specs?”: WebAssembly, MSE, WebRTC, or WebXR for rendering video in immersive contexts.
Well, and of course, the integration question also exists for machine learning APIs.
It's worth noting that the WebRTC Working Group has started to work on WebRTC Insertable Media using Streams, which should eventually build on top of WebCodecs.
By the way, it seems worth noting that WebCodecs is not based on WHATWG Streams.
That was the initial goal, but it turned out to be very difficult.
So having a way to process WHATWG Streams does not necessarily mean that it will be de facto possible to process media streams.
Incubation takes place in the Web Platform Incubator Community Group.
The Media Working Group may adopt and standardize the proposal once it's ready.
Other media features and specs that may impact processing
- Codec switching in MSE
- HTMLVideoElement.requestVideoFrameCallback()
- Immersive videos: 360°, volumetric, 3D
- Media Timed Events
- Encrypted media
See the Overview of Media Technologies for the Web document for details
Before I conclude, I'd like to mention a few other media technologies and proposals that may or may not affect the way processing gets done in the future.
For instance, the Media Working Group is standardizing a codec switching feature for MSE, e.g. to allow a low-resolution ad to be inserted in the middle of a 4K video.
How would such a switch affect processing?
In some “processing” scenarios, all that may be required is rendering an overlay on top of a video without touching the video per se.
That is precisely what the HTMLVideoElement.requestVideoFrameCallback proposal is offering.
How does that affect everything that I presented so far?
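For illustration, here is a minimal sketch of that proposal; drawOverlay stands for whatever the application wants to draw on a canvas positioned on top of the video:

```js
// Run a callback for each presented video frame, without touching the video
const video = document.querySelector('video');
const overlay = document.querySelector('canvas');
const ctx = overlay.getContext('2d');

function onFrame(now, metadata) {
  ctx.clearRect(0, 0, overlay.width, overlay.height);
  drawOverlay(ctx, metadata.mediaTime); // application-defined overlay
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```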
Flat videos are no longer the only videos in town.
Other types of videos get rendered for VR/AR and more immersive experiences.
This includes 360 videos, volumetric videos, and more generally 3D playback experiences.
How does processing apply to these cases?
Also, I have only focused on audio and video tracks, but media content also contains event tracks that are increasingly used to control the playback experience.
Media industries are putting a lot of effort these days into standardizing solutions to expose media timed events to applications.
The generic idea is to rely on the user agent to do the work.
If WebCodecs requires applications to demux media content themselves, then what happens to event tracks?
One final note on encrypted media.
In general, the goal there is to hide the bytes from the applications, so encrypted media is typically not a good candidate for application-controlled processing operations.
There may still be scenarios where the ability to process encrypted media in a separate sandbox could be useful.
So, something that could be worth keeping in mind.
All these features and others are mentioned in the Overview of Media Technologies for the Web, which I invite you to take a peek at.
Conclusion
- Exposing decoded frames to applications is easier said than done
- No satisfactory solution for media processing today
- WebCodecs to the rescue!
- Coordination needed for integrating WebCodecs in other technologies
Thank you!
So, what I have tried to convey here is that exposing decoded media frames to applications is doable but not easy, or not as easy as it might sound.
There is no satisfactory solution today.
WebCodecs seems very promising but there are lots of open questions that still need to be discussed.
In order to be successful, the work on WebCodecs would greatly benefit from coordination and engagement from people involved in other technologies.
And that includes people looking at exposing machine learning capabilities to Web applications!
OK, I'm done, thank you, bye!