W3C Workshop on Web and Machine Learning

Machine Learning and Web Media - by Bernard Aboba (Microsoft)

I’m Bernard Aboba, presenting on behalf of the Chairs of the WebRTC Working Group: myself, Harald Alvestrand and Jan-Ivar Bruaroey on the topic of Machine Learning and Web Media.

The Pandemic of 2020 has been a pivotal moment for the world, with many institutions experiencing unprecedented stress.

Amidst all the tragedy, we have seen an unparalleled level of user-driven innovation as consumers and businesses struggle to survive and perhaps even thrive, leading to a decade’s worth of innovation in only a few months, in areas as diverse as politics, art, entertainment and sports.

Yogi Berra said, “You can observe a lot just by watching.”

What have you observed?

Here are some of the things in my scrapbook.

Live theatre has been particularly hard hit during the pandemic.

Rather than cancelling their 2020-2021 season, Tacoma Little Theatre has chosen to move it online.

Their first online production, “Robin Hood”, used custom backgrounds for scenery, combining conferencing (for the live production) with YouTube for archiving and subsequent streaming.

Typically, custom backgrounds are implemented using machine learning algorithms that operate locally on captured video, extracting human forms, which are then overlaid on the selected background before the result is encoded for transmission.
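To make that concrete, here is a rough sketch of such a local compositing step; segmentPerson is a hypothetical stand-in for whatever segmentation model the application runs, and the canvases and sizes are illustrative only.

    // Sketch: composite a segmented person onto a custom background.
    // segmentPerson is a hypothetical helper returning an alpha mask;
    // the actual model (e.g. a body-segmentation network) is app-specific.
    declare function segmentPerson(frame: ImageBitmap): Promise<ImageBitmap>;

    async function replaceBackground(
      camera: ImageBitmap,
      background: ImageBitmap,
      out: OffscreenCanvas,
    ): Promise<void> {
      const ctx = out.getContext("2d")!;
      const mask = await segmentPerson(camera);

      // Cut the person out of the camera frame using the mask.
      const person = new OffscreenCanvas(out.width, out.height);
      const pctx = person.getContext("2d")!;
      pctx.drawImage(camera, 0, 0, out.width, out.height);
      pctx.globalCompositeOperation = "destination-in";
      pctx.drawImage(mask, 0, 0, out.width, out.height);

      // Draw the chosen scenery, then overlay the extracted person on top.
      ctx.drawImage(background, 0, 0, out.width, out.height);
      ctx.drawImage(person, 0, 0);
    }

The composited result is then handed to the encoder exactly as any other captured frame would be.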

During the pandemic, sporting matches have had to be conducted in the absence of fans.

To bring fans back into the game, the NBA collaborated with Microsoft to introduce “Together Mode”, in which video from individual fans is combined so that they appear to be watching the game courtside.

As with the previous example, this one involves machine learning algorithms operating locally on captured video prior to encoding and transmission to a server, which produces a composite video that includes both the fans and the game.

Machine learning algorithms in these use cases operate on captured media prior to encoding or transmission.

For audio, machine learning can be used for noise suppression, and for video it might provide for background removal, “together mode”, or “funny hats”.

Processing may also occur on a centralized server acting as a receiver, such as production of a composite video.

On the local system, one proposal for obtaining access to raw video is to add a method on a MediaStreamTrack.
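To illustrate, here is a sketch using the MediaStreamTrackProcessor and MediaStreamTrackGenerator shape that Chromium has experimented with; the method-on-track variant discussed in the Working Group would look very similar, and runModelOnFrame is a hypothetical stand-in for whatever model the application applies.

    // Sketch: pull raw VideoFrames from a camera track, run a per-frame
    // processing step, and expose the result as a new track.
    // MediaStreamTrackProcessor / MediaStreamTrackGenerator are one proposed
    // shape of this API, not a settled standard.
    declare function runModelOnFrame(frame: VideoFrame): Promise<VideoFrame>; // hypothetical ML step

    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const [track] = stream.getVideoTracks();

    const processor = new MediaStreamTrackProcessor({ track });
    const generator = new MediaStreamTrackGenerator({ kind: "video" });

    const transform = new TransformStream<VideoFrame, VideoFrame>({
      async transform(frame, controller) {
        const processed = await runModelOnFrame(frame);
        frame.close(); // release the raw frame promptly; raw video is large
        controller.enqueue(processed);
      },
    });

    processor.readable.pipeThrough(transform).pipeTo(generator.writable);
    // generator is itself a MediaStreamTrack and can be attached to a
    // MediaStream, an RTCPeerConnection, or a video element.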

APIs that provide access to encoded media such as Insertable Streams and WebCodecs operate at a different place in the pipeline, after encoding (on the sender) or prior to decode (on the receiver).
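For contrast, a minimal sketch of an encoded-media transform is shown below, written against the early Chromium shape of Insertable Streams (createEncodedStreams plus an encodedInsertableStreams configuration flag); the standardized surface differs in detail, and track and stream are assumed to come from getUserMedia as above.

    // Sketch: a TransformStream over *encoded* frames on the sender side.
    const pc = new RTCPeerConnection({ encodedInsertableStreams: true } as any); // Chromium-specific flag
    const sender = pc.addTrack(track, stream);
    const { readable, writable } = (sender as any).createEncodedStreams();

    readable
      .pipeThrough(
        new TransformStream({
          transform(encodedFrame: any, controller: TransformStreamDefaultController) {
            // encodedFrame.data holds the encoded payload, which is far
            // smaller than a raw frame, so per-frame work here is cheap
            // compared with processing raw media.
            controller.enqueue(encodedFrame);
          },
        }),
      )
      .pipeTo(writable);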

Since raw video is considerably larger than encoded video, and the processing involved can be quite different, the performance of a TransformStream operating on encoded media within Insertable Streams may not be sufficient for machine learning algorithms operating on raw media.

Performance across the pipeline is an important requirement: capture, application of machine learning models, encoding and transmission.

It is desirable for each pipeline stage to operate on buffers provided by the previous stage, avoiding memory copies.

For example, the capture device might provide buffers (perhaps GPU buffers) directly to the machine learning algorithm without extraneous copies.

After processing by machine learning algorithms, WebCodecs would encode and WebTransport would perform network I/O, also ideally without extraneous copies.
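As a rough sketch of the tail of that pipeline, processed frames could be fed to a WebCodecs VideoEncoder whose output chunks are written to a WebTransport connection; the URL and the naive one-chunk-per-datagram packetization below are placeholders, not part of either specification.

    // Sketch: encode processed VideoFrames with WebCodecs and send the
    // encoded chunks over WebTransport.
    const transport = new WebTransport("https://example.com/media"); // placeholder URL
    await transport.ready;
    const writer = transport.datagrams.writable.getWriter();

    const encoder = new VideoEncoder({
      output: (chunk) => {
        const payload = new Uint8Array(chunk.byteLength);
        chunk.copyTo(payload);
        writer.write(payload); // real code would fragment, sequence and pace
      },
      error: (e) => console.error("encode error", e),
    });

    encoder.configure({ codec: "vp8", width: 1280, height: 720, bitrate: 1_000_000 });

    // Called for each frame coming out of the machine learning stage.
    function onProcessedFrame(frame: VideoFrame) {
      encoder.encode(frame); // the encoder reads the frame's buffer directly
      frame.close();
    }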

The Insertable Streams and WebCodecs proposals share data structures such as the representation of audio and video frames.

So even though WebCodecs does not use the WHATWG Streams API used by Insertable Streams and WebTransport, WebCodecs is still leveraging the Insertable Streams implementation experience.

A few words about AV1 support in realtime communications.

AV1 not only offers improved compression efficiency; it is also the basis for a new image format (AVIF) that is likely to be widely adopted in browsers.

The higher encoding complexity of AV1 makes performance considerations particularly important, but there are proposals to enable software encoders to be practical before hardware acceleration becomes widely available.

These include capability advertisement, which allows AV1 to be used for decode only; mixed-codec simulcast, which allows AV1 to encode low-bitrate video while other codecs are used for higher-bitrate encodings; content hints, which allow AV1 to be used for screen content coding (often at low frame rates); and scalable video coding.
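A hedged sketch of how some of these knobs surface in the WebRTC API follows; screenTrack is assumed to be an existing screen-capture MediaStreamTrack, and mixed-codec simulcast (a per-encoding codec choice) was still a proposal at the time and is not shown.

    // Sketch: content hints, SVC and codec preferences for AV1 in WebRTC.
    const pc = new RTCPeerConnection();

    // Content hint: mark screen content so screen-content coding tools apply,
    // typically at low frame rates.
    screenTrack.contentHint = "detail";

    // Scalable video coding: request an SVC mode on the outgoing encoding.
    const transceiver = pc.addTransceiver(screenTrack, {
      sendEncodings: [{ scalabilityMode: "L1T3" }],
    });

    // Capability advertisement: put AV1 first in the codec preferences, e.g.
    // so a capable peer can send AV1 even if this endpoint only decodes it.
    const codecs = RTCRtpReceiver.getCapabilities("video")?.codecs ?? [];
    transceiver.setCodecPreferences([
      ...codecs.filter((c) => c.mimeType === "video/AV1"),
      ...codecs.filter((c) => c.mimeType !== "video/AV1"),
    ]);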

Thank you for listening to this presentation.

In addition to the participants and Chairs of the WebRTC Working Group, we would like to thank Dom, who provided feedback on the presentation and helped put it together.

