This document provides an analysis of the Media & Entertainment industry in relation to the Web, and identifies shorter- and longer-term trends that could influence Web technologies.

Introduction

On the Web, HTML5 [[HTML]], Media Source Extensions [[media-source]], Encrypted Media Extensions [[encrypted-media]], and the captioning languages WebVTT [[WEBVTT]] and Timed Text Markup Language [[ttml1]] provide the building blocks that enable continuous media scenarios. These technologies, deployed in mainstream Web browsers today, are used daily to stream content across the globe. Other technologies have been or are being developed to enable other types of continuous experiences such as games.

The goal of this document is to review on-going trends in the Media & Entertainment sector, with a view to identifying technology needs that could lead to the standardization of new Web technologies, and/or to revisions of existing ones.

Scope

The scope of this document is Media & Entertainment, defined here as the part of the industry whose business is to entertain people through continuous experiences.

Add examples of continuous experiences, and examples of experiences that are out of scope of this document.

The term continuous experiences refers to applications that engage the user in some form of consumption/interaction rhythm. In other words, continuous experiences feature immersive content that captures the user's attention and that has an internal timeline. Experiences based on continuous media are obvious examples of continuous experiences, but note that the definition goes further and also encompasses interactive content where the user's behavior influences the progression along the internal timeline, such as Virtual Reality (VR) experiences and games.

For a given piece of content, the term timeline describes a mapping from time to positions in that content. When content is continuous media, the timeline is the media timeline. This document considers other types of content, where the notion of position may be more abstract. For instance, the notion of position in games may be the cumulative history of moves and events that happened since the game started. Content is said to have an internal timeline if it has a timeline that is a continuous function of time, meaning that progress along the timeline is not purely triggered by user interaction but also advances on its own in the absence of it.

The term continuous media is used as defined in the W3C Media & Entertainment Interest Group Charter to mean videos, sound recordings, and their associated technologies such as timed text.

An overview of Media & Entertainment

There are many ways to approach Media & Entertainment. This section looks at the media pipeline, and at categories of content coupled with content consumption mechanisms. A quick look at money flows completes this overview.

The media pipeline

The media pipeline is defined as all the steps needed for a user to experience media content. These steps consist of: pre-production, production, post-production, distribution, and rendering. Each step can be further divided into sub-steps and concepts. Some of them are listed below. This document does not attempt to set precise boundaries and definitions for each of them:

  1. pre-production: screenplay, storyboard, scheduling, etc.
  2. production: media capture, production studio, etc.
  3. post-production: artistic processing to produce a mezzanine file. Includes audio/video mixing, captioning, telecine, special effects, etc.
  4. distribution: transcoding of the mezzanine file contents, audio/video/subtitles association, content protection, etc. to produce container format variants for different distributors, localities and distribution methods; actual distribution through broadcasting or streaming, ad-insertion, etc.
  5. rendering: theater, video player, XR headset, digital signage, etc.

The process of creating media assets, also known as content production, consists of the first three steps of the media pipeline (pre-production, production and post-production).

Media content

Content produced by the Media & Entertainment industry can be divided into three main categories:

These three main categories of content roughly map to three main content consumption mechanisms:

This document will argue that there is a global trend towards convergence of these three main content consumption mechanisms.

Business models

Is describing main money flows a good way to present the industry? Should the document rather (or also) present the different actors (content producers, broadcasters, CDNs, solution providers, device manufacturers, etc.) and e.g. present historical facts that explain regional aspects (public services, technology regulations, etc.)?

Money-wise, content is king. From a high-level perspective, most of the money spent in Media & Entertainment goes to content production. A single movie may cost hundreds of millions of dollars to produce. Tens of millions of dollars can be spent on video games, and many other media productions cost millions of dollars. Most of these costs are incurred for production and post-production.

To a lesser extent, the industry also invests a good amount of money in distribution mechanisms, notably in the deployment of relevant network infrastructures such as Content Delivery Networks (CDN), and in technologies that make it possible to encode high-quality content with a minimal footprint.

Money gets made through:

Media & Entertainment on the Web

In the past few years, investments on the Web have focused on enabling distribution and rendering of media content within Web applications. This explains the creation of the HTMLMediaElement interface in HTML as well as the work on Media Source Extensions (MSE) [[media-source]] to enable adaptive streaming, on Encrypted Media Extensions (EME) [[encrypted-media]] to protect media assets that might have cost millions of dollars to produce, and on captioning languages Timed Text Markup Language (TTML) [[ttml1]] and WebVTT [[WEBVTT]]. With these technologies, the Web has become a major platform for the distribution and consumption of media content.
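
By way of illustration, the sketch below shows the basic MSE pattern that underpins adaptive streaming players: the application fetches media segments itself and feeds them to a SourceBuffer attached to a video element. The codec string and segment URLs are placeholders, not real resources.

  // Minimal MSE playback sketch; segment URLs and codec string are placeholders.
  const video = document.querySelector('video') as HTMLVideoElement;
  const mimeCodec = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';

  if ('MediaSource' in window && MediaSource.isTypeSupported(mimeCodec)) {
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', async () => {
      const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
      // Fetch an initialization segment followed by media segments,
      // as a DASH player built on MSE typically would.
      for (const url of ['init.mp4', 'seg-1.m4s', 'seg-2.m4s']) {
        const data = await (await fetch(url)).arrayBuffer();
        await appendSegment(sourceBuffer, data);
      }
      mediaSource.endOfStream();
    });
  }

  // appendBuffer() is asynchronous; wait for 'updateend' before appending more.
  function appendSegment(sourceBuffer: SourceBuffer, data: ArrayBuffer): Promise<void> {
    return new Promise((resolve) => {
      sourceBuffer.addEventListener('updateend', () => resolve(), { once: true });
      sourceBuffer.appendBuffer(data);
    });
  }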

Focus has been on on-demand viewing until now, partly because of the rise of SVOD, which appeals to Web users accustomed to selecting the content they consume, partly because on-demand distribution is easier to address technology-wise than live distribution, and partly because of legacy: the broadcasting industry predates the Web, and adopting new technologies or switching to other business models takes time when an established business is already in place.

Media companies have also been looking at second screen scenarios. This matches W3C's second screen activity to develop the Presentation API [[presentation-api]] and Remote Playback API [[remote-playback]] specifications. The key here is to agree on an open stack of protocols to discover and control second screens. This is what the Second Screen Community Group is working on. The main driving use case for this work is the ability to use one's smartphone to control and stream content to a large screen, possibly through an HDMI dongle. Other second screen scenarios, e.g. that start from broadcast content, are being investigated too.
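
The following sketch illustrates the two approaches, assuming user-triggered buttons and a placeholder receiver page: the Presentation API launches and messages a page rendered on the second screen, while the Remote Playback API renders an existing video element on a remote device.

  // Second-screen sketch, triggered from user gestures.
  // The receiver page URL and content identifier are placeholders.
  const castButton = document.querySelector('#cast') as HTMLButtonElement;
  const remoteButton = document.querySelector('#remote') as HTMLButtonElement;
  const video = document.querySelector('video') as HTMLVideoElement;

  castButton.addEventListener('click', async () => {
    // Presentation API: launch a receiver page on a second screen and
    // exchange messages with it over the resulting connection.
    const request = new PresentationRequest(['player.html']);
    const connection = await request.start();  // prompts the user to pick a display
    connection.send(JSON.stringify({ command: 'play', contentId: 'movie-42' }));
  });

  remoteButton.addEventListener('click', async () => {
    // Remote Playback API: render the existing <video> element on a remote device.
    video.remote.addEventListener('connect', () => { video.play(); });
    await video.remote.prompt();               // prompts the user to pick a device
  });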

The performance of Web applications has also vastly improved over the years, allowing the creation of rich, complex and interactive applications. JavaScript runtimes have become much more efficient and Web Workers [[WebWorkers]] allow background execution. That said, the computation power available to Web applications remains constrained compared to native applications, with processing restricted to the CPU, and inter-thread communication mostly restricted to message-passing.
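
For instance, a minimal sketch of the message-passing model, with a hypothetical analysis task and worker script standing in for real processing:

  // Offload a CPU-heavy task to a Web Worker so the main thread stays responsive.
  // 'analysis-worker.js' and the message shape are hypothetical.
  const framesBuffer = new ArrayBuffer(1024 * 1024);   // stand-in for decoded frame data
  const worker = new Worker('analysis-worker.js');

  // Transfer the buffer instead of copying it; ownership moves to the worker.
  worker.postMessage({ task: 'analyse', frames: framesBuffer }, [framesBuffer]);

  // Inter-thread communication is limited to message passing.
  worker.onmessage = (event: MessageEvent) => {
    console.log('Result from worker:', event.data);
  };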

The pre-production, production and post-production phases of the media pipeline, which happen before distribution and rendering, have not been a priority for media companies at W3C so far. This is by no means surprising as W3C focuses on user agent technologies. Notable exceptions are TTML, which can be used, and is used, as an interchange format, and the work on peer-to-peer technologies [[WebRTC]] which, by definition, spans most steps of the media pipeline, from capture to rendering.

Current trends

The Media & Entertainment industry seems to be pursuing a number of trends in parallel. Some of the trends below are more advanced and explicit than others.

Any missing trend? Any trend that should be dropped? Are trend descriptions clear enough?

Reduce device fragmentation

The Media & Entertainment industry has embraced Web technologies as a way to deliver interactive content and user interfaces along with continuous media. All IP-connected and media-focused Consumer Electronics (CE) devices such as TV displays and set-top boxes now embed Web browsers.

One problem is that the Web has now moved to an evergreen model [[EVERGREEN-WEB]]: technologies evolve daily and new versions of Web browsers are deployed every few weeks to regular computing devices such as laptops, tablets and smartphones.

However, the Media & Entertainment industry needs more stability:

The fragmentation that exists across devices currently impedes the generalization of scenarios that mix continuous media and interactive content. The Media & Entertainment industry has invested in various efforts over the years to reduce that fragmentation and define interactive TV systems (e.g. ATSC, HbbTV, Hybridcast). Work in the Web Media API Community Group on a Web Media API specification [[WEBMEDIAAPI]], in collaboration with the CTA WAVE Project, is on-going to define a baseline of Web technologies supported across all CE devices, along with a test suite based on Web Platform Tests that can be used to certify compliant products.

Improve content quality

Encoding, processing, decoding, rendering, memory, storage, and network capabilities are all on the rise. They allow content providers to produce and distribute higher quality content, using techniques that extend the color space and improve content resolution, e.g. High Dynamic Range (HDR), wide color gamut, Ultra-High Definition (UHD) and 4K, to offer a theater-like experience. That trend affects both audio and video.

While some of these changes happen at lower levels than the application level, some warrant changes within existing Web technologies. Typically, support for a wider color space requires updates to CSS and canvas technologies to encode new colors, as well as new features to describe mapping levels when an application blends content e.g. media content in HDR with an interface that uses RGB.
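
For illustration, the sketch below shows one way an application might target a wider color space in canvas and CSS; note that the colorSpace canvas option and the CSS color() function are relatively recent additions whose availability should be feature-detected.

  // Request a wider color space for 2D canvas rendering and express a
  // Display P3 color in CSS. A fully saturated P3 red lies outside the
  // sRGB gamut that Web content has traditionally assumed.
  const canvas = document.querySelector('canvas') as HTMLCanvasElement;
  const ctx = canvas.getContext('2d', { colorSpace: 'display-p3' });
  if (ctx) {
    ctx.fillStyle = 'color(display-p3 1 0 0)';
    ctx.fillRect(0, 0, canvas.width, canvas.height);
  }

  // The same color used for a UI overlay composited on top of the video.
  const overlay = document.querySelector('.overlay') as HTMLElement;
  overlay.style.color = 'color(display-p3 1 0 0)';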

Move to IP

The generic move to IP, which began a few years ago, is still ongoing. It started with distribution but progressively affects production and post-production as well.

Distribution over IP

Unicast distribution over IP is progressively replacing traditional broadcast technologies. This trend is further accentuated by the fact that the bandwidth available for broadcasting is fading away as underlying radio frequencies get re-allocated to 4G and 5G networks, while at the same time needs for bandwidth increase with improvements to the quality of content (e.g. UHD, HDR).

CDNs have become an essential component of distribution as a result, with various past or on-going efforts on technologies that enable efficient distribution and storage of media content close to the user, and streaming of media content over fluctuating networks. This includes work on common formats such as the Common Media Application Format [[MPEGCMAF]] and Common Encryption [[MPEGCENC]], as well as renewed work on royalty-free video codecs [[AV1]]. It also includes now widely used mechanisms to stream content efficiently over HTTP, e.g. Dynamic Adaptive Streaming over HTTP [[MPEGDASH]] and Media Source Extensions [[MSE]].
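
The adaptation logic itself lives in the application. The sketch below illustrates the kind of simple bitrate-selection heuristic a DASH player built on MSE might use; the representation list and throughput values are invented for the example.

  // Pick the highest-bitrate representation that fits within the measured
  // throughput, with a safety margin. Values are examples only.
  interface Representation {
    bitrate: number;  // bits per second
    url: string;
  }

  function selectRepresentation(
    representations: Representation[],
    measuredThroughput: number,   // bits per second, from previous segment downloads
    safetyFactor = 0.8
  ): Representation {
    const budget = measuredThroughput * safetyFactor;
    const sorted = [...representations].sort((a, b) => a.bitrate - b.bitrate);
    let best = sorted[0];
    for (const r of sorted) {
      if (r.bitrate <= budget) {
        best = r;
      }
    }
    return best;
  }

  // Example: with 4.2 Mb/s measured throughput, the 3 Mb/s representation wins.
  const chosen = selectRepresentation(
    [{ bitrate: 1_000_000, url: 'low.m4s' },
     { bitrate: 3_000_000, url: 'mid.m4s' },
     { bitrate: 6_000_000, url: 'high.m4s' }],
    4_200_000
  );
  console.log(chosen.url);  // "mid.m4s"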

The Media & Entertainment industry is exploring other distribution mechanisms, including the use of peer-to-peer technologies to distribute media content or the use of multicast, with a view to further reducing bandwidth needs and distribution costs, as well as to address scaling issues and reduce the overall latency when distributing live content to millions of users at once.

Most of these activities target lower levels than the application level, and are therefore usually out of scope for W3C. However, most of them need to surface one way or another to Web applications so that applications can implement algorithms based on them. The Media Source Extensions [[MSE]] specification is an obvious example. As such, new requirements that may warrant work on new Web technologies or on new features for existing ones are likely to arise.

Production over IP

On top of distribution, video over IP is also being considered as a replacement for the Serial Digital Interface (SDI). This move makes it possible to envision more flexible production methods. For instance, the ability to produce content remotely without having to send an Outside Broadcast truck (OB van) can significantly reduce the cost of producing live events. One of the main challenges in that area is the need for high-bandwidth connections between the production studio and the venue, so that high-quality audio/video streams can be transmitted, but the democratization of optical fiber means that this is no longer a blocker.

This triggers work on software interfaces (typically REST-based) that allow components to interact in a fully networked architecture, e.g. the work on the Networked Media Open Specifications (NMOS) done in the Advanced Media Workflow Association (AMWA).
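
To give a flavor of such interfaces, the sketch below queries a networked production device over HTTP. The resource paths follow the general pattern of the AMWA NMOS IS-04 Node API, but the host, version and response fields shown here are illustrative only.

  // Query a networked production device over a REST interface.
  const nodeBase = 'http://production-device.local/x-nmos/node/v1.2';

  async function listSenders(): Promise<void> {
    const node = await (await fetch(`${nodeBase}/self`)).json();
    const senders = await (await fetch(`${nodeBase}/senders`)).json();
    console.log(`Node "${node.label}" exposes ${senders.length} senders`);
  }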

This move may not create standardization needs at the application level, Web browsers being mostly used to create relatively simple user interfaces to drive these services. However, it seems worth noting that some companies are also exploring the use of Web browsers as an authoring platform for media content, which results in much more demanding requirements that are not yet fulfilled in today's Web browsers, such as the ability to seek a video frame by frame.
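
For instance, in the absence of a native frame-stepping operation, an authoring tool running in a browser today can only approximate frame-by-frame navigation by seeking in increments of a frame duration that it must know out-of-band, as in the sketch below.

  // Approximate frame-accurate stepping with today's HTMLMediaElement API.
  const video = document.querySelector('video') as HTMLVideoElement;
  const frameRate = 25;  // must be known out-of-band; not exposed by the element

  function stepForward(): void {
    video.pause();
    video.currentTime = Math.min(video.duration, video.currentTime + 1 / frameRate);
  }

  function stepBackward(): void {
    video.pause();
    video.currentTime = Math.max(0, video.currentTime - 1 / frameRate);
  }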

Cloud-based processing

As production moves to IP and as hardware becomes a commodity, the need to use specialized hardware to perform various production and post-production steps on media content is progressively disappearing, replaced by the need to use specialized software, which can run in the cloud (SaaS and PaaS).

Here as well, REST-based interfaces are needed to interact with these services, as done in the Media Cloud and Microservice Architecture (MCMA) project at EBU, which develops simplified REST APIs, some open-source glue code and guidelines to integrate microservices and notably AI services (speech to text, translation, celebrity identification, etc.) in cloud-based media processing workflows to generate metadata.
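
A hypothetical sketch of what triggering such a microservice over REST might look like is shown below; the endpoint, payload shape and job resource are invented for illustration and do not correspond to a specific MCMA API.

  // Hypothetical: submit a speech-to-text job to a cloud media microservice.
  async function requestTranscription(essenceUrl: string): Promise<string> {
    const response = await fetch('https://media-services.example.com/jobs', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        jobType: 'speech-to-text',
        input: { essence: essenceUrl },
        notifyUrl: 'https://workflow.example.com/callbacks/job-done'
      })
    });
    const job = await response.json();
    return job.id;  // poll or await the callback to retrieve generated metadata
  }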

Personalize content

While on-demand viewing is often opposed to linear viewing, it seems interesting to note that there is an on-going convergence between these two types of content consumption mechanisms:

Also, while there remains a clear difference between pre-recorded content and live content, monetization schemes for pre-recorded content can sometimes be quite close to those for live content, as the fear of missing out (FOMO) will lead users to watch newly available content as early as possible. Conversely, live content, especially for shows, may be pre-recorded content in disguise, and monetized as such, e.g. as a way to drive the audience for the next live show.

This convergence means that Media & Entertainment companies are essentially trying to create a user-specific linear viewing experience. Technology-wise, creating personalized content imposes requirements on the content itself, notably on its metadata, on mechanisms to measure and analyze the user's experience, and on mechanisms to create an uninterrupted stream of media content out of heterogeneous media assets.

Content costs money to produce and consumption on the Web is often ad-based. One of the focus areas for this trend is improving Web advertising, noting that the never-ending fight between content providers and ad-blockers is detrimental to everyone. Audience measurement also remains overly complex and error-prone, resulting in ads being displayed "for free". To improve ad conversion rates, content distributors attempt to customize ads per user.
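
One possible building block, sketched below, is to use the Intersection Observer API to count an ad impression only once the ad element has actually been visible in the viewport; the reporting endpoint is a placeholder.

  // Count an impression only once at least half of the ad element is visible.
  const adElement = document.querySelector('.ad-slot') as HTMLElement;

  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.intersectionRatio >= 0.5) {
        // Hypothetical reporting endpoint.
        navigator.sendBeacon('/measurement/ad-viewed', entry.target.id);
        observer.unobserve(entry.target);
      }
    }
  }, { threshold: 0.5 });

  observer.observe(adElement);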

Explore VR/AR and 360° video scenarios

In parallel to the previous trends, media companies are investigating Virtual Reality / Augmented Reality (VR/AR) scenarios, starting with 360° video programs. From a pure media perspective, this remains niche and a research space for now. For instance, as opposed to the trend to improve content quality (see ), the push for VR and 360° video does not really start with content aimed at movie theaters, where most of the money gets spent. That push clearly exists at the production phase though, where real actors, captured in 3D, increasingly perform in front of a green screen.

From an end-user perspective, the VR/AR push will more likely come from interactive content, as VR/AR devices create immersive viewing experiences by their very nature.

That said, VR/AR may take different forms. Immersive live experiences are being prototyped that e.g. allow users to experience a concert or a sporting event as though they were in the stadium. Additional requirements related to media content need to be fulfilled technology-wise to enable such scenarios.

Next Big Thing

Is the depicted convergence too broad? Too specific? Far-fetched?

Combining the identified trends makes it possible to sketch a horizon for Media & Entertainment. This exercise roughly goes under the name of Next Big Thing in C-level circles. It should obviously be taken with a grain of salt:

As far as Media & Entertainment is concerned, the Web platform seems to be in-between Next Big Things. This appears more clearly when looking at trends from a technology perspective. With the exception of VR/AR, the trends mentioned above (see ) are evolutionary in nature: technologies that enable these trends already exist, and additional features that may still be required will appear as incremental improvements to well-deployed technologies. One example is the on-going incubation (at the time of writing) of a codec switching feature for MSE to enable seamless ad-insertion scenarios (see ).
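
A sketch of how such a codec switch might be used, assuming a SourceBuffer created once the MediaSource is open (as in the earlier MSE sketch) and the availability of the changeType() method under incubation:

  // Splice an ad encoded with a different codec into the programme without
  // tearing down playback. Codec strings are placeholders.
  function spliceAd(sourceBuffer: SourceBuffer, adInitSegment: ArrayBuffer): void {
    if (typeof sourceBuffer.changeType === 'function') {
      // Switch the buffer to the ad's codec, then append the ad's init and
      // media segments; playback continues with the same pipeline.
      sourceBuffer.changeType('video/webm; codecs="vp09.00.10.08"');
      sourceBuffer.appendBuffer(adInitSegment);
    } else {
      // Without changeType(), the player must re-create the SourceBuffer,
      // or require ads encoded with the programme's codec.
    }
  }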

In other words, the Web platform has already delivered on its Next Big Thing promise to become a major platform for continuous media experiences. This does not mean that these trends should be viewed as minor or low-priority. Small improvements may end up disrupting the entire Media & Entertainment ecosystem. That said, these changes and disruptions would merely confirm the predominant role that the Web platform has taken for the consumption of continuous media.

Looking ahead, this document predicts that the Next Big Thing will be the convergence of the three main content consumption mechanisms. The convergence between on-demand viewing and linear viewing has already been pointed out (see ). That convergence is driven by a will to offer better user experiences and keep the user engaged. By definition, immersive viewing has the potential to create the most engaging experiences. Up until now though, immersive viewing has remained largely on the sidelines: interactive content follows different production paths, and immersion has essentially been limited to keyboard/gamepad interactions over a rendering display. Innovations in that space have been the prerogative of high-end and native devices (e.g. game consoles). The availability of XR headsets and of ever more natural interaction mechanisms (e.g. voice, gestures), the democratization of devices that can capture spatialized renderings of a live scene, and general improvements to performance in all domains, coupled with the power of the Web as a secure platform to handle interactions with users and as a social and sharing platform, all suggest that the Web platform will be at the heart of the convergence of content consumption mechanisms in the future.

The convergence between continuous media and interactive content is neither easy to achieve nor necessarily easy to conceptualize. In typical interactive content, the user is immersed in scenes that are generated on the fly. The user can navigate the scene freely because the generation is live. The situation is reversed with continuous media, where content is assembled following the instructions of a director and where navigation is by definition limited. The suggested convergence would combine the two worlds to immerse the user in a world that features both directed scenes and freedom of navigation and interaction. This convergence would allow scenarios such as:

Include a more concrete use case scenario to make the convergence more tangible.

Requirements for the Web platform

The identification of precise requirements to enable the convergence of content consumption mechanisms described in the previous section is out of scope for this document. High-level requirements include:

Is the list correct? Are there other high-level requirements?