This document provides an analysis of the Media & Entertainment industry in relation to the Web, and identifies shorter- and longer-term trends that could influence Web technologies.

Introduction

On the Web, HTML5 [[HTML]], Media Source Extensions [[media-source]], Encrypted Media Extensions [[encrypted-media]], and the captioning languages WebVTT [[WEBVTT]] and Timed Text Markup Language [[ttml1]] provide the building blocks that enable continuous media scenarios. These technologies, deployed in mainstream Web browsers today, are used daily to stream content across the globe. Other technologies have been or are being developed to enable other types of continuous experiences such as games.

The goal of this document is to review on-going trends in the Media & Entertainment sector, with a view to identifying technology needs that could lead to the standardization of new Web technologies, and/or to revisions of existing ones.

Scope

The scope of this document is Media & Entertainment, defined here as the part of the industry whose business is to entertain people through continuous experiences.

Add examples of continuous experiences, and examples of experiences that are out of scope of this document.

The term continuous experiences refers to applications that engage the user in some form of consumption/interaction rhythm. In other words, continuous experiences feature immersive content that captures the user's attention and that has an internal timeline. Experiences based on continuous media are obvious examples of continuous experiences, but note that the definition goes further and also encompasses interactive content where the user's behavior influences the progression along the internal timeline, such as Virtual Reality (VR) experiences and games.

For a given piece of content, the term timeline describes a mapping from time to positions in that content. When content is continuous media, the timeline is the media timeline. This document considers other types of content, where the notion of position may be more abstract. For instance, the notion of position in games may be the cumulative history of moves and events that happened since the game started. Content is said to have an internal timeline if it has a timeline that is a continuous function of time, meaning that progress along the timeline is not purely triggered by user interaction but also advances on its own in the absence of it.

The term continuous media is used as defined in the W3C Media & Entertainment Interest Group Charter to mean videos, sound recordings, and their associated technologies such as timed text.

An overview of Media & Entertainment

There are many ways to approach Media & Entertainment. This section looks at the media pipeline, and at categories of content coupled with content consumption mechanisms. A quick look at money flows completes this overview.

The media pipeline

The media pipeline is defined as all the steps needed for a user to experience media content. These steps consist of: pre-production, production, post-production, distribution, and rendering. Each step can be further divided into sub-steps and concepts. Some of them are listed below. This document does not attempt to set precise boundaries and definitions for each of them:

  1. pre-production: screenplay, storyboard, scheduling, etc.
  2. production: media capture, production studio, etc.
  3. post-production: artistic processing to produce a mezzanine file. Includes audio/video mixing, captioning, telecine, special effects, etc.
  4. distribution: transcoding of the mezzanine file contents, audio/video/subtitles association, content protection, etc. to produce container format variants for different distributors, localities and distribution methods; actual distribution through broadcasting or streaming, ad-insertion, etc.
  5. rendering: theater, video player, XR headset, digital signage, etc.

The process of creating media assets, also known as content production, consists of the first three steps of the media pipeline (pre-production, production and post-production).

Media content

Content produced by the Media & Entertainment industry can be divided into three main categories:

These three main categories of content roughly map to three main content consumption mechanisms:

This document will argue that there is a global trend towards convergence of these three main content consumption mechanisms.

Business models

Is describing main money flows a good way to present the industry? Should the document rather (or also) present the different actors (content producers, broadcasters, CDNs, solution providers, device manufacturers, etc.) and e.g. present historical facts that explain regional aspects (public services, technology regulations, etc.)?

Money-wise, content is king. From a high-level perspective, most of the money spent in Media & Entertainment goes to content production. A single movie may cost hundreds of millions of dollars to produce. Tens of millions of dollars can be spent on video games, and many other media productions cost millions of dollars. Most of these costs are incurred for production and post-production.

To a lesser extent, the industry also invests a good amount of money in distribution mechanisms, notably in the deployment of relevant network infrastructures such as Content Delivery Networks (CDN), and in technologies that make it possible to encode high-quality content with a minimal footprint.

Money gets made through:

Media & Entertainment on the Web

In the past few years, investments on the Web have focused on enabling distribution and rendering of media content within Web applications. This explains the creation of the HTMLMediaElement interface in HTML as well as the work on Media Source Extensions (MSE) [[media-source]] to enable adaptive streaming, on Encrypted Media Extensions (EME) [[encrypted-media]] to protect media assets that might have cost millions of dollars to produce, and on captioning languages Timed Text Markup Language (TTML) [[ttml1]] and WebVTT [[WEBVTT]]. With these technologies, the Web has become a major platform for the distribution and consumption of media content.
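
By way of illustration, the sketch below shows the basic MSE pattern that underpins adaptive streaming players: the application fetches media segments itself and feeds them to a SourceBuffer attached to a video element. The codec string and segment URLs are placeholders, not real resources.

  // Minimal MSE playback sketch; segment URLs and codec string are placeholders.
  const video = document.querySelector('video') as HTMLVideoElement;
  const mimeCodec = 'video/mp4; codecs="avc1.42E01E, mp4a.40.2"';

  if ('MediaSource' in window && MediaSource.isTypeSupported(mimeCodec)) {
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', async () => {
      const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
      // Fetch an initialization segment followed by media segments,
      // as a DASH player built on MSE typically would.
      for (const url of ['init.mp4', 'seg-1.m4s', 'seg-2.m4s']) {
        const data = await (await fetch(url)).arrayBuffer();
        await appendSegment(sourceBuffer, data);
      }
      mediaSource.endOfStream();
    });
  }

  // appendBuffer() is asynchronous; wait for 'updateend' before appending more.
  function appendSegment(sourceBuffer: SourceBuffer, data: ArrayBuffer): Promise<void> {
    return new Promise((resolve) => {
      sourceBuffer.addEventListener('updateend', () => resolve(), { once: true });
      sourceBuffer.appendBuffer(data);
    });
  }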

Focus has been on on-demand viewing until now, partly because of the rise of SVOD, which appeals to Web users accustomed to selecting the content they consume, partly because on-demand distribution is easier to address technology-wise than live distribution, and partly because of legacy: the broadcasting industry predates the Web, and adopting new technologies or switching to other business models takes time when an established business is already in place.

Media companies have also been looking at second screen scenarios. This matches W3C's second screen activity to develop the Presentation API [[presentation-api]] and Remote Playback API [[remote-playback]] specifications. The key here is to agree on an open stack of protocols to discover and control second screens. This is what the Second Screen Community Group is working on. The main driving use case for this work is the ability to use one's smartphone to control and stream content to a large screen, possibly through an HDMI dongle. Other second screen scenarios, e.g. that start from broadcast content, are being investigated too.
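
The following sketch illustrates the two approaches, assuming user-triggered buttons and a placeholder receiver page: the Presentation API launches and messages a page rendered on the second screen, while the Remote Playback API renders an existing video element on a remote device.

  // Second-screen sketch, triggered from user gestures.
  // The receiver page URL and content identifier are placeholders.
  const castButton = document.querySelector('#cast') as HTMLButtonElement;
  const remoteButton = document.querySelector('#remote') as HTMLButtonElement;
  const video = document.querySelector('video') as HTMLVideoElement;

  castButton.addEventListener('click', async () => {
    // Presentation API: launch a receiver page on a second screen and
    // exchange messages with it over the resulting connection.
    const request = new PresentationRequest(['player.html']);
    const connection = await request.start();  // prompts the user to pick a display
    connection.send(JSON.stringify({ command: 'play', contentId: 'movie-42' }));
  });

  remoteButton.addEventListener('click', async () => {
    // Remote Playback API: render the existing <video> element on a remote device.
    video.remote.addEventListener('connect', () => { video.play(); });
    await video.remote.prompt();               // prompts the user to pick a device
  });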

The performance of Web applications has also vastly improved over the years, allowing the creation of rich, complex and interactive applications. JavaScript runtimes have become much more efficient and Web Workers [[WebWorkers]] allow background execution. That said, the computation power available to Web applications remains constrained compared to native applications, with processing restricted to the CPU, and inter-thread communication mostly restricted to message-passing.
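
For instance, a minimal sketch of the message-passing model, with a hypothetical analysis task and worker script standing in for real processing:

  // Offload a CPU-heavy task to a Web Worker so the main thread stays responsive.
  // 'analysis-worker.js' and the message shape are hypothetical.
  const framesBuffer = new ArrayBuffer(1024 * 1024);   // stand-in for decoded frame data
  const worker = new Worker('analysis-worker.js');

  // Transfer the buffer instead of copying it; ownership moves to the worker.
  worker.postMessage({ task: 'analyse', frames: framesBuffer }, [framesBuffer]);

  // Inter-thread communication is limited to message passing.
  worker.onmessage = (event: MessageEvent) => {
    console.log('Result from worker:', event.data);
  };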

The pre-production, production and post-production phases of the media pipeline, which happen before distribution and rendering, have not been a priority for media companies at W3C so far. This is by no means surprising as W3C focuses on user agent technologies. Notable exceptions are TTML, which can be used, and is used, as an interchange format, and the work on peer-to-peer technologies [[WebRTC]] which, by definition, spans most steps of the media pipeline, from capture to rendering.

Current trends

The Media & Entertainment industry seems to be pursuing a number of trends in parallel. Some of the trends below are more advanced and explicit than others.

Any missing trend? Any trend that should be dropped? Are trend descriptions clear enough?

Reduce device fragmentation

The Media & Entertainment industry has embraced Web technologies as a way to deliver interactive content and user interfaces along with continuous media. All IP-connected and media-focused Consumer Electronics (CE) devices such as TV displays and set-top boxes now embed Web browsers.

One problem is that the Web has now moved to an evergreen model [[EVERGREEN-WEB]]: technologies evolve daily and new versions of Web browsers are deployed every few weeks to regular computing devices such as laptops, tablets and smartphones.

However, the Media & Entertainment industry needs more stability:

The fragmentation that exists across devices currently impedes the generalization of scenarios that mix continuous media and interactive content. The Media & Entertainment industry has invested in various efforts over the years to reduce that fragmentation and define interactive TV systems (e.g. ATSC, HbbTV, Hybridcast). Work in the Web Media API Community Group on a Web Media API specification [[WEBMEDIAAPI]], in collaboration with the CTA WAVE Project, is on-going to define a baseline of Web technologies supported across all CE devices, along with a test suite based on Web Platform Tests that can be used to certify compliant products.

Improve content quality

Encoding, processing, decoding, rendering, memory, storage, and network capabilities are all on the rise. They allow content providers to produce and distribute higher quality content, using techniques that extend the color space and improve content resolution, e.g. High Dynamic Range (HDR), wide color gamut, Ultra-High Definition (UHD) and 4K, to offer a theater-like experience. That trend affects both audio and video.

While some of these changes happen at lower levels than the application level, some warrant changes within existing Web technologies. Typically, support for a wider color space requires updates to CSS and canvas technologies to encode new colors, as well as new features to describe mapping levels when an application blends content e.g. media content in HDR with an interface that uses RGB.
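
For illustration, the sketch below shows one way an application might target a wider color space in canvas and CSS; note that the colorSpace canvas option and the CSS color() function are relatively recent additions whose availability should be feature-detected.

  // Request a wider color space for 2D canvas rendering and express a
  // Display P3 color in CSS. A fully saturated P3 red lies outside the
  // sRGB gamut that Web content has traditionally assumed.
  const canvas = document.querySelector('canvas') as HTMLCanvasElement;
  const ctx = canvas.getContext('2d', { colorSpace: 'display-p3' });
  if (ctx) {
    ctx.fillStyle = 'color(display-p3 1 0 0)';
    ctx.fillRect(0, 0, canvas.width, canvas.height);
  }

  // The same color used for a UI overlay composited on top of the video.
  const overlay = document.querySelector('.overlay') as HTMLElement;
  overlay.style.color = 'color(display-p3 1 0 0)';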

Move to IP

The generic move to IP, which began a few years ago, is still ongoing. It started with distribution but progressively affects production and post-production as well.

Distribution over IP

Unicast distribution over IP is progressively replacing traditional broadcast technologies. This trend is further accentuated by the fact that the bandwidth available for broadcasting is fading away as underlying radio frequencies get re-allocated to 4G and 5G networks, while at the same time needs for bandwidth increase with improvements to the quality of content (e.g. UHD, HDR).

CDNs have become an essential component of distribution as a result, with various past or on-going efforts on technologies that enable efficient distribution and storage of media content close to the user, and streaming of media content over fluctuating networks. This includes work on common formats such as the Common Media Application Format [[MPEGCMAF]] and Common Encryption [[MPEGCENC]], as well as renewed work on royalty-free video codecs [[AV1]]. It also includes now widely used mechanisms to stream content efficiently over HTTP, e.g. Dynamic Adaptive Streaming over HTTP [[MPEGDASH]] and Media Source Extensions [[MSE]].
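
The adaptation logic itself lives in the application. The sketch below illustrates the kind of simple bitrate-selection heuristic a DASH player built on MSE might use; the representation list and throughput values are invented for the example.

  // Pick the highest-bitrate representation that fits within the measured
  // throughput, with a safety margin. Values are examples only.
  interface Representation {
    bitrate: number;  // bits per second
    url: string;
  }

  function selectRepresentation(
    representations: Representation[],
    measuredThroughput: number,   // bits per second, from previous segment downloads
    safetyFactor = 0.8
  ): Representation {
    const budget = measuredThroughput * safetyFactor;
    const sorted = [...representations].sort((a, b) => a.bitrate - b.bitrate);
    let best = sorted[0];
    for (const r of sorted) {
      if (r.bitrate <= budget) {
        best = r;
      }
    }
    return best;
  }

  // Example: with 4.2 Mb/s measured throughput, the 3 Mb/s representation wins.
  const chosen = selectRepresentation(
    [{ bitrate: 1_000_000, url: 'low.m4s' },
     { bitrate: 3_000_000, url: 'mid.m4s' },
     { bitrate: 6_000_000, url: 'high.m4s' }],
    4_200_000
  );
  console.log(chosen.url);  // "mid.m4s"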

The Media & Entertainment industry is exploring other distribution mechanisms, including the use of peer-to-peer technologies to distribute media content or the use of multicast, with a view to further reducing bandwidth needs and distribution costs, as well as to address scaling issues and reduce the overall latency when distributing live content to millions of users at once.

Most of these activities target lower levels than the application level, and are therefore usually out of scope for W3C. However, most of them need to surface one way or another to Web applications so that applications can implement algorithms based on them. The Media Source Extensions [[MSE]] specification is an obvious example. As such, new requirements that may warrant work on new Web technologies or on new features for existing ones are likely to arise.

Production over IP

On top of distribution, video over IP is also being considered as a replacement for the Serial Digital Interface (SDI). This move makes it possible to envision more flexible production methods. For instance, the ability to produce content remotely without having to send an Outside Broadcast truck (OB van) can significantly reduce the cost of producing live events. One of the main challenges in that area is the need for high-bandwidth connections between the production studio and the venue, so that high-quality audio/video streams can be transmitted, but the democratization of optical fiber means that this is no longer a blocker.

This triggers work on software interfaces (typically REST-based) that allow components to interact in a fully networked architecture, e.g. the work on the Networked Media Open Specifications (NMOS) done in the Advanced Media Workflow Association (AMWA).
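
To give a flavor of such interfaces, the sketch below queries a networked production device over HTTP. The resource paths follow the general pattern of the AMWA NMOS IS-04 Node API, but the host, version and response fields shown here are illustrative only.

  // Query a networked production device over a REST interface.
  const nodeBase = 'http://production-device.local/x-nmos/node/v1.2';

  async function listSenders(): Promise<void> {
    const node = await (await fetch(`${nodeBase}/self`)).json();
    const senders = await (await fetch(`${nodeBase}/senders`)).json();
    console.log(`Node "${node.label}" exposes ${senders.length} senders`);
  }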

This move may not create standardization needs at the application level, Web browsers being mostly used to create relatively simple user interfaces to drive these services. However, it seems worth noting that some companies are also exploring the use of Web browsers as an authoring platform for media content, which results in much more demanding requirements that are not yet fulfilled in today's Web browsers, such as the ability to seek a video frame by frame.
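
For instance, in the absence of a native frame-stepping operation, an authoring tool running in a browser today can only approximate frame-by-frame navigation by seeking in increments of a frame duration that it must know out-of-band, as in the sketch below.

  // Approximate frame-accurate stepping with today's HTMLMediaElement API.
  const video = document.querySelector('video') as HTMLVideoElement;
  const frameRate = 25;  // must be known out-of-band; not exposed by the element

  function stepForward(): void {
    video.pause();
    video.currentTime = Math.min(video.duration, video.currentTime + 1 / frameRate);
  }

  function stepBackward(): void {
    video.pause();
    video.currentTime = Math.max(0, video.currentTime - 1 / frameRate);
  }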

Cloud-based processing

As production moves to IP and as hardware becomes a commodity, the need to use specialized hardware to perform various production and post-production steps on media content is progressively disappearing, replaced by the need to use specialized software, which can run in the cloud (SaaS and PaaS).

Here as well, REST-based interfaces are needed to interact with these services, as done in the Media Cloud and Microservice Architecture (MCMA) project at EBU, which develops simplified REST APIs, some open-source glue code and guidelines to integrate microservices and notably AI services (speech to text, translation, celebrity identification, etc.) in cloud-based media processing workflows to generate metadata.
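
A hypothetical sketch of what triggering such a microservice over REST might look like is shown below; the endpoint, payload shape and job resource are invented for illustration and do not correspond to a specific MCMA API.

  // Hypothetical: submit a speech-to-text job to a cloud media microservice.
  async function requestTranscription(essenceUrl: string): Promise<string> {
    const response = await fetch('https://media-services.example.com/jobs', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        jobType: 'speech-to-text',
        input: { essence: essenceUrl },
        notifyUrl: 'https://workflow.example.com/callbacks/job-done'
      })
    });
    const job = await response.json();
    return job.id;  // poll or await the callback to retrieve generated metadata
  }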

Personalize content

While on-demand viewing is often opposed to linear viewing, it seems interesting to note that there is an on-going convergence between these two types of content consumption mechanisms:

Also, while there remains a clear difference between pre-recorded content and live content, monetization schemes for pre-recorded content can sometimes be quite close to those for live content, as the fear of missing out (FOMO) will lead users to watch newly available content as early as possible. Conversely, live content, especially for shows, may be pre-recorded content in disguise, and monetized as such, e.g. as a way to drive the audience for the next live show.

This convergence means that Media & Entertainment companies are essentially trying to create a user-specific linear viewing experience. Technology-wise, creating personalized content imposes requirements on the content itself, notably on its metadata, on mechanisms to measure and analyze the user's experience, and on mechanisms to create an uninterrupted stream of media content out of heterogeneous media assets.

Content costs money to produce and consumption on the Web is often ad-based. One of the focus areas for this trend is improving Web advertising, noting that the never-ending fight between content providers and ad-blockers is detrimental to everyone. Audience measurement also remains overly complex and error-prone, resulting in ads being displayed "for free". To improve ad conversion rates, content distributors attempt to customize ads per user.
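
One possible building block, sketched below, is to use the Intersection Observer API to count an ad impression only once the ad element has actually been visible in the viewport; the reporting endpoint is a placeholder.

  // Count an impression only once at least half of the ad element is visible.
  const adElement = document.querySelector('.ad-slot') as HTMLElement;

  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.intersectionRatio >= 0.5) {
        // Hypothetical reporting endpoint.
        navigator.sendBeacon('/measurement/ad-viewed', entry.target.id);
        observer.unobserve(entry.target);
      }
    }
  }, { threshold: 0.5 });

  observer.observe(adElement);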

Explore VR/AR and 360° video scenarios

In parallel to the previous trends, media companies are investigating Virtual Reality / Augmented Reality (VR/AR) scenarios, starting with 360° video programs. From a pure media perspective, this remains niche and a research space for now. For instance, as opposed to the trend to improve content quality (see ), the push for VR and 360° video does not really start with content aimed at movie theaters, where most of the money gets spent. That push clearly exists at the production phase though, where real actors, captured in 3D, increasingly perform in front of a green screen.

From an end-user perspective, the VR/AR push will more likely come from interactive content, as VR/AR devices create immersive viewing experiences by their very nature.

That said, VR/AR may take different forms. Immersive live experiences are being prototyped that e.g. allow users to experience a concert or a sporting event as though they were in the stadium. Additional requirements related to media content need to be fulfilled technology-wise to enable such scenarios.

Next Big Thing

Is the depicted convergence too broad? Too specific? Far-fetched?

Combining the identified trends makes it possible to sketch a horizon for Media & Entertainment. This exercise roughly goes under the name of Next Big Thing in C-level circles. It should obviously be taken with a grain of salt:

As far as Media & Entertainment is concerned, the Web platform seems to be in-between Next Big Things. This appears more clearly when looking at trends from a technology perspective. With the exception of VR/AR, the trends mentioned above (see ) are evolutionary in nature: technologies that enable these trends already exist, and additional features that may still be required will appear as incremental improvements to well-deployed technologies. One example is the on-going incubation (at the time of writing) of a codec switching feature for MSE to enable seamless ad-insertion scenarios (see ).
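
A sketch of how such a codec switch might be used, assuming a SourceBuffer created once the MediaSource is open (as in the earlier MSE sketch) and the availability of the changeType() method under incubation:

  // Splice an ad encoded with a different codec into the programme without
  // tearing down playback. Codec strings are placeholders.
  function spliceAd(sourceBuffer: SourceBuffer, adInitSegment: ArrayBuffer): void {
    if (typeof sourceBuffer.changeType === 'function') {
      // Switch the buffer to the ad's codec, then append the ad's init and
      // media segments; playback continues with the same pipeline.
      sourceBuffer.changeType('video/webm; codecs="vp09.00.10.08"');
      sourceBuffer.appendBuffer(adInitSegment);
    } else {
      // Without changeType(), the player must re-create the SourceBuffer,
      // or require ads encoded with the programme's codec.
    }
  }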

In other words, the Web platform has already delivered on its Next Big Thing promise to become a major platform for continuous media experiences. This does not mean that these trends should be viewed as minor or low-priority. Small improvements may end up disrupting the entire Media & Entertainment ecosystem. That said, these changes and disruptions would merely confirm the predominant role that the Web platform has taken for the consumption of continuous media.

Looking ahead, this document predicts that the Next Big Thing will be the convergence of the three main content consumption mechanisms. The convergence between on-demand viewing and linear viewing has already been pointed out (see ). That convergence is driven by a will to offer better user experiences and keep the user engaged. By definition, immersive viewing has the potential to create the most engaging experiences. Up until now though, immersive viewing has remained largely on the sidelines: interactive content follows different production paths, and immersion has essentially been limited to keyboard/gamepad interactions over a rendering display. Innovations in that space have been the prerogative of high-end and native devices (e.g. game consoles). The availability of XR headsets and of ever more natural interaction mechanisms (e.g. voice, gestures), the democratization of devices that can capture spatialized renderings of a live scene, and general improvements to performance in all domains, coupled with the power of the Web as a secure platform to handle interactions with users and as a social and sharing platform, all suggest that the Web platform will be at the heart of the convergence of content consumption mechanisms in the future.

The convergence between continuous media and interactive content is neither easy to achieve nor necessarily easy to conceptualize. In typical interactive content, the user is immersed in scenes that are generated on the fly. The user can navigate the scene freely because the generation is live. The situation is reversed with continuous media, where content is assembled following the instructions of a director and where navigation is by definition limited. The suggested convergence would combine the two worlds to immerse the user in a world that features both directed scenes and freedom of navigation and interaction. This convergence would allow scenarios such as:

Include a more concrete use case scenario to make the convergence more tangible.

Requirements for the Web platform

The identification of precise requirements to enable the convergence of content consumption mechanisms described in the previous section is out of scope for this document. High-level requirements include:

Is the list correct? Are there other high-level requirements?