This document defines how a user's display, or parts thereof, can be used as the source of a media stream using getDisplayMedia, an extension to the Media Capture API [[!GETUSERMEDIA]].

This document is not complete. It is subject to major changes and, while early experimentation is encouraged, it is therefore not intended for implementation.

Introduction

This document describes an extension to the Media Capture API [[GETUSERMEDIA]] that enables the acquisition of a user's display, or part thereof, in the form of a video stream. This enables a number of applications, including screen sharing using WebRTC [[WEBRTC]].

This feature has significant security implications. Applications that use this API to access information that is displayed to users could access confidential information from other origins if that information is under the control of the application. This includes content that would otherwise be inaccessible due to the protections offered by the user agent sandbox.

This document concerns itself primarily with the capture of video, but the general mechanisms defined here could be extended to other types of media, of which audio [[GETUSERMEDIA]] and depth [[MEDIACAPTURE-DEPTH]] are currently defined.

This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.

Implementations that use ECMAScript [[ECMA-262]] to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [[!WEBIDL]], as this specification uses that specification and terminology.

Example

The following example demonstrates a request for display capture using the navigator.getDisplayMedia method defined in this document.

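// Assumes videoElement refers to an existing video element in the page.
async function startCapture(videoElement) {
  try {
    // Prompts the user to choose a display surface to share.
    const mediaStream = await navigator.getDisplayMedia({video: true});
    videoElement.srcObject = mediaStream;
  } catch (e) {
    console.log('Unable to acquire screen capture: ' + e);
  }
}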

Terminology

This document uses the definitions of MediaStreamTrack and ConstrainablePattern from [[!GETUSERMEDIA]].

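Screen capture encompasses the capture of several different types of screen-based surfaces. Collectively, these are referred to as display surfaces, of which this document defines the following types: a monitor display surface (a physical display, or collection of physical displays), a window display surface (a single application window), an application display surface (the entire collection of windows for an application), and a browser display surface (a single browser window).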

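This document draws a distinction between two variants of each type of display surface: a logical display surface, which is the surface that the operating system makes available to an application for rendering, and a visible display surface, which is the portion of a logical display surface that is actually rendered to a monitor.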

Some operating systems permit windows from different applications to occlude other windows, in whole or part, so the visible display surface is a strict subset of the logical display surface.

The terms permission, retrieve the permission state, prompt the user to choose, and create a permission storage entry are defined in [[!permissions]].

Capturing Displayed Media

Capture of displayed media is enabled through the addition of a new getDisplayMedia method on the Navigator interface, which is similar to getUserMedia [[!GETUSERMEDIA]].

Though all constraints are supported for display device capture, not all constraints will have an effect. Usually, this is because a display device cannot alter its behavior in a way that the constraint implies, just as a video camera cannot alter volume.

For instance, it is unlikely that setting the following constraints will have any useful effect (other than an overconstrained error): aspectRatio, facingMode, and echoCancellation.

See Device Identifiers for information on the deviceId and groupId constraints.

Navigator Additions

partial interface Navigator {
    Promise<MediaStream> getDisplayMedia (optional MediaStreamConstraints constraints);
};
getDisplayMedia

Prompts the user for permission to live-capture their display.

This method is similar to getUserMedia, except that it acquires media from one display device chosen by the end-user each time. The user agent MUST let the end-user choose which display surface to share out of all available choices every time, and MUST NOT use constraints to limit that choice. Instead, constraints MUST be applied to the media chosen by the user, only after they have made their selection. This prevents an application from influencing the selection of sources; see Constraining Display Surface Selection for details.

In addition to drawing from a different set of sources and requiring user selection, getDisplayMedia also differs from getUserMedia in that "granted" permissions cannot be persisted.

When the getDisplayMedia() method is called, the User Agent MUST run the following steps:

  1. Let constraints be the method's first argument.

  2. For each member present in constraints whose value, value, is a dictionary, run the following steps:

    1. If value contains a member named advanced, return a promise rejected with a newly created TypeError.

    2. If value contains a member which in turn is a dictionary containing a member named either min or exact, return a promise rejected with a newly created TypeError.

  3. Let requestedMediaTypes be the set of media types in constraints with either a dictionary value or a value of true.

  4. If requestedMediaTypes is the empty set, set requestedMediaTypes to a set containing "video".

  5. If the current settings object's responsible document is NOT fully active, return a promise rejected with a DOMException object whose name attribute has the value InvalidStateError.

  6. If the current settings object's responsible document is NOT allowed to use the feature indicated by Feature Policy [TBD], return a promise rejected with a DOMException object whose name attribute has the value SecurityError.

  7. Let originIdentifier be the current settings object's responsible browsing context's [[!HTML52]] top-level browsing context's active document's origin.

  8. If the current settings object's origin is different from originIdentifier, set originIdentifier to the result of combining originIdentifier and the current settings object's origin.

  9. Let p be a new promise.

  10. Run the following steps in parallel:

    1. For each media type T in requestedMediaTypes,

      1. If no sources of type T are available, reject p with a new DOMException object whose name attribute has the value NotFoundError.

      2. Retrieve the permission state for obtaining sources of type T in the current browsing context. If the permission state is "denied", jump to the step labeled Permission Failure below.

    2. Optionally, e.g., based on a previously-established user preference, for security reasons, or due to platform limitations, jump to the step labeled Permission Failure below.

    3. For the origin identified by originIdentifier, prompt the user to choose a display device, with a PermissionDescriptor named "display", resulting in a set of provided media.

      The provided media MUST include precisely one track of each media type in requestedMediaTypes. The devices chosen MUST be the ones determined by the user. Once selected, the source of a MediaStreamTrack MUST NOT change.

      User Agents are encouraged to warn users against sharing browser display devices as well as monitor display devices where browser windows are visible, or otherwise try to discourage their selection on the basis that these represent a significantly higher risk when shared.

      If the result of the request is "granted", then for each device that is sourcing the provided media, using a stable and private id for the device, deviceId, set [[\devicesLiveMap]][deviceId] to true, if it isn’t already true, and set the [[\devicesAccessibleMap]][deviceId] to true, if it isn’t already true.

      The User Agent MUST NOT create a permission storage entry with a value of "granted".

      If the result is "denied", jump to the step labeled Permission Failure below. If the user never responds, this algorithm stalls on this step.

      If the user grants permission but a hardware error such as an OS/program/webpage lock prevents access, reject p with a new DOMException object whose name attribute has the value NotReadableError and abort these steps.

      If the result is "granted" but device access fails for any reason other than those listed above, reject p with a new DOMException object whose name attribute has the value AbortError and abort these steps.

    4. Let stream be the MediaStream object for which the user granted permission.

    5. Run the ApplyConstraints algorithm on all tracks in stream with the appropriate constraints. Should this fail, let failedConstraint be the result of the algorithm that failed, and let message be either undefined or an informative human-readable message, and then reject p with a new OverconstrainedError created by calling OverconstrainedError(failedConstraint, message).

    6. Resolve p with stream and abort these steps.

    7. Permission Failure: Reject p with a new DOMException object whose name attribute has the value NotAllowedError.

  11. Return p.
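
The rejection reasons named in the steps above can be told apart by the name of the rejected value. The following sketch is illustrative only: describeCaptureFailure and CAPTURE_FAILURES are hypothetical names, and the explanatory strings are assumptions that paraphrase the steps rather than text defined by this document.

```javascript
// Hypothetical lookup table: error names used by the getDisplayMedia()
// steps, each paired with an illustrative (non-normative) explanation.
const CAPTURE_FAILURES = {
  TypeError: 'constraints used "advanced", "min" or "exact"',
  InvalidStateError: 'the document is not fully active',
  SecurityError: 'the document is not allowed to use this feature',
  NotFoundError: 'no source of a requested media type is available',
  NotAllowedError: 'permission was denied by the user or the user agent',
  NotReadableError: 'a hardware or operating system error prevented access',
  AbortError: 'device access failed for another reason',
  OverconstrainedError: 'a constraint could not be applied to the chosen source',
};

// Hypothetical helper: turns a rejection value into a readable message.
function describeCaptureFailure(error) {
  const name = error && error.name;
  if (Object.prototype.hasOwnProperty.call(CAPTURE_FAILURES, name)) {
    return `${name}: ${CAPTURE_FAILURES[name]}`;
  }
  return `unexpected failure: ${name}`;
}
```

An application could, for instance, pass the value caught in the earlier example to describeCaptureFailure before logging it.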

Min and exact constraints are disallowed by getDisplayMedia(). The max constraint type lets a web application provide a maximum envelope for constrainable properties like width and height, should the end-user resize a window or browser surface while it is being captured.
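
The constraint checks in steps 2 to 4 of the algorithm can be sketched as a plain function. This is a minimal, non-normative illustration: requestedMediaTypes and isDictionary are hypothetical names, and the function throws a TypeError where the algorithm would return a rejected promise.

```javascript
// Hypothetical helper: true for plain dictionary-like values.
function isDictionary(value) {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical sketch of steps 2-4: validates constraints and returns the
// set of requested media types. Throws TypeError where the algorithm would
// reject the returned promise.
function requestedMediaTypes(constraints = {}) {
  const requested = new Set();
  for (const [kind, value] of Object.entries(constraints)) {
    if (isDictionary(value)) {
      // Step 2.1: "advanced" is not allowed.
      if ('advanced' in value) {
        throw new TypeError(`"advanced" is not allowed in ${kind} constraints`);
      }
      // Step 2.2: a nested "min" or "exact" is not allowed.
      for (const [name, member] of Object.entries(value)) {
        if (isDictionary(member) && ('min' in member || 'exact' in member)) {
          throw new TypeError(`"min" and "exact" are not allowed for ${name}`);
        }
      }
      requested.add(kind); // Step 3: dictionary value.
    } else if (value === true) {
      requested.add(kind); // Step 3: value of true.
    }
  }
  // Step 4: default to video when nothing was requested.
  if (requested.size === 0) requested.add('video');
  return requested;
}
```

Under this sketch, {video: {width: {max: 1280}, height: {max: 720}}} is accepted and yields a set containing "video", while {video: {width: {min: 640}}} is refused with a TypeError.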

Constraining Display Surface Selection

Because constraints are not accepted for source selection, the only fingerprinting surface that getDisplayMedia provides is whether audio, video, or both audio and video display sources are present. (This is a fingerprinting vector.)

New Constraints for Captured Display Surfaces

The following new constraints are defined that allow an application to observe the properties of the selected display surface. Since the source of media cannot be changed after a MediaStreamTrack has been returned and constraints do not affect the selection of display surfaces, these constraints cannot be changed by an application.

The displaySurface constraint allows an application to observe the type of display surface that is being captured.

The logicalSurface constraint allows an application to observe whether the surface that is captured is a logical display surface, rather than the visible display surface.

The cursor constraint allows an application to specify if the cursor should be included in the captured display.

Extensions to MediaTrackConstraintSet

partial dictionary MediaTrackConstraintSet {
    ConstrainDOMString displaySurface;
    ConstrainBoolean   logicalSurface;
    ConstrainDOMString cursor;
};
displaySurface of type ConstrainDOMString

The type of display surface that is being captured. This assumes values from the DisplayCaptureSurfaceType enumeration.

logicalSurface of type ConstrainBoolean

A value of true indicates capture of a logical display surface; a value of false indicates capture of a visible display surface.

cursor of type ConstrainDOMString

Assumes values from the CursorCaptureConstraint enumeration that determines if and when the cursor is included in the captured display surface.

DisplayCaptureSurfaceType

The DisplayCaptureSurfaceType enumeration describes the different types of display surface.

enum DisplayCaptureSurfaceType {
    "monitor",
    "window",
    "application",
    "browser"
};
Enumeration description
monitor: a monitor display surface, physical display, or collection of physical displays
window: a window display surface, or single application window
application: an application display surface, or entire collection of windows for an application
browser: a browser display surface, or single browser window
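
As an informal illustration, an application could report which kind of surface a captured track represents. The sketch below assumes that the track's settings, as returned by getSettings(), expose the displaySurface and logicalSurface properties; describeSurface and SURFACE_DESCRIPTIONS are hypothetical names.

```javascript
// Descriptions paraphrased from the DisplayCaptureSurfaceType enumeration.
const SURFACE_DESCRIPTIONS = {
  monitor: 'physical display, or collection of physical displays',
  window: 'single application window',
  application: 'entire collection of windows for an application',
  browser: 'single browser window',
};

// Hypothetical helper: summarizes the surface captured by a video track.
// Assumes track.getSettings() includes displaySurface and logicalSurface.
function describeSurface(track) {
  const { displaySurface, logicalSurface } = track.getSettings();
  if (!Object.prototype.hasOwnProperty.call(SURFACE_DESCRIPTIONS, displaySurface)) {
    return 'unknown display surface';
  }
  const variant = logicalSurface ? 'logical' : 'visible';
  return `${variant} ${displaySurface} surface (${SURFACE_DESCRIPTIONS[displaySurface]})`;
}
```

The helper only reads settings; consistent with the text above, it does not attempt to change which surface is captured.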

CursorCaptureConstraint

The CursorCaptureConstraint enumerates the conditions under which the cursor is captured.

enum CursorCaptureConstraint {
    "never",
    "always",
    "motion"
};
Enumeration description
never: a never cursor capture constraint omits the cursor from the captured display surface.
always: an always cursor capture constraint includes the cursor in the captured display surface.
motion: a motion cursor capture constraint includes the cursor in the captured display surface when the cursor/pointer is moved. The captured cursor is removed when there is no further movement of the pointer/cursor for a certain period of time, as determined by the user agent.
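
The three values can be read as a per-frame decision about whether to draw the cursor. The following is a minimal sketch under stated assumptions, not a required implementation: shouldDrawCursor is a hypothetical name, and the default idle timeout of 2000 milliseconds is invented for illustration, since this document leaves that period to the user agent.

```javascript
// Hypothetical sketch: decides whether the cursor is drawn into a frame.
//   mode          - a CursorCaptureConstraint value
//   lastMoveMs    - time of the last pointer movement, in milliseconds
//   nowMs         - time of the frame being captured, in milliseconds
//   idleTimeoutMs - assumed idle period; the real value is up to the user agent
function shouldDrawCursor(mode, lastMoveMs, nowMs, idleTimeoutMs = 2000) {
  switch (mode) {
    case 'never':
      return false;
    case 'always':
      return true;
    case 'motion':
      // Drawn only while the pointer has moved recently.
      return nowMs - lastMoveMs < idleTimeoutMs;
    default:
      throw new TypeError(`unknown cursor mode: ${mode}`);
  }
}
```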

Device Identifiers

Each potential source of capture is treated by this API as a discrete media source. However, display capture sources MUST NOT be enumerated by enumerateDevices, since this would reveal too much information about the host system.

Display capture sources therefore cannot be selected with the deviceId constraint, since this would allow applications to influence selection; setting the deviceId constraint can only cause the resulting MediaStreamTrack to become overconstrained.

A display capture source is represented in the MediaStreamTrack API as having deviceId and groupId attributes that are randomized each time a MediaStreamTrack is connected. These values cannot duplicate any existing values.

This exposed deviceId identifier is not to be confused with the stable and private id of the same name used in algorithms to implement privacy indicators.
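
One way to picture the randomized identifiers is shown below. This is a sketch only: freshIdentifier and identifiersForNewTrack are hypothetical names, the identifier format is an assumption, and Math.random() is used purely to keep the example short where a user agent would be expected to use a cryptographically strong source.

```javascript
// Hypothetical sketch: returns an identifier that does not duplicate any
// value in existingIds, and records it there. Math.random() is a stand-in
// for a cryptographically strong random source.
function freshIdentifier(existingIds) {
  let id;
  do {
    id = Array.from({ length: 4 }, () =>
      Math.floor(Math.random() * 0x100000000).toString(16).padStart(8, '0')
    ).join('');
  } while (existingIds.has(id));
  existingIds.add(id);
  return id;
}

// Hypothetical sketch: run each time a MediaStreamTrack is connected to a
// display capture source, so identifiers never repeat across connections.
function identifiersForNewTrack(existingIds) {
  return {
    deviceId: freshIdentifier(existingIds),
    groupId: freshIdentifier(existingIds),
  };
}
```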

Privacy Indicator Requirements

This specification extends the Privacy Indicator Requirements of getUserMedia to include getDisplayMedia.

References in this specification to [[\devicesLiveMap]], [[\devicesAccessibleMap]], and [[\kindsAccessibleMap]] refer to the definitions already created to support Privacy Indicator Requirements for getUserMedia.

For each kind of device that getDisplayMedia exposes, using a stable and private id for the device, deviceId, set kind to "Display" + kind, and do the following:

Then, given the new definitions above, the requirements on the User Agent are those specified in Privacy Indicator Requirements of getUserMedia.

Even though there's a single permission descriptor for getDisplayMedia, the above definitions distinguish by kind to enable user agents to implement privacy indicators that show the end-user the specific kinds of display sources that are being shared at any point.

Since this specification forbids user agents from persisting "granted" permissions, only the "Live" indicators are significant.

The User Agent MUST NOT fire the devicechange event based on changes in the set of available sources from getDisplayMedia.

Security and Permissions

This section is informative; however, it notes some serious risks to platform security if the advice it contains is not adhered to.

This is consistent with other documents, but the absence of strong normative language here is a little worrying.

The risks to user privacy and security posed by capture of displayed content are twofold. The immediate and obvious risk is that users inadvertently share content that they did not wish to share, or might not have realized would be shared.

Display capture presents a less obvious risk to the cross-site request forgery protections offered by the browser sandbox. Display and capture of information that is also under the control of an application, even indirectly, can allow that application to access information that would otherwise be inaccessible to it directly. For example, the canvas API does not permit sampling of a canvas, or conversion to an accessible form if it is not origin-clean [[2DCONTEXT]].

This issue is discussed in further detail in [[!RTCWEB-SECURITY-ARCH]] and [[!RTCWEB-SECURITY]].

Display capture that includes browser windows, particularly those that are under any form of control by the application, risks violation of these basic security protections. This risk is not entirely contained to browser windows, since control channels might exist between browser applications and other applications, depending on the operating system. The key consideration is whether the captured display surface could be somehow induced to present information that would otherwise be secret from the application that is receiving the resulting media.

Capturing Logical or Visible Display Surfaces

Capture of logical display surfaces creates the potential for content to be shared without the user being aware of it. A logical display surface might render information that a user did not intend to expose. This can be more easily recognized if the information is visible. Such means are likely ineffectual against a machine, but a human recipient is less able to process content that appears only briefly.

Information that is not currently rendered to the screen SHOULD be obscured in captures unless the application has been specifically authorized to access that content (this might require elevated permissions).

How obscured areas of the logical display surface are captured to produce a visible display surface capture MAY vary. Some applications, like presentation software, benefit from having obscured portions of the screen render the image that appeared prior to being obscured. Freezing images can cause visual artifacts for changing content, or hide the fact that content is being obscured. Note that frozen portions of a capture can be incorrectly perceived as a bug. Alternatively, obscured areas might be replaced with content that marks them as being obscured, such as a grey color or hatching.
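
The last alternative, replacing obscured areas with a marker color, can be sketched as follows. This is an informal illustration rather than a description of any implementation: maskObscuredAreas is a hypothetical name, the frame is assumed to be an object with width, height and RGBA data in the style of ImageData, and the rectangles and grey level are assumed inputs.

```javascript
// Hypothetical sketch: returns a copy of frame in which every obscured
// rectangle is painted opaque grey. The input frame is left unmodified.
//   frame - { width, height, data: Uint8ClampedArray of RGBA bytes }
//   rects - array of { x, y, width, height } obscured regions
function maskObscuredAreas(frame, rects, grey = 128) {
  const data = new Uint8ClampedArray(frame.data); // copies the pixel data
  for (const rect of rects) {
    // Clip each rectangle to the frame bounds.
    const x0 = Math.max(0, rect.x);
    const y0 = Math.max(0, rect.y);
    const x1 = Math.min(frame.width, rect.x + rect.width);
    const y1 = Math.min(frame.height, rect.y + rect.height);
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const i = (y * frame.width + x) * 4;
        data[i] = data[i + 1] = data[i + 2] = grey;
        data[i + 3] = 255; // fully opaque
      }
    }
  }
  return { width: frame.width, height: frame.height, data };
}
```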

Some systems MAY only capture the logical display surface. Devices with small screens, for instance, do not typically have the concept of a window, and render applications in full screen modes only. These systems might provide a capture of an application that is not currently visible, which could be unusable without capturing the logical display surface.

An important consideration when capturing a window or other display surface that is partially transparent is that content from the background might be shared. A user agent MUST NOT capture content from the background of a captured display surface.

Authorizing Display Capture

This document recommends that implementations provide additional limitations on the mechanisms used to affirm user consent. These limitations are designed to mitigate the security and privacy risks that the API poses.

Two forms of consent interaction are described: active user consent and a range of elevated permissions. These are non-normative recommendations only.

Active User Consent

Active user consent is sufficient where there is little or no risk of an application gaining information that the user did not intend to share. These are the cases where the application that requests capture has no control over what is rendered to the captured display surface.

To prevent applications from limiting the available choices presented to a user with the goal of promoting a particular choice, the getDisplayMedia API does not permit the use of constraints to narrow the set of options presented.

Elevated Permissions

It is strongly advised that elevated permissions be required to access any display surface that might be used to circumvent cross-origin protections for content. The key goal of this consent process is not just to demonstrate that a user intends to share content, but also to determine that the user exhibits an elevated level of trust in the application that is being granted access.

Several different controls might be provided to grant elevated permissions. This section describes several different capabilities that could be independently granted. A user agent might opt to prohibit access to any capability that requires elevated permissions.

If access to these surfaces is supported, it is strongly advised that any mechanism to acquire elevated permissions not rely solely on simple prompts for user consent. Any action needs to ensure that a decision to authorize an application with elevated privileges is deliberate. For instance, a user agent might require a process equivalent to software installation to signify that user consent for elevated permissions is granted.

An elevated permissions experience could allow the user agent to communicate the risks associated with enabling this feature, or at least to convey the need for augmented trust in the application.

Note that elevated permissions are not a substitute for active user consent. It is advised that user agents still present users with the ability to select what is shared, even for applications that have elevated permissions.

Capabilities Depending on Elevated Permissions

Elevated permissions are recommended as a prerequisite for access to capture of monitor or browser display surfaces. Note that capture of a complete monitor is included because this could include a window from the user agent.

Similarly, elevated permissions are a recommended prerequisite for access to logical display surfaces, where that would not ordinarily be provided.
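
The two recommendations above can be summarized as a small policy check. The sketch is non-normative: needsElevatedPermissions is a hypothetical name, and the platformOnlyCapturesLogical flag is an assumption used to model platforms on which logical display surfaces are ordinarily provided.

```javascript
// Hypothetical policy check: does capturing this surface warrant
// elevated permissions under the recommendations in this section?
//   surface.displaySurface - a DisplayCaptureSurfaceType value
//   surface.logicalSurface - true when a logical display surface is captured
//   platformOnlyCapturesLogical - assumed flag for platforms where logical
//     surfaces are ordinarily the only ones available
function needsElevatedPermissions(surface, platformOnlyCapturesLogical = false) {
  const { displaySurface, logicalSurface } = surface;
  // A monitor could include a user agent window, so it is treated like one.
  if (displaySurface === 'monitor' || displaySurface === 'browser') {
    return true;
  }
  // Logical surfaces, where they would not ordinarily be provided.
  if (logicalSurface && !platformOnlyCapturesLogical) {
    return true;
  }
  return false;
}
```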

A user agent SHOULD persist any elevated permissions that are granted to an origin. An elevated permissions process in part relies on its novelty to ensure that it correctly captures user intent.

Feedback and Interface During Capture

Implementations are advised to provide user feedback and control mechanisms similar to those offered to users when sharing a camera or microphone, as recommended in [[GETUSERMEDIA]].

It is important that a user be aware that content is being shared when content is actively being captured. User agents are advised to display a prominent indicator while content is being captured. In addition to an indicator, a user agent is advised to provide a means to learn precisely what is being shared; while this capability is trivially provided by an application by rendering the captured content, this information allows a user to accurately assess what is being shared.

In addition to feedback mechanisms, a means for the user to stop any active capture is advisable.