This document describes an extended set of use cases motivating the development of additional WebRTC APIs, as well as the requirements derived from those use cases.
To motivate the development of WebRTC, the IETF RTCWEB WG developed [[?RFC7478]]. This document describes extended use cases motivating the development of additional WebRTC APIs and the requirements deriving from those use cases. The use cases fall into one of two categories: enhancements to use cases already covered in [[?RFC7478]], and new use cases which are not supported in WebRTC [[?WEBRTC]] without extensions.
The use cases in this section improve upon use cases described in [[?RFC7478]].
[[?RFC7478]] Section 2.3.12 describes a use case involving a multiparty online game with voice communications. In these scenarios, reducing the time to join the game and receive media is important. To minimize this delay, ICE enhancements are desirable, such as the ability to control candidate gathering and pruning. Conference establishment time is also reduced by allowing a participant to broadcast a configuration to a “room” abstraction (maintained on a server), with other room participants responding directly, which avoids a separate discovery step. In addition, managing audio quality and latency in a fair manner between multiple connections prevents queue buildup. Supporting this enhancement adds the following requirements:
Requirement ID | Description |
---|---|
N01 | The user agent can control candidate gathering and pruning, limiting the networks on which candidates are gathered, the types of candidates, etc. |
N02 | The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment. |
N03 | Congestion control must be able to manage audio quality and latency in a fair manner between multiple connections. |
Experience: This use case has been implemented by a gaming service utilizing [[?ORTC]].
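As an illustration of the kind of control N01 asks for, the following sketch prunes gathered candidates by type before they are signaled. It assumes candidates are available as candidate-attribute strings containing a `typ` token; the names `candidateTypeOf` and `pruneCandidates` are hypothetical helpers, not part of any WebRTC API, and limiting the networks on which gathering happens would still require user agent support.

```typescript
// Candidate types defined for ICE.
type IceCandidateType = "host" | "srflx" | "prflx" | "relay";

const KNOWN_TYPES: readonly IceCandidateType[] = ["host", "srflx", "prflx", "relay"];

// Return the value following the "typ" token of a candidate-attribute string,
// or null if the string carries no recognized candidate type.
function candidateTypeOf(candidate: string): IceCandidateType | null {
  const tokens = candidate.trim().split(/\s+/);
  const idx = tokens.indexOf("typ");
  if (idx < 0) return null;
  const value = tokens[idx + 1];
  const found = KNOWN_TYPES.find((t) => t === value);
  return found === undefined ? null : found;
}

// Keep only the candidates whose type is in the allowed set (hypothetical helper).
function pruneCandidates(
  candidates: readonly string[],
  allowed: ReadonlySet<IceCandidateType>,
): string[] {
  return candidates.filter((c) => {
    const t = candidateTypeOf(c);
    return t !== null && allowed.has(t);
  });
}
```

An application that wants to hide local addresses could, for example, pass a set containing only `"relay"`.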
References:
[[?RFC7478]] Section 2.3.6 describes a simple communications service where the user changes access network during the session. This use case is enhanced by being able to ring multiple endpoints simultaneously, as well as to re-route media over an alternate path (potentially taking network cost into account) without need for signaling.
An additional enhancement is to provide management of the user experience at both ends of a call during interruptions to the media flow caused by other activities taking higher priority on the smartphone.
Requirement ID | Description |
---|---|
N02 | The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment. |
N04 | The ICE agent must be able to maintain multiple candidate pairs and move traffic between them. |
N05 | The ICE agent must be able to take the network cost into account when considering re-routing. |
N30 | The user agent must provide the ability to re-establish media after an interruption. |
N31 | The user agent must provide notification of a media interruption caused by the OS (e.g. GSM incoming call) (cf. on hold music), and support the use of a datachannel to inform the remote peer of the interruption. |
N32 | The user agent must provide the ability to 'park' a connection such that it can be retrieved and continued by a newly loaded page to prevent accidental 'browsing away' from dropping a call irretrievably. |
References:
Experience: This use case has been implemented by multiple native smartphone apps with call-kit integration.
[[?RFC7478]] Section 2.4.3.1 describes a use case involving Multiparty Video Communications with a central conferencing server. In such a use case, clients with disparate capabilities such as differing bandwidth availability, screen size and maximum displayable frame rate may participate in the same conference. In such a situation it is advantageous to support Scalable Video Coding (SVC). Encoding with temporal scalability is supported by several browsers today and is utilized by most centralized conferencing services.
It is expected that spatial scalability (supported by VP9 and AV1) will become more popular with time. In this use case, if the desired video codec is known beforehand and participants are muted by default (as in a very large meeting), it is desirable to allow new participants to start receiving immediately, without negotiation. Supporting this enhancement adds the following requirements:
Requirement ID | Description |
---|---|
N06 | The user agent must be able to encode and decode video utilizing temporal scalability and (if supported by the chosen codec) spatial scalability. |
N07 | A user agent can receive audio/video without requiring construction of a corresponding sender object. |
N08 | It is possible to select the sending and/or receiving codec as well as RTCP parameters and header extensions without negotiation. |
N09 | The user agent must be able to control robustness (RTX, RED, FEC) applied to individual simulcast and SVC layers. |
N24 | Content Security Policy (CSP) support for WebRTC. |
This use case has been implemented by conferencing services utilizing [[?ORTC]], as well as proprietary additions to [[?WEBRTC]].
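To make the layer terminology in N06 concrete, the sketch below parses a scalability mode identifier into spatial and temporal layer counts and lists the frame rate each temporal layer would deliver. It assumes identifiers of the form used by the W3C SVC extension for WebRTC (for example `L1T3`, `L3T3_KEY`, `S2T1h`) and a dyadic temporal structure in which each lower layer halves the frame rate; both the helper names and the dyadic assumption are illustrative, not taken from this document.

```typescript
interface LayerCounts {
  spatial: number;
  temporal: number;
  // True when there are several spatial layers that depend on each other ("L" modes).
  dependentSpatialLayers: boolean;
}

// Extract layer counts from an identifier such as "L3T3_KEY" or "S2T1h".
// Suffixes (h, _KEY, _KEY_SHIFT) are accepted but not interpreted here.
function parseScalabilityMode(mode: string): LayerCounts | null {
  const m = /^([LS])(\d)T(\d)/.exec(mode);
  if (m === null) return null;
  const kind = m[1];
  const spatial = Number(m[2]);
  const temporal = Number(m[3]);
  if (spatial < 1 || temporal < 1) return null;
  return { spatial, temporal, dependentSpatialLayers: kind === "L" && spatial > 1 };
}

// Frame rate available when decoding up to each temporal layer, lowest first,
// assuming a dyadic structure (each lower layer has half the frame rate).
function temporalFrameRates(maxFps: number, temporalLayers: number): number[] {
  const rates: number[] = [];
  for (let i = 0; i < temporalLayers; i++) {
    rates.push(maxFps / 2 ** (temporalLayers - 1 - i));
  }
  return rates;
}
```

A conferencing server could use such a table to forward only the temporal layers that a client with a limited display frame rate can use.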
Several new use cases relate to scenarios that cannot be supported in [[?WEBRTC]] without extensions.
Participants in a mesh exchange large files without disruption to audio/video sessions. It is also possible for a participant to send a large file to a user who is not currently online. Supporting this use case adds the following requirements:
Requirement ID | Description |
---|---|
N10 | It must be possible for the user agent to initiate transfer of a large file with a single API operation. |
N11 | The application must be able to signal backpressure (flow control) when receiving data. It must also receive a backpressure signal when sending data. |
N12 | It must be possible for the user agent to transfer data utilizing a congestion control algorithm that does not compete aggressively with audio/video communications. |
N13 | It must be possible to support data exchange in a web, service, or shared worker. Support for service workers allows the page to issue a fetch() which can be resolved in the service worker. |
N24 | Content Security Policy (CSP) support for WebRTC. |
References:
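The sender side of N11 can be approximated today by watching the amount of queued data. The following is a minimal sketch under that assumption: `SendQueue` is a hypothetical interface shaped after the `bufferedAmount` attribute and `send()` method of a data channel, and `pumpChunks` sends until a caller-chosen high-water mark is reached, returning the offset at which to resume once the queue drains.

```typescript
// Hypothetical minimal view of a channel: how much is queued, and a way to send.
interface SendQueue {
  readonly bufferedAmount: number;
  send(chunk: Uint8Array): void;
}

// Send chunks of `data` starting at `offset` until everything is sent or the
// queue holds at least `highWaterMark` bytes. Returns the next unsent offset.
function pumpChunks(
  queue: SendQueue,
  data: Uint8Array,
  offset: number,
  chunkSize: number,
  highWaterMark: number,
): number {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  let next = offset;
  while (next < data.length && queue.bufferedAmount < highWaterMark) {
    const end = Math.min(next + chunkSize, data.length);
    queue.send(data.subarray(next, end));
    next = end;
  }
  return next;
}
```

In a browser, the caller would typically invoke `pumpChunks` again from a handler for the channel's `bufferedamountlow` event. This covers only the sending direction; signaling backpressure when receiving, as N11 also requires, is not addressed by this sketch.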
Game streaming involves the sending of audio and video (potentially at high resolution and framerate) to the recipient, along with data being sent in the opposite direction. Games can be streamed either from a cloud service (client/server), or from a peer game console (P2P). It is highly desirable that media flow without interruption, and that game players not reveal their location to each other. Even in the case of games streamed from a cloud service, it can be desirable for players to be able to communicate with each other directly (via chat, audio or video).
Requirement ID | Description |
---|---|
N15 | The application must be able to take steps to ensure a low and consistent latency for audio, video and data under varying network conditions. This may include tweaking of transport parameters for both media and data. |
N36 | An application that is only receiving but not sending media or data can operate efficiently without access to camera or microphone. |
N37 | It must be possible for the user agent's receive pipeline to process video at high resolution and framerate (e.g. by controlling hardware acceleration if necessary). |
N38 | The application must be able to control the jitter buffer and rendering delay. This requirement is addressed by jitterBufferTarget, defined in [[?WebRTC-Extensions]] Section 6. |
Experience: Microsoft's Xbox Cloud Gaming and NVIDIA's GeForce NOW are examples of this use case, with media transported using RTP or RTCDataChannel.
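As a small illustration of N38, the sketch below applies a desired jitter buffer target to an object exposing a `jitterBufferTarget` attribute. The `JitterTargetHolder` interface and `applyJitterTarget` helper are hypothetical, and the 0 to 4000 millisecond range used for clamping is an assumption about the extension's accepted values that should be checked against [[?WebRTC-Extensions]].

```typescript
// Hypothetical minimal shape of a receiver that supports a jitter buffer target.
interface JitterTargetHolder {
  jitterBufferTarget: number | null;
}

// Assumed upper bound in milliseconds (see the accompanying text).
const MAX_JITTER_TARGET_MS = 4000;

// Clamp the desired target into the assumed valid range and apply it.
// Passing null (or NaN) clears the target so the user agent chooses its own.
function applyJitterTarget(
  holder: JitterTargetHolder,
  desiredMs: number | null,
): number | null {
  if (desiredMs === null || Number.isNaN(desiredMs)) {
    holder.jitterBufferTarget = null;
    return null;
  }
  const clamped = Math.min(Math.max(desiredMs, 0), MAX_JITTER_TARGET_MS);
  holder.jitterBufferTarget = clamped;
  return clamped;
}
```

A game streaming client would likely request a small target to keep latency low, accepting that the user agent treats the value as a hint rather than a guarantee.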
There are streaming applications, such as auctions, that require large scale as well as ultra low latency. Live audio, video and data are sent to thousands of recipients. Limited interactivity may be supported, such as capturing audio/video from auction bidders. Both the media senders and receivers may be behind a NAT. Peer-to-peer relays are used to improve scalability, with ingestion, distribution and fanout requiring unreliable/unordered transport to reduce latency, and support for retransmission and forward error correction to provide robustness. In this use case, low latency is more important than media quality, so low latency congestion control algorithms are required, and latency-inducing effects such as head of line blocking are to be avoided. Reception of media via events is problematic, due to lack of coupling between the event loop and the receive window.
Requirement ID | Description |
---|---|
N15 | The application must be able to take steps to ensure a low and consistent latency for audio, video and data under varying network conditions. This may include tweaking of transport parameters for both media and data. |
N39 | A user-agent must be able to forward media received from a peer to another peer. Applications require access to encoded chunk metadata as well as information from the RTP header to provide for timing, media configuration and congestion control. This includes a mechanism for a relaying peer to obtain a bandwidth estimate. |
Experience: |pipe| and Dolby are examples of this use case, with media transported via RTP.
An IoT sensor maintains a long-term connection and seeks to minimize power consumption. Some sensor data may need to be sent reliably and in order, while other data can be sent unreliably and unordered, or in a partially reliable manner. Such IoT sensors may also produce realtime video or audio data for remote users that is privacy sensitive and may only be accessed by selected devices. This use case adds the following requirements:
Requirement ID | Description |
---|---|
N14 | The application must be able to minimize ICE connectivity checks. |
N15 | The application must be able to take steps to ensure a low and consistent latency for audio, video and data under varying network conditions. This may include tweaking of transport parameters for both media and data. |
N16 | It must be possible to send arbitrary data reliably, unreliably or partially reliably with a specific maximum number of retransmissions or a specific maximum timeout. |
N17 | It must be possible to send arbitrary data ordered or unordered. |
N24 | Content Security Policy (CSP) support for WebRTC. |
N33 | Pre-existing peers must be able to (re-)establish a connection without access to external services in the event of the local network becoming isolated from the wider network, without compromising e2e security but possibly leveraging pre-shared tokens from a previous connection. |
References:
Mailing list discussion
Experience: Building a respiration monitor that works locally during an internet outage but is also available remotely when the internet connection returns.
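The delivery modes named in N16 and N17 correspond to the `ordered`, `maxRetransmits` and `maxPacketLifeTime` members of the data channel options in [[?WEBRTC]], where the last two are mutually exclusive. The sketch below builds such an options object from a single description of the desired reliability; the `Reliability` type and `channelOptions` helper are hypothetical names introduced for illustration.

```typescript
// Desired reliability: fully reliable, a retransmission limit, or a lifetime limit.
type Reliability =
  | { kind: "reliable" }
  | { kind: "maxRetransmits"; count: number }
  | { kind: "maxLifetime"; milliseconds: number };

// Subset of data channel options relevant to ordering and reliability.
interface ChannelOptions {
  ordered: boolean;
  maxRetransmits?: number;
  maxPacketLifeTime?: number;
}

// Both limits are unsigned 16-bit values in the WebRTC API.
function assertUnsignedShort(name: string, value: number): void {
  if (!Number.isInteger(value) || value < 0 || value > 65535) {
    throw new RangeError(`${name} must be an integer in [0, 65535]`);
  }
}

// Build options that never set both limits at once.
function channelOptions(ordered: boolean, reliability: Reliability): ChannelOptions {
  switch (reliability.kind) {
    case "reliable":
      return { ordered };
    case "maxRetransmits":
      assertUnsignedShort("maxRetransmits", reliability.count);
      return { ordered, maxRetransmits: reliability.count };
    case "maxLifetime":
      assertUnsignedShort("maxPacketLifeTime", reliability.milliseconds);
      return { ordered, maxPacketLifeTime: reliability.milliseconds };
  }
}

// Example use in a browser (not executed here):
//   pc.createDataChannel("telemetry",
//     channelOptions(false, { kind: "maxRetransmits", count: 0 }));
```

A sensor might use an ordered, reliable channel for alarms and an unordered channel with zero retransmissions for periodic readings that are superseded by the next sample.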
New decentralized applications provide P2P services and client-server services for consumption in a browser.
The differences in transport layer semantics make it difficult to share code between the two modes.
This use case has not completed a Call for Consensus (CfC).
Requirement ID | Description |
---|---|
N34 | Ability to intercept the fetch API and service it over a P2P link. One way to do this would be to support data exchange in service workers which can already intercept fetch. |
Experience: Both |pipe| and [Matrix] have implemented systems of this sort.
A virtual reality gaming service utilizing a centralized conferencing server wants to synchronize data with media, using an existing Selective Forwarding Unit (SFU) to distribute the data. This use case adds the following requirements:
Requirement ID | Description |
---|---|
N23 | The user agent must be able to send data synchronized with audio and video. |
N24 | Content Security Policy (CSP) support for WebRTC. |
References:
Mailing list discussion
A communications service that manipulates captured media prior to encoding and after decoding to provide effects including:
This use case requires manipulation of raw media from both local and remote sources. Since media processing can be CPU intensive, enabling it to occur off the main thread is important, as is enabling the processing to take advantage of the GPU. This use case adds the following requirements:
This use case also requires that the user agent provide non-discriminatory implementations of face tracking and body tracking algorithms that can be efficiently used by the application. This, however, is outside the scope of this document.
Requirement ID | Description |
---|---|
N18 | The application must be able to obtain raw media from the capture device in desired formats. |
N19 | The application must be able to insert processed frames into the outgoing media path. |
N20 | The application must be able to obtain decoded media from the remote party. |
N21 | It must be possible to efficiently share media between the main thread and worker threads. |
N22 | It must be possible to do efficient media manipulation in worker threads. |
N24 | Content Security Policy (CSP) support for WebRTC. |
References:
Cloud video conferencing systems have no need to be able to access the cleartext media and text flowing through their servers. Some of these conferencing services desire to be able to promote trust by explicitly showing they do not have access to contents of their users' calls. They are trusted to connect the right people to the conference and to route the packets but they are not trusted to access the audio and video media or text in the call.
Solutions to this problem fall into two major categories: one where the JavaScript comes from a source trusted to see the media contents, and one where it does not.
There are many cases where a system such as WebEx is trusted to connect the members of a conference but has no need to access the contents of the conference. This is true of the majority of conferencing systems on the web today. Just to highlight the scope of this requirement, there are more minutes of WebRTC that are used in conferences where the servers have no need to access the contents (e.g. where audio is forwarded rather than mixed) than any other use of WebRTC audio by orders of magnitude. This is one of the primary use cases for WebRTC audio and accounts for billions of minutes per month of potential use of WebRTC.
In this use case, the JavaScript comes from the operator of the conference bridge. The isolated media features of WebRTC can prevent the JavaScript from accessing the media and the identity features are used to provide a user interface that allows the user to know they are connected to the correct conference. The goal is for the end users to be able to see the contents, but the web service that provides the JS and the media switching bridges and Selective Forwarding Units (SFUs) cannot access the contents (audio, video, text). The browser may choose to reveal some metadata, such as the audio power level, to the media server, in order to support functions like speaker switching.
For small groups (fewer than 20 participants) the SFU could also run within the browser, further reducing the dependency on costly centralized servers with management functions running within a web or service worker.
A possible solution to this problem is for the browser to negotiate end-to-end encryption keys which are not revealed to the JavaScript.
Security requirements relating to this use case are discussed in [[?MLS-ARCH]], and include the following:
Requirement ID | Description |
---|---|
N13 | It must be possible to support data exchange in a web, service, or shared worker. Support for service workers allows the page to issue a fetch() which can be resolved in the service worker. |
N25 | Only current group members can receive media or text sent to the group. |
N26 | A group member cannot send media or text that appears to be from another group member. |
N27 | The conference server must not have access to cleartext media or text or to the identity of group members. |
N28 | Perfect Forward Secrecy (PFS): access to encrypted traffic as well as all current keying material does not compromise the secrecy of media or text older than the oldest key of a compromised client. |
N29 | Post Compromise Security (PCS). Protection against past or future device compromise. |
N35 | A group member can encrypt and send copies of the realtime encoded media directly to multiple group members without re-encoding for each recipient (to reduce resource usage). |
Note that the requirements from the Funny Hats use case also apply here.
There are use cases where it is desirable to transmit live encoded media from a non-WebRTC source over an RTP connection to a WebRTC-compatible endpoint. An example is traffic cameras that serve video over HTTP using "long poll" or similar mechanisms, but do not support WebRTC.
The source may permit dynamic reconfiguration of its resolution or frame rate in order to produce an outgoing video stream with the desired characteristics.
This use case has completed a Call for Consensus (CfC) [[?CFC-One-Way]] but has unresolved issues.
Requirement ID | Description |
---|---|
N40 | An application can create an outgoing WebRTC connection without activating an encoder. |
N41 | An application can create encoded video frames from encoded data and metadata, and enqueue them on an outgoing WebRTC connection. |
N42 | The WebRTC connection can generate signals indicating the desired bandwidth, and surface those to the application. |
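One way an application might act on the bandwidth signal described in N42 is to reconfigure the source to the richest setting that fits the estimate. The following is a minimal sketch under stated assumptions: the `SourceConfig` shape, the `pickConfig` helper and the 0.85 headroom factor are all hypothetical and are not defined by this document.

```typescript
// Hypothetical description of one configuration the source can produce.
interface SourceConfig {
  width: number;
  height: number;
  frameRate: number;
  bitrateBps: number;
}

// Pick the highest-bitrate configuration that fits within the estimate after
// reserving some headroom. If none fits, fall back to the lowest-bitrate one.
function pickConfig(
  configs: readonly SourceConfig[],
  estimateBps: number,
  headroom: number = 0.85,
): SourceConfig | null {
  const budget = estimateBps * headroom;
  let best: SourceConfig | null = null;
  let lowest: SourceConfig | null = null;
  for (const c of configs) {
    if (lowest === null || c.bitrateBps < lowest.bitrateBps) lowest = c;
    if (c.bitrateBps <= budget && (best === null || c.bitrateBps > best.bitrateBps)) {
      best = c;
    }
  }
  return best !== null ? best : lowest;
}
```

The same selection logic could be applied to stored media that is available in multiple bandwidths, provided switches happen only at points where the media allows it.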
There are use cases with stored pre-encoded media where transmission as part of a WebRTC RTP session is desirable. These include:
The pre-encoded media may be available in multiple bandwidths, and switching between them may be possible at certain points in the media.
This use case has completed a Call for Consensus (CfC) [[?CFC-One-Way]] but has unresolved issues.
Requirement ID | Description |
---|---|
N41 | An application can create encoded video frames from encoded data and metadata, and enqueue them on an outgoing WebRTC connection. |
N42 | The WebRTC connection can generate signals indicating the desired bandwidth, and surface those to the application. |
N43 | The application can modify metadata on outgoing frames so that they fit smoothly within the expected sequence of timestamps and sequence numbers. |
N44 | The application can signal the WebRTC encoder when resuming live transmission in such a way that generated frames fit smoothly within the expected sequence of timestamps and sequence numbers. |
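The continuity asked for in N43 and N44 amounts to modular arithmetic on RTP fields: sequence numbers wrap at 16 bits and timestamps at 32 bits. The sketch below shows one way to rewrite frame metadata when the source of frames changes. The `FrameTiming` shape and `ContinuityRewriter` class are hypothetical, and the sketch simplifies by assuming that each frame consumes exactly one sequence number, which does not hold when a frame spans several packets.

```typescript
const SEQ_MOD = 2 ** 16; // RTP sequence numbers are 16 bits wide
const TS_MOD = 2 ** 32; // RTP timestamps are 32 bits wide

// Hypothetical per-frame metadata (one sequence number per frame assumed).
interface FrameTiming {
  sequenceNumber: number;
  rtpTimestamp: number;
}

// Non-negative remainder, so that negative offsets wrap correctly.
function mod(a: number, m: number): number {
  return ((a % m) + m) % m;
}

class ContinuityRewriter {
  private seqOffset = 0;
  private tsOffset = 0;
  private last: FrameTiming | null = null;
  private pendingSwitch = false;

  // Call before the first frame of a new source (e.g. stored media to live).
  switchSource(): void {
    this.pendingSwitch = true;
  }

  // Rewrite one frame. On the first frame after a switch, offsets are chosen
  // so the output continues one sequence number and `tsGap` ticks after the
  // previous output frame.
  rewrite(input: FrameTiming, tsGap: number): FrameTiming {
    if (this.pendingSwitch && this.last !== null) {
      this.seqOffset = mod(this.last.sequenceNumber + 1 - input.sequenceNumber, SEQ_MOD);
      this.tsOffset = mod(this.last.rtpTimestamp + tsGap - input.rtpTimestamp, TS_MOD);
    }
    this.pendingSwitch = false;
    const out: FrameTiming = {
      sequenceNumber: mod(input.sequenceNumber + this.seqOffset, SEQ_MOD),
      rtpTimestamp: mod(input.rtpTimestamp + this.tsOffset, TS_MOD),
    };
    this.last = out;
    return out;
  }
}
```

With a 90 kHz video clock and 30 frames per second, a `tsGap` of 3000 ticks keeps the spacing across the switch equal to one frame interval.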
There are use cases when we have pre-encoded media (either dynamically generated or stored) that we wish to process in the same way as one processes media coming in over a PeerConnection. This will typically involve decoding with a WebRTC decoder and generating a MediaStreamTrack.
Some use cases that need this functionality are:
Alternative implementations for these use cases may involve decoding data using WebCodecs and generating a MediaStreamTrack using VideoTrackGenerator. However, this means that JavaScript will have to handle buffers containing raw media, which may not be optimal for speedy processing.
This use case has completed a Call for Consensus (CfC) [[?CFC-One-Way]] but has unresolved issues.
Requirement ID | Description |
---|---|
N45 | An application can create an incoming WebRTC connection to accept frames as if they were coming in over RTP, without creating an RTP transport. |
N46 | An application can create encoded video frames from encoded data and metadata, and enqueue them on an incoming WebRTC connection. |
N47 | The WebRTC connection can generate signals indicating demands for keyframes, and surface those to the application. |
This section summarizes the requirements arising from the use-cases included in this document.
Requirement ID | Description |
---|---|
N01 | The user agent can control candidate gathering and pruning, limiting the networks on which candidates are gathered, the types of candidates, etc. |
N02 | The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment. |
N03 | Congestion control must be able to manage audio quality and latency in a fair manner between multiple connections. |
N04 | The ICE agent must be able to maintain multiple candidate pairs and move traffic between them. |
N05 | The ICE agent must be able to take the network cost into account when considering re-routing. |
N06 | The user agent must be able to encode and decode video utilizing temporal scalability and (if supported by the chosen codec) spatial scalability. |
N07 | A user agent can receive audio/video without requiring construction of a corresponding sender object. |
N08 | It is possible to select the sending and/or receiving codec as well as RTCP parameters and header extensions without negotiation. |
N09 | The user agent must be able to control robustness (RTX, RED, FEC) applied to individual simulcast and SVC layers. |
N10 | It must be possible for the user agent to initiate transfer of a large file with a single API operation. |
N11 | The application must be able to signal backpressure (flow control) when receiving data. It must also receive a backpressure signal when sending data. |
N12 | It must be possible for the user agent to transfer data utilizing a congestion control algorithm that does not compete aggressively with audio/video communications. |
N13 | It must be possible to support data exchange in a web, service, or shared worker. Support for service workers allows the page to issue a fetch() which can be resolved in the service worker. |
N14 | The application must be able to minimize ICE connectivity checks. |
N15 | The application must be able to take steps to ensure a low and consistent latency for audio, video and data under varying network conditions. This may include tweaking of transport parameters for both media and data. |
N16 | It must be possible to send arbitrary data reliably, unreliably or partially reliably with a specific maximum number of retransmissions or a specific maximum timeout. |
N17 | It must be possible to send arbitrary data ordered or unordered. |
N18 | The application must be able to obtain raw media from the capture device in desired formats. |
N19 | The application must be able to insert processed frames into the outgoing media path. |
N20 | The application must be able to obtain decoded media from the remote party. |
N21 | It must be possible to efficiently share media between the main thread and worker threads. |
N22 | It must be possible to do efficient media manipulation in worker threads. |
N23 | The user agent must be able to send data synchronized with audio and video. |
N24 | Content Security Policy (CSP) support for WebRTC. |
N25 | Only current group members can receive media or text sent to the group. |
N26 | A group member cannot send media or text that appears to be from another group member. |
N27 | The conference server must not have access to cleartext media or text or to the identity of group members. |
N28 | Perfect Forward Secrecy (PFS): access to encrypted traffic as well as all current keying material does not compromise the secrecy of media or text older than the oldest key of a compromised client. |
N29 | Post Compromise Security (PCS). Protection against past or future device compromise. |
N30 | The user agent must provide the ability to re-establish media after an interruption. |
N31 | The user agent must provide notification of a media interruption caused by the OS (e.g. GSM incoming call) (cf. on hold music), and support the use of a datachannel to inform the remote peer of the interruption. |
N32 | The user agent must provide the ability to 'park' a connection such that it can be retrieved and continued by a newly loaded page to prevent accidental 'browsing away' from dropping a call irretrievably. |
N33 | Pre-existing peers must be able to (re-)establish a connection without access to external services in the event of the local network becoming isolated from the wider network, without compromising e2e security but possibly leveraging pre-shared tokens from a previous connection. |
N34 | Ability to intercept the fetch API and service it over a P2P link. One way to do this would be to support data channels in Service Workers which can already intercept fetch. |
N35 | A group member can encrypt and send copies of the realtime encoded media directly to multiple group members without re-encoding for each recipient (to reduce resource usage). |
N36 | An application that is only receiving but not sending media or data can operate efficiently without access to camera or microphone. |
N37 | It must be possible for the user agent's receive pipeline to process video at high resolution and framerate (e.g. by controlling hardware acceleration if necessary). |
N38 | The application must be able to control the jitter buffer and rendering delay. This requirement is addressed by jitterBufferTarget, defined in [[?WebRTC-Extensions]] Section 6. |
N39 | A user-agent must be able to forward media received from a peer to another peer. Applications require access to encoded chunk metadata as well as information from the RTP header to provide for timing, media configuration and congestion control. This includes a mechanism for a relaying peer to obtain a bandwidth estimate. |
N40 | An application can create an outgoing WebRTC connection without activating an encoder. |
N41 | An application can create encoded video frames from encoded data and metadata, and enqueue them on an outgoing WebRTC connection. |
N42 | The WebRTC connection can generate signals indicating the desired bandwidth, and surface those to the application. |
N43 | The application can modify metadata on outgoing frames so that they fit smoothly within the expected sequence of timestamps and sequence numbers. |
N44 | The application can signal the WebRTC encoder when resuming live transmission in such a way that generated frames fit smoothly within the expected sequence of timestamps and sequence numbers. |
N45 | An application can create an incoming WebRTC connection to accept frames as if they were coming in over RTP, without creating an RTP transport. |
N46 | An application can create encoded video frames from encoded data and metadata, and enqueue them on an incoming WebRTC connection. |
N47 | The WebRTC connection can generate signals indicating demands for keyframes, and surface those to the application. |
Requirements N40-N47 have unresolved comments from a Call for Consensus (CfC).