WebDriver BiDi

Editor’s Draft,

This version:
https://w3c.github.io/webdriver-bidi/
Issue Tracking:
GitHub
Inline In Spec

Abstract

This document defines the BiDirectional WebDriver Protocol, a mechanism for remote control of user agents.

Status of this document

This is a public copy of the editors’ draft. It is provided for discussion only and may change at any moment. Its publication here does not imply endorsement of its contents by W3C. Don’t cite this document other than as work in progress.

GitHub Issues are preferred for discussion of this specification.

This document was produced by the Browser Testing and Tools Working Group.

This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 March 2019 W3C Process Document.

1. Introduction

This section is non-normative.

WebDriver defines a protocol for introspection and remote control of user agents. This specification extends WebDriver by introducing bidirectional communication. In place of the strict command/response format of WebDriver, this permits events to stream from the user agent to the controlling software, better matching the evented nature of the browser DOM.

2. Protocol

This section defines the basic concepts of the WebDriver BiDi protocol. These terms are distinct from their representation at the transport layer.

This specification uses Concise Data Definition Language (CDDL) [RFC8610] to describe the format of messages transmitted between the local end and remote end.

Note: These definitions do not form any normative requirements, rather they are used elsewhere as part of such requirements.

2.1. Modules

The WebDriver BiDi protocol is organized into modules. A module is defined by:

Each module represents a collection of related commands and events pertaining to a certain aspect of the user agent. For example, a module might contain functionality for inspecting and manipulating the DOM, or for script execution.

2.2. Commands

A command is an asynchronous operation, requested by the local end and run on the remote end, resulting in either a result or an error being returned to the local end. Multiple commands can run at the same time, and commands can potentially be long-running. As a consequence, commands can finish out-of-order.
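
Because commands can finish out of order, a local end needs some way to match each result to the command that produced it. The following is a minimal sketch of such bookkeeping, assuming the integer "id" field described in the Transport section is used for correlation. The class name PendingCommands and its methods are hypothetical and are not defined by this specification.

```python
from typing import Any, Dict, Tuple


class PendingCommands:
    """Hypothetical local-end bookkeeping; not defined by this specification."""

    def __init__(self) -> None:
        self._next_id = 0
        self._pending: Dict[int, str] = {}

    def send(self, method: str, params: Dict[str, Any]) -> Dict[str, Any]:
        # Allocate a fresh id and remember which command it belongs to.
        command_id = self._next_id
        self._next_id += 1
        self._pending[command_id] = method
        return {"id": command_id, "method": method, "params": params}

    def on_response(self, response: Dict[str, Any]) -> Tuple[str, Any]:
        # Responses may arrive in any order, so look the command up by id.
        method = self._pending.pop(response["id"])
        return method, response.get("result")
```

With this structure, a response to the second command can be handled before the response to the first without confusing the two.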

Each concrete command type is defined by:

Command parameter types and return types are defined in this specification using CDDL.

The following table of commands lists the available commands by module and name.

Module Name    Command Name    Command
session        status          session.status

2.3. Events

An event is a notification, sent by the remote end to the local end, signaling that something of interest has occurred on the remote end.

Each concrete event type is defined by:

Event parameter types are defined in this specification using CDDL.

2.4. Processing Model

To process a command given a qualified command name qualified command name, and parameters parameters:
  1. If qualified command name does not contain exactly one U+002E FULL STOP character (.), return an Error with error code unknown command.

  2. Let module name be the bytes in qualified command name preceding the U+002E FULL STOP character.

  3. Let command name be the bytes in qualified command name following the U+002E FULL STOP character.

  4. Let command be the command listed in the table of commands whose module is module name and whose name is command name.

  5. If there is no such command, return an Error with error code unknown command.

  6. If parameters does not match the CDDL specification given by command’s parameter type, then return an Error with error code invalid argument.

  7. Let result be the result of trying to run the remote end steps for command with parameters.

  8. Assert: result matches the CDDL specification given by command’s return type.

  9. Return success with data result.
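
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the Success, Error and Command types, the COMMANDS table and the params_valid callable (standing in for a real CDDL validator) are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple, Union


@dataclass
class Error:
    code: str


@dataclass
class Success:
    data: Any


@dataclass
class Command:
    # Hypothetical stand-ins for a CDDL check and the remote end steps.
    params_valid: Callable[[Any], bool]
    remote_end_steps: Callable[[Any], Union[Success, Error]]


def session_status(_params: Any) -> Union[Success, Error]:
    return Success({"ready": True, "message": "ready"})


# Table of commands keyed by (module name, command name).
COMMANDS: Dict[Tuple[str, str], Command] = {
    ("session", "status"): Command(
        params_valid=lambda p: p == {},  # the parameter type is the empty map
        remote_end_steps=session_status,
    ),
}


def process_command(qualified_name: str, params: Any) -> Union[Success, Error]:
    # Step 1: require exactly one U+002E FULL STOP character.
    if qualified_name.count(".") != 1:
        return Error("unknown command")
    # Steps 2-3: split into module name and command name.
    module_name, command_name = qualified_name.split(".")
    # Steps 4-5: look the command up in the table of commands.
    command = COMMANDS.get((module_name, command_name))
    if command is None:
        return Error("unknown command")
    # Step 6: validate the parameters.
    if not command.params_valid(params):
        return Error("invalid argument")
    # Step 7: "trying" means an Error from the remote end steps is returned as-is.
    result = command.remote_end_steps(params)
    if isinstance(result, Error):
        return result
    # Step 8 (the assertion on the return type) is omitted in this sketch.
    # Step 9: return success with the result data.
    return Success(result.data)
```

For example, process_command("session.status", {}) yields a Success, while a name such as "status" or "a.b.c" yields an Error with code unknown command.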

3. Transport

Message transport is provided using the WebSocket protocol. [RFC6455]

Note: In the terms of the WebSocket protocol, the local end is the client and the remote end is the server / remote host.

Note: The encoding of commands and events as messages is similar to JSON-RPC, but this specification does not normatively reference it. [JSON-RPC] The normative requirements on remote ends are instead given as a precise processing model, while no normative requirements are given for local ends.

A WebSocket listener is a network endpoint that is able to accept incoming WebSocket connections.

A WebSocket listener has a host, a port, a secure flag, and a list of WebSocket resources.

When a WebSocket listener listener is created, a remote end must start to listen for WebSocket connections on the host and port given by listener’s host and port. If listener’s secure flag is set, then connections established from listener must be TLS encrypted.

A remote end has a set of WebSocket listeners active listeners, which is initially empty.

A WebDriver session has a WebSocket connection which is a network connection that follows the requirements of the WebSocket protocol.

When a client establishes a WebSocket connection connection by connecting to one of the set of active listeners listener, the implementation must proceed according to the WebSocket server-side requirements, with the following steps run when deciding whether to accept the incoming connection:

  1. Let resource name be the resource name from reading the client’s opening handshake. If resource name is not in listener’s list of WebSocket resources, then stop running these steps and act as if the requested service is not available.

  2. Get a session ID for a WebSocket resource with resource name and let session id be that value. If session id is null then stop running these steps and act as if the requested service is not available.

  3. If there is a session in the list of active sessions with session id as its session ID then let session be that session. Otherwise stop running these steps and act as if the requested service is not available.

  4. Run any other implementation-defined steps to decide if the connection should be accepted, and if it is not, stop running these steps and act as if the requested service is not available.

  5. Otherwise set session’s WebSocket connection to connection, and proceed with the WebSocket server-side requirements when a server chooses to accept an incoming connection.

Issue: Do we support > 1 connection for a single session?

When a WebSocket message has been received for a WebSocket connection connection with type type and data data, a remote end must handle an incoming message given connection, type and data.

When the WebSocket closing handshake is started or when the WebSocket connection is closed for a WebSocket connection connection, a remote end must handle a connection closing given connection.

Note: Both conditions are needed because it is possible for a WebSocket connection to be closed without a closing handshake.

To construct a WebSocket resource name given a session session:

  1. Return the result of concatenating the string "/session/" with session’s session ID.

To construct a WebSocket URL given a WebSocket listener listener and session session:

  1. Let resource name be the result of constructing a WebSocket resource name given session.

  2. Return a WebSocket URI constructed with host set to listener’s host, port set to listener’s port, path set to resource name, following the wss-URI construct if listener’s secure flag is set and the ws-URI construct otherwise.
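
As a rough illustration, the two construction algorithms above might look like this in Python. The Listener class and the function names are hypothetical, and the sketch assumes the host is a DNS name or IPv4 address (an IPv6 literal would additionally need square brackets).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Listener:
    # Mirrors the host, port, secure flag and resource list of a WebSocket listener.
    host: str
    port: int
    secure: bool
    resources: List[str] = field(default_factory=list)


def construct_resource_name(session_id: str) -> str:
    # Concatenate "/session/" with the session ID.
    return "/session/" + session_id


def construct_websocket_url(listener: Listener, session_id: str) -> str:
    # Use the wss scheme when the secure flag is set, ws otherwise.
    scheme = "wss" if listener.secure else "ws"
    path = construct_resource_name(session_id)
    return f"{scheme}://{listener.host}:{listener.port}{path}"
```

The port 9222 used when trying this out is only an example value; the specification leaves host, port and secure flag implementation-defined.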

To get a session ID for a WebSocket resource given resource name:

  1. If resource name doesn’t begin with the byte string "/session/", return null.

  2. Let session id be the bytes in resource name following the "/session/" prefix.

  3. If session id is not the string representation of a UUID, return null.

  4. Return session id.
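
A possible Python rendering of these steps is shown below. The function name is hypothetical. One assumption is made explicit in the code: "the string representation of a UUID" is interpreted as the canonical hyphenated form from [RFC4122], which is stricter than what Python's uuid.UUID constructor accepts on its own.

```python
import uuid
from typing import Optional

PREFIX = "/session/"


def get_session_id(resource_name: str) -> Optional[str]:
    # Step 1: the resource name must begin with "/session/".
    if not resource_name.startswith(PREFIX):
        return None
    # Step 2: everything after the prefix is the candidate session ID.
    session_id = resource_name[len(PREFIX):]
    # Step 3: reject anything that is not a UUID string.
    try:
        parsed = uuid.UUID(session_id)
    except ValueError:
        return None
    # Assumption: only the canonical 8-4-4-4-12 form counts. uuid.UUID also
    # accepts braces, a "urn:uuid:" prefix and missing hyphens.
    if str(parsed) != session_id.lower():
        return None
    # Step 4
    return session_id
```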

To start listening for a WebSocket connection given a session session:
  1. If there is an existing WebSocket listener in the set of active listeners which the remote end would like to reuse, let listener be that listener. Otherwise let listener be a new WebSocket listener with implementation-defined host, port, secure flag, and an empty list of WebSocket resources.

  2. Let resource name be the result of constructing a WebSocket resource name given session.

  3. Append resource name to the list of WebSocket resources for listener.

  4. Append listener to the remote end's active listeners.

  5. Return listener.
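
The bookkeeping in these steps can be sketched as follows. This example only tracks state and does not open a network socket. The RemoteEnd and Listener classes are hypothetical, the reuse policy (always reuse the first listener) is one arbitrary choice among those the specification permits, and the host and port values are examples of implementation-defined settings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Listener:
    host: str
    port: int
    secure: bool
    resources: List[str] = field(default_factory=list)


class RemoteEnd:
    def __init__(self) -> None:
        # The set of active listeners, initially empty.
        self.active_listeners: List[Listener] = []

    def start_listening(self, session_id: str) -> Listener:
        # Step 1: reuse an existing listener if there is one (a policy
        # chosen for this sketch), otherwise create a new one.
        if self.active_listeners:
            listener = self.active_listeners[0]
        else:
            listener = Listener(host="localhost", port=9222, secure=False)
        # Steps 2-3: register the resource name for this session.
        listener.resources.append("/session/" + session_id)
        # Step 4: active listeners is a set, so avoid adding a duplicate.
        if not any(existing is listener for existing in self.active_listeners):
            self.active_listeners.append(listener)
        # Step 5
        return listener
```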

Note: An intermediary node handling multiple sessions can use one or many WebSocket listeners. WebDriver defines that an endpoint node supports at most one session at a time, so it’s expected to only have a single listener.

Note: For an endpoint node the host in the above steps will typically be "localhost".

To handle an incoming message given a WebSocket connection connection, type type and data data:
  1. If type is not text, return.

    Issue: Should we instead close connection with status code 1003, or respond with an error?

  2. Assert: data is a scalar value string, because otherwise the WebSocket protocol’s handling of errors in UTF-8-encoded data would already have failed the WebSocket connection.

    Issue: Nothing seems to define what status code is used for UTF-8 errors.

  3. Let parsed be the result of parsing JSON into Infra values given data. If this throws an exception, then respond with an error given connection and error code invalid argument, and finally return.

  4. If any of the following conditions are false:

    1. parsed is a map

    2. parsed["id"] exists and is an integer in the range [0, 2147483647].

      Issue: That’s 2^31 - 1, the largest signed 32-bit integer. Should we allow up to 2^53 - 1, the largest number such that N and N + 1 both have exact representations in a JS Number?

    3. parsed["method"] exists and is a string.

    4. parsed["params"], if it exists, is a map.

    Issue: Should we fail if there are unknown keys in parsed? CDP does, but it’s very unusual for unversioned web platform APIs.

    Then respond with an error given connection and error code invalid argument, and finally return.

  5. Let result be the result of processing a command given parsed["method"] and parsed["params"].

  6. If result is an Error, then respond with an error given connection, result, and parsed["id"], and finally return.

  7. Let response be a new map with the following properties:

    "id"
    The value of parsed["id"]
    "result"
    The value of result
  8. Let serialized be the result of serializing JSON to bytes given response.

  9. Send a WebSocket message comprised of serialized over connection.
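
The message handling above can be approximated in Python as shown below, for a message whose type is already known to be text. Several parts are assumptions rather than anything this specification defines: the function names, the process_command callable (assumed to return a ("success", data) or ("error", code) pair), and in particular the shape of the error message, which this specification does not yet define. The function returns the serialized text instead of sending it, so that it can be tried without a WebSocket connection.

```python
import json
from typing import Any, Callable, Optional, Tuple

MAX_ID = 2147483647  # 2^31 - 1, the upper bound given in step 4.2

# Hypothetical signature: returns ("success", data) or ("error", error code).
ProcessCommand = Callable[[str, Optional[dict]], Tuple[str, Any]]


def is_valid_command_message(parsed: Any) -> bool:
    """Check the conditions listed in step 4."""
    if not isinstance(parsed, dict):
        return False
    command_id = parsed.get("id")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    # Note: a JSON number written as 1.0 parses to a float and is rejected here.
    if isinstance(command_id, bool) or not isinstance(command_id, int):
        return False
    if not 0 <= command_id <= MAX_ID:
        return False
    if not isinstance(parsed.get("method"), str):
        return False
    return "params" not in parsed or isinstance(parsed["params"], dict)


def handle_text_message(data: str, process_command: ProcessCommand) -> str:
    """Return the serialized message that would be sent over the connection."""
    try:
        parsed = json.loads(data)  # step 3
    except json.JSONDecodeError:
        # Hypothetical error shape; the error object format is an open issue.
        return json.dumps({"error": "invalid argument"})
    if not is_valid_command_message(parsed):  # step 4
        return json.dumps({"error": "invalid argument"})
    status, value = process_command(parsed["method"], parsed.get("params"))  # step 5
    if status == "error":  # step 6
        return json.dumps({"id": parsed["id"], "error": value})
    return json.dumps({"id": parsed["id"], "result": value})  # steps 7-8
```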

To respond with an error given a WebSocket connection connection and an error code code:
  1. Form a valid JSON errorObject given code.

  2. Send a WebSocket message comprised of errorObject over connection.

To handle a connection closing given a WebSocket connection connection:
  1. If there is a WebDriver session with connection as its WebSocket connection, set that session’s WebSocket connection to null.

Issue: This should also reset any internal state

Note: This does not end any session.

Issue: Need to hook in to the session ending to allow the UA to close the listener if it wants.

3.1. Establishing a Connection

WebDriver clients opt in to a bidirectional connection by requesting a capability with the name "webSocketUrl" and value true.

This specification defines an additional webdriver capability with the capability name "webSocketUrl".

The additional capability deserialization algorithm for the "webSocketUrl" capability, with parameter value is:
  1. If value is not a boolean, return error with code invalid argument.

  2. Return success with data value.

The matched capability serialization algorithm for the "webSocketUrl" capability, with parameter value is:
  1. If value is false, return success with data null.

  2. Return success with data true.
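
The two capability algorithms above are small enough to sketch directly. The function names are hypothetical, and results are modelled as ("success", data) or ("error", code) pairs purely for illustration.

```python
from typing import Any, Tuple


def deserialize_web_socket_url(value: Any) -> Tuple[str, Any]:
    # Step 1: anything other than a boolean is an invalid argument.
    if not isinstance(value, bool):
        return ("error", "invalid argument")
    # Step 2
    return ("success", value)


def serialize_matched_web_socket_url(value: bool) -> Tuple[str, Any]:
    # Step 1: a false value serializes to null (None in Python).
    if value is False:
        return ("success", None)
    # Step 2
    return ("success", True)
```

Note that requesting "webSocketUrl" with the value false therefore behaves like not requesting the capability at all.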

The WebDriver new session algorithm defined by this specification, with parameters session and capabilities is:
  1. Let webSocketUrl be the result of getting a property named "webSocketUrl" from capabilities.

  2. If webSocketUrl is undefined, return.

  3. Assert: webSocketUrl is true.

  4. Let listener be the result of start listening for a WebSocket connection given session.

  5. Set webSocketUrl to the result of constructing a WebSocket URL given listener and session.

  6. Set a property on capabilities named "webSocketUrl" to webSocketUrl.
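
To illustrate the overall effect of this algorithm, the sketch below shows the boolean capability being replaced by a URL string. The request body follows the shape of the WebDriver new session command. The function name and the construct_url callable, which stands in for starting to listen and constructing the WebSocket URL, are hypothetical, and a missing dictionary key is used to model an undefined property.

```python
from typing import Any, Callable, Dict

# Example body of a WebDriver new session request that opts in to BiDi.
NEW_SESSION_REQUEST = {"capabilities": {"alwaysMatch": {"webSocketUrl": True}}}


def bidi_new_session(
    session_id: str,
    capabilities: Dict[str, Any],
    construct_url: Callable[[str], str],
) -> None:
    # Steps 1-2: nothing to do if the capability is undefined.
    if "webSocketUrl" not in capabilities:
        return
    # Step 3: serialization only ever leaves the value true here.
    assert capabilities["webSocketUrl"] is True
    # Steps 4-6: start listening, construct the URL, and store it back.
    # Both operations are folded into the hypothetical construct_url helper.
    capabilities["webSocketUrl"] = construct_url(session_id)
```

After the algorithm runs, the capabilities returned to the client carry the URL to connect to, instead of the boolean that was requested.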

4. Modules

4.1. session

The session module contains commands and events for monitoring the status of the remote end.

4.1.1. status

The status command returns information about whether a remote end is in a state in which it can create new sessions. The result may additionally include arbitrary meta information that is specific to the implementation.

Parameter Type
      {}
Return Type
      StatusResult = {
         ready: bool,
         message: text,
         * text => any
      }

The remote end steps are:

  1. Let body be a new map with the following properties:

    "ready"
    The remote end’s readiness state.
    "message"
    An implementation-defined string explaining the remote end’s readiness state.
  2. Return success with data body.
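
As an informal illustration, the remote end steps and the StatusResult shape could be modelled as below. Both function names are hypothetical, and matches_status_result is only a loose structural check, not a CDDL validator.

```python
from typing import Any, Dict


def status_remote_end_steps(ready: bool, message: str) -> Dict[str, Any]:
    # Steps 1-2: build the body map. An implementation could add further
    # implementation-specific keys, which StatusResult permits.
    return {"ready": ready, "message": message}


def matches_status_result(value: Any) -> bool:
    """Loosely check a value against the StatusResult definition."""
    return (
        isinstance(value, dict)
        and all(isinstance(key, str) for key in value)
        and isinstance(value.get("ready"), bool)
        and isinstance(value.get("message"), str)
    )
```

A command message for this command would carry "session.status" as its method and an empty map as its params.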

5. Conformance

This specification depends on the Infra Standard. [INFRA]

References

Normative References

[FETCH]
Anne van Kesteren. Fetch Standard. Living Standard. URL: https://fetch.spec.whatwg.org/
[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard. URL: https://infra.spec.whatwg.org/
[RFC4122]
P. Leach; M. Mealling; R. Salz. A Universally Unique IDentifier (UUID) URN Namespace. July 2005. Proposed Standard. URL: https://tools.ietf.org/html/rfc4122
[RFC6455]
I. Fette; A. Melnikov. The WebSocket Protocol. December 2011. Proposed Standard. URL: https://tools.ietf.org/html/rfc6455
[RFC8610]
H. Birkholz; C. Vigano; C. Bormann. Concise Data Definition Language (CDDL): A Notational Convention to Express Concise Binary Object Representation (CBOR) and JSON Data Structures. June 2019. Proposed Standard. URL: https://tools.ietf.org/html/rfc8610
[WEBDRIVER]
Simon Stewart; David Burns. WebDriver. URL: https://w3c.github.io/webdriver/

Informative References

[JSON-RPC]
JSON-RPC Working Group. JSON-RPC 2.0 Specification. 4 January 2013. URL: https://www.jsonrpc.org/specification

Issues Index

Do we support > 1 connection for a single session?
Should we instead close connection with status code 1003, or respond with an error?
Nothing seems to define what status code is used for UTF-8 errors.
That’s 2^31 - 1, the largest signed 32-bit integer. Should we allow up to 2^53 - 1, the largest number such that N and N + 1 both have exact representations in a JS Number?
Should we fail if there are unknown keys in parsed? CDP does, but it’s very unusual for unversioned web platform APIs.
Form a valid JSON errorObject given code.
This should also reset any internal state
Need to hook in to the session ending to allow the UA to close the listener if it wants.