Clear Site Data

Editor’s Draft,

This version:
https://w3c.github.io/webappsec-clear-site-data/
Latest published version:
http://www.w3.org/TR/clear-site-data/
Previous Versions:
http://www.w3.org/TR/2016/WD-clear-site-data-20160720/
Version History:
https://github.com/w3c/webappsec-clear-site-data/commits/master/index.src.html
Feedback:
public-webappsec@w3.org with subject line “[clear-site-data] … message topic …” (archives)
Editor:
(Google Inc.)
Participate:
File an issue (open issues)

Abstract

This document defines an imperative mechanism which allows web developers to instruct a user agent to clear a site’s locally stored data related to a host.

Status of this document

This is a public copy of the editors’ draft. It is provided for discussion only and may change at any moment. Its publication here does not imply endorsement of its contents by W3C. Don’t cite this document other than as work in progress.

Changes to this document may be tracked at https://github.com/w3c/webappsec.

The (archived) public mailing list public-webappsec@w3.org (see instructions) is preferred for discussion of this specification. When sending e-mail, please put the text “clear-site-data” in the subject, preferably like this: “[clear-site-data] …summary of comment…”

This document was produced by the Web Application Security Working Group.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 March 2017 W3C Process Document.

1. Introduction

This section is not normative.

Web applications store data locally on a user’s computer in order to provide functionality while the user is offline, and to increase performance when the user is online. These local caches have significant advantages for both users and developers, but present risks as well.

A user’s data is both sensitive and valuable; web developers ought to take reasonable steps to protect it. One such step would be to encrypt data before storing it. Another would be to remove data from the user’s machine when it is no longer necessary (for example, when the user signs out of the application, or deletes their account).

Site authors can remove data from a number of storage mechanisms via JavaScript, but others are difficult to deal with reliably. Consider cookies, for instance, which can be partially cleared via JavaScript access to document.cookie. HttpOnly cookies, however, can only be removed via a number of Set-Cookie headers in an HTTP response. This, of course, requires exhaustive knowledge of all the cookies set for a host, which can be complicated to ascertain. Cache is still harder; no imperative interface to a browser’s network cache exists, period.

This document defines a new mechanism to deal with removing data from these and other types of local storage, giving web developers the ability to clear out a user’s local cache of data via the Clear-Site-Data HTTP response header.

1.1. Examples

1.1.1. Signing Out

A user signs out of Super Secret Social Network via a CSRF-protected POST to https://supersecretsocialnetwork.example.com/logout, and the site author wishes to ensure that locally stored data is removed as a result.

They can do so by sending the following HTTP header in the response:

Clear-Site-Data: cache, cookies, storage, executionContexts
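
As a minimal server-side sketch of this example: the handler below assumes a response object shaped like Node.js's http.ServerResponse (setHeader, statusCode, end). The names handleLogout and ALL_TYPES are hypothetical, and session teardown and CSRF validation are out of scope.

```javascript
// The four data types defined by this document.
const ALL_TYPES = ["cache", "cookies", "storage", "executionContexts"];

// Hypothetical logout handler. `res` is assumed to behave like Node.js's
// http.ServerResponse; only setHeader, statusCode, and end are used.
function handleLogout(req, res) {
  // Ask the user agent to clear every data type defined in this document.
  res.setHeader("Clear-Site-Data", ALL_TYPES.join(", "));
  res.statusCode = 200;
  res.end("Signed out");
}
```

The header only takes effect when the response is delivered over a secure connection, as described in §3.2 Clear data for response.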

1.1.2. Targeted Clearing

A user signs out of Megacorp Inc.'s site via a CSRF-protected POST to https://megacorp.example.com/logout. Megacorp has a large number of services available as subdomains, so many that it’s not entirely clear which of them would be safe to clear as a response to a logout action. One option would be to simply clear everything, and deal with the fallout. Megacorp’s CEO, however, once lost hours and hours of progress in "Irate Ibexes" due to inadvertent site-data clearing, and so refuses to allow such a sweeping impact to the site’s users.

The developers know, however, that the "Minus" application is certainly safe to clear out. They can target this specific subdomain by including a request to that subdomain as part of the logout landing page (ideally as a CORS-enabled, CSRF-protected POST):

fetch("https://minus.megacorp.example.com/clear-site-data",
      {
          method: "POST",
          mode: "cors",
          headers: new Headers({
              "CSRF": "[insert sekrit token here]"
          })
      });

That endpoint would return proper CORS headers in response to that request’s preflight, and would return the following header for the actual request:

Clear-Site-Data: cache, cookies, storage, executionContexts

1.1.3. Keep Critical Cookies

A user opts out of interest-based advertising via a CSRF-protected POST to https://ads-are-awesome.example.com/optout. The site author wishes to remove DOM-accessible data which might contain tracking information, but needs to ensure that the opt-out cookie which the user has just received isn’t wiped along with it.

They can do so by sending the following HTTP header in the response, which includes all the types except for "cookies":

Clear-Site-Data: cache, storage, executionContexts
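
A server could derive such a value from the full list rather than hard-coding it. The helper below is a hypothetical sketch (clearSiteDataExcept is not defined by this document); it rejects an empty result because the header grammar requires at least one type.

```javascript
// The four data types defined by this document's grammar.
const DATA_TYPES = ["cache", "cookies", "storage", "executionContexts"];

// Hypothetical helper: builds a header value listing every known type
// except those named in `except`.
function clearSiteDataExcept(except = []) {
  const kept = DATA_TYPES.filter((type) => !except.includes(type));
  if (kept.length === 0) {
    // The grammar (1#...) requires at least one list element.
    throw new Error("Clear-Site-Data requires at least one type");
  }
  return kept.join(", ");
}
```

For the opt-out case, clearSiteDataExcept(["cookies"]) produces the value shown above.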

1.1.4. Kill Switch

Super Secret Social Network’s developers learn that the site was vulnerable to cross-site scripting attacks which allowed malicious parties to inject arbitrary code into its origin. They fixed the site, and added a strong Content Security Policy [CSP2] to mitigate the risk going forward, but they can’t be entirely sure that clients are really back to a trustworthy state. Perhaps the attackers found a clever persistence mechanism?

They can reduce the risk of a persistent client-side XSS by sending the following HTTP header in a response to wipe out local sources of data:

Clear-Site-Data: cache, cookies, storage, executionContexts

Note: Installing a Service Worker guarantees that a request will go out to a server every ~24 hours. That update ping would be a wonderful time to send a header like this one in case of catastrophe. [SERVICE-WORKERS]

1.2. Goals

Generally, the goal is to allow web developers more control over the data stored locally by a user agent for their origins. In particular, developers should be able to reliably ensure the following:

  1. Data stored in an origin’s client-side storage mechanisms like [INDEXEDDB], WebSQL, Filesystem, localStorage, and sessionStorage is cleared.

  2. Cookies for an origin’s host are removed [RFC6265].

  3. Web Workers (dedicated and shared) running for an origin are terminated.

  4. Service Workers registered for an origin are terminated and deregistered.

  5. Resources from an origin are removed from the user agent’s local cache.

  6. All of the above can be propagated to the HTTP version of an HTTPS origin.

  7. None of the above can be bypassed by a maliciously active document that retains interesting data in memory, and rewrites it if it’s cleared.

2. Clearing Site Data

Developers may instruct a user agent to clear various types of relevant data by delivering a Clear-Site-Data HTTP response header in response to a request.

2.1. The Clear-Site-Data HTTP Response Header Field

The Clear-Site-Data HTTP response header field sends a signal to the user agent that it ought to remove all data of a certain set of types. The header is represented by the following ABNF [RFC5234]:

Clear-Site-Data = 1#( data-type / extension-type )
data-type = "cache" / "cookies" / "storage" / "executionContexts"
extension-type = 1*( ALPHA / "-" )
; #rule is defined in Section 7 of RFC 7230.
; ALPHA is defined in Appendix B.1 of RFC 5234.

§2.2 Fetch Integration and §3.1 Parsing describe how the Clear-Site-Data header is processed.

The data-type grammar defines an initial set of known data types which can be cleared using this mechanism. See their descriptions below. Future versions of the header can support additional data types, which MUST comply with the extension-type grammar. User agents MUST ignore unknown extension-types when parsing the header.

"cache"

The "cache" type indicates that the server wishes to remove locally cached data associated with the origin of a particular response’s url. This includes the network cache, of course, but will also remove data from various other caches which a user agent implements (prerendered pages, script caches, shader caches, etc.).

Implementation details are in §3.2.3 Clear cache for origin.

When delivered with a response from https://example.com/clear, the following header will cause caches associated with the origin https://example.com to be cleared:

Clear-Site-Data: cache

"cookies"

The "cookies" type indicates that the server wishes to remove cookies associated with the origin of a particular response’s url. Along with cookies, HTTP authentication credentials [RFC7235], and origin-bound tokens such as those defined by Channel ID [CHANNELID] and Token Binding [TOKBIND] are also cleared.

Implementation details are in §3.2.4 Clear cookies for origin.

When delivered with a response from https://example.com/clear, the following header will cause cookies associated with the origin https://example.com to be cleared:

Clear-Site-Data: cookies

"storage"

The "storage" type indicates that the server wishes to remove locally stored data associated with the origin of a particular response’s url. This includes storage mechanisms such as localStorage, sessionStorage, [INDEXEDDB], and [WEBDATABASE], as well as tangentially related mechanisms such as service worker registrations.

Implementation details are in §3.2.5 Clear DOM-accessible storage for origin.

When delivered with a response from https://example.com/clear, the following header will cause DOM-accessible storage for the origin https://example.com to be cleared:

Clear-Site-Data: storage

"executionContexts"

The "executionContexts" type indicates that the server wishes to neuter and reload execution contexts currently rendering the origin of a particular response’s url.

When delivered with a response from https://example.com/clear, the following header will cause execution contexts displaying the origin https://example.com to be neutered and reloaded:

Clear-Site-Data: executionContexts

2.2. Fetch Integration

Monkey patching! Talk with Anne.

If the Clear-Site-Data header is present in an HTTP response received from the network, then data MUST be cleared before rendering the response to the user. That is, after step #14 in the current HTTP-network fetch algorithm, execute the following step:

  1. If credentials flag is set, and response’s header list contains a header named Clear-Site-Data, then execute §3.2 Clear data for response on response.

Note: This happens after Set-Cookie headers are processed. If we clear cookies, we clear all of them. This is intentional, as removing only certain cookies might leave an application in an indeterminate and vulnerable state. Removing specific cookies is best done via expiration using the Set-Cookie header.

Note: While the fetch credentials flag is intended to restrict the modification of cookies, Clear-Site-Data applies the same restriction to all types for the sake of consistency.

3. Algorithms

3.1. Parsing

Given a response, the user agent can parse response’s Clear-Site-Data header, returning a list of data-type tokens, as follows:

  1. Let types be an empty list.

  2. Let header be the result of extracting header list values given Clear-Site-Data and response’s header list.

  3. If header is null or failure, return an empty list.

  4. For each type in header:

    1. If type matches the data-type grammar, append type to types.

  5. Return types.
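
The parsing steps above can be sketched as follows. This is an illustration rather than a conformant implementation: it assumes headers exposes get(name) returning the combined header value or null/undefined when absent (as the Fetch API's Headers object and a Map do), it splits on commas instead of running the full "extracting header list values" algorithm, and it compares type names case-sensitively.

```javascript
// Types this document defines; anything else is ignored by the parser.
const DATA_TYPES = new Set(["cache", "cookies", "storage", "executionContexts"]);

// Sketch of §3.1. `headers` is assumed to expose get(name).
function parseClearSiteData(headers) {
  const types = [];                                  // step 1
  const header = headers.get("Clear-Site-Data");     // step 2 (simplified)
  if (header == null) return types;                  // step 3 (null or absent)
  for (const element of header.split(",")) {         // step 4
    // trim() removes more than optional whitespace; acceptable for a sketch.
    const type = element.trim();
    // Case-sensitive matching is an assumption of this sketch.
    if (DATA_TYPES.has(type)) types.push(type);      // step 4.1
  }
  return types;                                      // step 5
}
```

Unknown extension types simply fall through the membership check, which matches the requirement that user agents ignore them.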

3.2. Clear data for response

Given a response (response), the user agent can clear site data for response as follows:

  1. If response’s url is not an a priori authenticated URL, then abort these steps.

    Some have suggested that this might not be a restriction we want (see Martin Thomson’s public-webappsec post on the topic, for example).

  2. Let types be the result of parsing response’s Clear-Site-Data header.

  3. Let origin be response’s url's origin.

  4. Let browsing contexts be the result of preparing to clear data for origin and types.

  5. For each type in types:

    1. Switch on type and execute the associated algorithm:

      "cache"

      Clear cache for origin.

      "cookies"

      Clear cookies for origin.

      "storage"

      Clear DOM-accessible storage for origin.

  6. Reload browsing contexts.

Note: User agents are encouraged to give web developers some mechanism by which the clearing operation can be debugged. This might take the form of a console message or timeline entry indicating success.
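
The overall control flow of this algorithm can be sketched as below. Everything here is an assumption made for illustration: response is a plain object whose types property stands in for the already-parsed header, ua is a hypothetical object bundling the sub-algorithms of §3.2.1 through §3.2.5, and only https: URLs are treated as a priori authenticated, which is narrower than the Mixed Content definition.

```javascript
// Sketch of §3.2. `response` is { url, types }; `ua` is a hypothetical
// object with prepare, clearCache, clearCookies, clearStorage, and reload.
function clearDataForResponse(response, ua) {
  const url = new URL(response.url);
  // Step 1, simplified: treat only https: as a priori authenticated.
  if (url.protocol !== "https:") return;
  const types = response.types;                      // step 2 (pre-parsed)
  const origin = url.origin;                         // step 3
  const contexts = ua.prepare(origin, types);        // step 4
  const handlers = new Map([                         // step 5
    ["cache", (o) => ua.clearCache(o)],
    ["cookies", (o) => ua.clearCookies(o)],
    ["storage", (o) => ua.clearStorage(o)],
  ]);
  for (const type of types) {
    // "executionContexts" has no handler: steps 4 and 6 cover it.
    const handler = handlers.get(type);
    if (handler) handler(origin);
  }
  ua.reload(contexts);                               // step 6
}
```

Note that sandboxing happens before any clearing and reloading happens after all of it, so a running document has no window in which to rewrite cleared data.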

3.2.1. Prepare to clear origin’s data

Given an origin (origin) and a list of data-type tokens (types), the user agent can prepare to clear origin’s data by executing the following steps. The algorithm returns a list of browsing contexts which have been sandboxed in order to prevent them from recreating cleared data from in-memory JavaScript variables.

  1. Let sandboxed be an empty list.

  2. If types does not contain "executionContexts", return sandboxed.

  3. For each context in the user agent’s set of browsing contexts:

    1. Let document be context’s active document.

    2. If document’s relevant settings object's origin is not origin, continue.

    3. Parse a sandboxing directive using the empty string as the input, and document’s active sandboxing flag set as the output.

    4. Append context to sandboxed.

  4. Return sandboxed.
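
A sketch of these steps over a hypothetical in-memory model follows. Each context is assumed to be an object of the form { document: { origin, sandboxFlags } }; real user agents hold this state internally and none of these names are web-exposed APIs.

```javascript
// Placeholder for "every sandboxing flag set"; purely illustrative.
const ALL_SANDBOX_FLAGS = "all";

// Sketch of §3.2.1 over a hypothetical list of browsing contexts.
function prepareToClear(origin, types, browsingContexts) {
  const sandboxed = [];                                        // step 1
  if (!types.includes("executionContexts")) return sandboxed;  // step 2
  for (const context of browsingContexts) {                    // step 3
    const doc = context.document;                              // step 3.1
    if (doc.origin !== origin) continue;                       // step 3.2
    // Step 3.3: an empty sandboxing directive grants no "allow-" keywords,
    // so every restriction applies.
    doc.sandboxFlags = ALL_SANDBOX_FLAGS;
    sandboxed.push(context);                                   // step 3.4
  }
  return sandboxed;                                            // step 4
}
```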

3.2.2. Reload browsing contexts

Given a list of browsing contexts (contexts), the user agent can reload browsing contexts as follows:

  1. For each context in contexts:

    1. Execute context’s active document's relevant settings object's global object's Location object’s reload().

      This is the simplest thing, but it’s probably reaching a little too far into the documents and mucking with their context. I probably just need to break down and copy/paste the relevant bits from HTML.

3.2.3. Clear cache for origin

Given an origin (origin), the user agent can clear cache for origin as follows:

  1. Let host be origin’s host.

  2. Let cache list be the set of entries from the network cache whose target URI host is identical to host.

  3. For each entry in cache list:

    1. Remove entry from the network cache.

  4. If a user agent implements caches beyond a pure network cache, it MUST remove all entries from those caches which match origin.

    We’re dealing with the network cache here, as defined in [RFC7234], but that’s not nearly everything a user agent caches. How hand-wavey with the vendor-specific section can we be? For instance, Chrome clears out prerendered pages, script caches, WebGL shader caches, WebRTC bits and pieces, address bar suggestion caches, various networking bits that aren’t representations (HSTS/HPKP, SCDH, etc.). Perhaps [STORAGE] will make this clearer?
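
Steps 1 through 3 can be illustrated with a toy cache. The networkCache Map keyed by absolute request URL is an assumption of this sketch; real network caches are internal to the user agent and keyed in implementation-specific ways.

```javascript
// Sketch of §3.2.3 steps 1-3. `networkCache` is a hypothetical Map whose
// keys are absolute request URLs. Returns the keys that were removed.
function clearCacheForOrigin(origin, networkCache) {
  const host = new URL(origin).hostname;             // step 1
  const removed = [];
  for (const key of [...networkCache.keys()]) {      // step 2
    if (new URL(key).hostname === host) {
      networkCache.delete(key);                      // step 3.1
      removed.push(key);
    }
  }
  return removed;
}
```

Because the comparison is on host alone, entries fetched over other schemes or ports of the same host are removed as well, while entries for subdomains are left in place.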

3.2.4. Clear cookies for origin

Given an origin (origin), the user agent can clear cookies for origin as follows:

Note: We remove all the cookies for an entire registered domain, as cookies ignore the same-origin policy, and there’s a distinct risk that we’d leave applications in an ill-defined state if we only cleared cookies for a particular subdomain. Consider accounts.google.com vs mail.google.com, for instance, both of which have cookies that signal a user’s signed-in status.

Note: This algorithm assumes that the user agent has implemented a cookie store (as discussed in Section 5.3 of [RFC6265]), which offers the ability to retrieve a list of cookies by host, and to remove individual cookies.

  1. Let registered be the registered domain of origin’s host.

  2. Let cookie list be the set of cookies from the cookie store whose domain attribute is a domain-match with registered.

  3. For each cookie in cookie list:

    1. Remove cookie from the cookie store.

  4. If the user agent supports other forms of cookie-like storage, these MUST also be cleared for origins whose host's registered domain is registered.

    Note: For example, if the user agent supports Flash, its Local Shared Objects will be cleared via NPP_ClearSiteData.

  5. Clear any Channel IDs [CHANNELID] and bound tokens [TOKBIND] associated with origins whose host's registered domain is registered.

  6. Clear authentication entries and proxy-authentication entries associated with origins whose host's registered domain is registered.

The process of clearing both bound tokens/IDs and HTTP authentication is super hand-wavey. <https://github.com/w3c/webappsec-clear-site-data/issues/2>
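
Steps 1 through 3 can be sketched as below. Several things are assumptions: PUBLIC_SUFFIXES is a two-entry stand-in for the Public Suffix List, the cookie store is modelled as an array of { name, domain } objects, and step 2 is read as "the cookie's domain domain-matches registered" in the RFC 6265 sense. The function returns the surviving cookies rather than mutating a real store.

```javascript
// Hypothetical stand-in for the Public Suffix List; a real implementation
// would consult the full list.
const PUBLIC_SUFFIXES = new Set(["com", "co.uk"]);

// Returns the registered domain (public suffix plus one label), or null.
function registeredDomain(host) {
  const labels = host.toLowerCase().split(".");
  for (let i = 0; i < labels.length; i++) {
    // Longest candidate suffix is tried first.
    if (PUBLIC_SUFFIXES.has(labels.slice(i).join("."))) {
      return i === 0 ? null : labels.slice(i - 1).join(".");
    }
  }
  return null;
}

// Simplified domain-match: identical, or a dot-separated subdomain.
function domainMatches(cookieDomain, registered) {
  const domain = cookieDomain.toLowerCase();
  return domain === registered || domain.endsWith("." + registered);
}

// Sketch of §3.2.4 steps 1-3 over an array-based cookie store.
function clearCookiesForOrigin(origin, cookieStore) {
  const registered = registeredDomain(new URL(origin).hostname);  // step 1
  if (registered === null) return cookieStore;
  return cookieStore.filter(                                      // steps 2-3
    (cookie) => !domainMatches(cookie.domain, registered));
}
```

With this model, a response from https://mail.google.com removes cookies set for accounts.google.com as well, which is the behaviour the first note above describes.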

3.2.5. Clear DOM-accessible storage for origin

Given an origin (origin), the user agent can clear DOM-accessible storage for origin as follows:

  1. For each area in the user agent’s set of local storage areas [HTML]:

    1. If area’s origin is origin:

      1. Execute clear() on the Storage object associated with area.

  2. For each area in the user agent’s set of session storage areas [HTML]:

    1. If area’s origin is origin:

      1. Execute clear() on the Storage object associated with area.

  3. For each database in the set of databases for origin [INDEXEDDB]:

    1. Delete database.

  4. For each registration in the user agent’s set of service worker registrations:

    1. If registration’s scope URL’s origin is origin:

      1. Execute unregister() on registration.

  5. For any other script-accessible storage mechanism, the user agent MUST delete any data associated with this origin. This includes (but is not limited to) the following:

    1. An origin’s WebSQL databases [WEBDATABASE].

    2. An origin’s filesystems [FILE-SYSTEM-API].

    3. Plugin data (e.g. Flash via NPP_ClearSiteData).

    4. Appcache.

    5. Moar?
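
For comparison, script running in a page can approximate steps 1 through 4 for its own origin using web-exposed APIs. This is a sketch of that page-side approximation, not of the user agent algorithm: it only reaches the current browsing context's session storage, and the env parameter exists only so the function can be exercised with stand-ins.

```javascript
// Page-side approximation of steps 1-4. `env` defaults to the page's
// global object; passing a stand-in is purely for experimentation.
async function clearOwnOriginStorage(env = globalThis) {
  env.localStorage.clear();                                     // step 1
  env.sessionStorage.clear();                                   // step 2 (this context only)
  for (const { name } of await env.indexedDB.databases()) {     // step 3
    // Deletion is requested here and completes asynchronously.
    env.indexedDB.deleteDatabase(name);
  }
  const registrations =
    await env.navigator.serviceWorker.getRegistrations();       // step 4
  await Promise.all(registrations.map((r) => r.unregister()));
}
```

A script like this cannot remove HttpOnly cookies or network cache entries, and a compromised page cannot be trusted to run it, which is why the header-driven algorithm is defined at the user agent level.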

4. Security Considerations

4.1. Incomplete Clearing

It is possible that an application could be put into an indeterminate state by clearing only one type of storage. We mitigate that to some extent by clearing all storage options as a block, and by requiring that the header be delivered over a secure connection.

4.2. Service workers

It is imperative that the Clear-Site-Data header is only respected on responses fetched over the network, and not those served by a service worker.

This is because service workers can return arbitrary responses for resource requests in their scope, including third-party requests. Thus, supporting Clear-Site-Data would give them the ability to clear data for any origin.

Note that if a request is sent to a service worker, not handled by it, then restarted with a service-workers mode of "none" and sent to the network, the corresponding response is a network response and can be handled. The previous attempt at obtaining the response from a service worker is irrelevant.

Note also that a service worker update is a network response, and is therefore not affected by this restriction. This is important in order to support the use case in §1.1.4 Kill Switch.
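
Combined with the credentials check in §2.2 Fetch Integration, the restriction amounts to a gate like the one sketched here. The response record and its fromServiceWorker and credentialsFlag fields are hypothetical names for state a user agent tracks internally; headers is assumed to expose has(name).

```javascript
// Decides whether a response's Clear-Site-Data header may be acted on.
// `response` is a hypothetical record:
//   { fromServiceWorker, credentialsFlag, headers }
function shouldHonorClearSiteData(response) {
  // Responses synthesized or relayed by a service worker are never honored.
  if (response.fromServiceWorker) return false;
  // §2.2: the header is only processed when the credentials flag is set.
  if (!response.credentialsFlag) return false;
  return response.headers.has("Clear-Site-Data");
}
```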

5. Privacy Considerations

5.1. Web developers control the timing.

If triggered at appropriate times, Clear-Site-Data can increase a user’s privacy and security by clearing sensitive data from their user agent. However, note that the web developer (and not the user) is in control of when the clearing event is triggered. Even assuming a non-malicious site author, users can’t rely on data being cleared at any particular point, nor are users in control of what data types are cleared.

If a user wishes to ensure that site data is indeed cleared at some specific point, they ought to rely on the data-clearing functionality offered by their user agent.

At a bare minimum, user agents OUGHT TO (in the [RFC6919] sense of the words) offer the same functionality to users that they offer to web developers. Ideally, they will offer significantly more than we can offer at a platform level (clearing browsing history, for example).

5.2. Remnants of data on disk.

While Clear-Site-Data triggers a clearing event in a user agent, it is difficult to make promises about the state of a user’s disk after a clearing event takes place. In particular, note that it is up to the user agent to ensure that all traces of a site’s data are actually removed from disk, which can be a herculean task (consider virtual memory as a good example of a larger issue).

In short, most user agents implement data clearing as "best effort", but can’t promise an exhaustive wipe.

If a user wishes to ensure that site data does not remain on disk, the best way to do so is to use a browsing mode that promises not to intentionally write data to disk (Chrome’s "Incognito", Internet Explorer’s "InPrivate", etc). These modes will do a better job of keeping data off disk, but are still subject to a number of limitations at the edges.

6. IANA Considerations

The permanent message header field registry should be updated with the following registration: [RFC3864]

6.1. Clear-Site-Data

Header field name
Clear-Site-Data
Applicable protocol
http
Status
standard
Author/Change controller
W3C
Specification document
This specification (See §2.1 The Clear-Site-Data HTTP Response Header Field)

7. Acknowledgements

Michal Zalewski proposed a variant of this concept, and Mark Knichel helped refine the details.

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Conformant Algorithms

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps can be implemented in any manner, so long as the end result is equivalent. In particular, the algorithms defined in this specification are intended to be easy to understand and are not intended to be performant. Implementers are encouraged to optimize.

References

Normative References

[CHANNELID]
Dirk Balfanz; Ryan Hamilton. Transport Layer Security (TLS) Channel IDs. URL: https://tools.ietf.org/html/draft-balfanz-tls-channelid
[FETCH]
Anne van Kesteren. Fetch Standard. Living Standard. URL: https://fetch.spec.whatwg.org/
[FILE-SYSTEM-API]
Eric Uhrhane. File API: Directories and System. URL: https://www.w3.org/TR/file-system-api/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[INDEXEDDB]
Nikunj Mehta; et al. Indexed Database API. URL: https://www.w3.org/TR/IndexedDB/
[IndexedDB-2]
Ali Alabbas; Joshua Bell. Indexed Database API 2.0. URL: https://www.w3.org/TR/IndexedDB-2/
[INFRA]
Anne van Kesteren; Domenic Denicola. Infra Standard. Living Standard. URL: https://infra.spec.whatwg.org/
[MIXED-CONTENT]
Mike West. Mixed Content. URL: https://www.w3.org/TR/mixed-content/
[PSL]
Public Suffix List. Mozilla Foundation.
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC3864]
G. Klyne; M. Nottingham; J. Mogul. Registration Procedures for Message Header Fields. September 2004. Best Current Practice. URL: https://tools.ietf.org/html/rfc3864
[RFC5234]
D. Crocker, Ed.; P. Overell. Augmented BNF for Syntax Specifications: ABNF. January 2008. Internet Standard. URL: https://tools.ietf.org/html/rfc5234
[RFC6265]
A. Barth. HTTP State Management Mechanism. April 2011. Proposed Standard. URL: https://tools.ietf.org/html/rfc6265
[RFC6454]
A. Barth. The Web Origin Concept. December 2011. Proposed Standard. URL: https://tools.ietf.org/html/rfc6454
[RFC7230]
R. Fielding, Ed.; J. Reschke, Ed.. Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. June 2014. Proposed Standard. URL: https://tools.ietf.org/html/rfc7230
[RFC7234]
R. Fielding, Ed.; M. Nottingham, Ed.; J. Reschke, Ed.. Hypertext Transfer Protocol (HTTP/1.1): Caching. June 2014. Proposed Standard. URL: https://tools.ietf.org/html/rfc7234
[RFC7235]
R. Fielding, Ed.; J. Reschke, Ed.. Hypertext Transfer Protocol (HTTP/1.1): Authentication. June 2014. Proposed Standard. URL: https://tools.ietf.org/html/rfc7235
[SERVICE-WORKERS]
Alex Russell; et al. Service Workers 1. URL: https://www.w3.org/TR/service-workers-1/
[TOKBIND]
Andrei Popov; et al. The Token Binding Protocol Version 1.0. URL: https://tools.ietf.org/html/draft-ietf-tokbind-protocol
[URL]
Anne van Kesteren. URL Standard. Living Standard. URL: https://url.spec.whatwg.org/
[WEBDATABASE]
Ian Hickson. Web SQL Database. 18 November 2010. NOTE. URL: https://www.w3.org/TR/webdatabase/

Informative References

[CSP2]
Mike West; Adam Barth; Daniel Veditz. Content Security Policy Level 2. URL: https://www.w3.org/TR/CSP2/
[RFC6919]
R. Barnes; S. Kent; E. Rescorla. Further Key Words for Use in RFCs to Indicate Requirement Levels. 1 April 2013. Experimental. URL: https://tools.ietf.org/html/rfc6919
[STORAGE]
Anne van Kesteren. Storage Standard. Living Standard. URL: https://storage.spec.whatwg.org/
