Participants
Alex Christensen, Nic Jansma, Dan Shappir, Sean Feng, Jeffrey Yasskin, Ian Clelland, Rafael Lebre, Michal Mocny, Carine Bournez, Sia Karamalegos, Mike Jackson, Patrick Meenan, Giacomo Zecchini
Admin
- Next meeting: October 26 (later timeslot: 10am PST, 1pm EST, 5pm GMT)
- WebPerf WG Charter under AC review
Agenda
Privacy principles and ancillary data - Yoav
Recording
- Yoav: Briefly talk about an effort happening between TAG and PING
- ... Privacy principles being worked on in TAG repo
- ... Also conversations around data minimization and principles around that
- ... Sites, user-agents, everyone should minimize personal data exposed to web
- ... Personal data can mean anything, characteristics, how they interact with the page, network, etc considered personal data
- ... Doc also defines ancillary uses of data (non-functional uses), and defines data used that way as ancillary as well
- ... In current version of the document, if we have a single piece of data being used for functional reasons (e.g. click event exposes timestamp, functional reason for that), but can also be used for ancillary uses (i.e. when it happened), that data becomes ancillary as well depending on how it's being used
- ... I pointed out that issue to Jeffrey, as a result there's a second PR on that front that is trying to rewrite that section to address that discrepancy
- ... This PR proposes instead of defining ancillary data as part of its usage, we define two types of ancillary APIs
- ... 1. Exposes non-ancillary data through other means
- ... 2. Exposes data not exposed through non-ancillary APIs
- ... Examples of that are DNS timing in RT, presentation times in ET, memory measurement (new kinds of data that are only exposed for measurement, monitoring and regression prevention purposes)
- ... Fact that some data is ancillary doesn't mean it has an outsized privacy risk or is particularly sensitive, but it should be looked at per data minimization principles in that doc
- ... Regarding that PR, there's no consensus in the task force that it's OK to report non-ancillary data for ancillary uses
- ... "Reducing collection cost would increase data collection", which goes against what the principle is trying to prevent
- ... Assuming the PR lands in some form, we have a distinction where ancillary data is novel data that's not available through other means
- ... Two potential mitigations
- ... 1. User permission to access that data
- ... 2. Private aggregation of that data
- ... For user permission front, in my opinion that may make sense for info that is sensitive or requires extra debugging info for the machine type (PII of the user), but it can be cumbersome and deter folks from using the right APIs (and they might try to get access to this info from other means)
- ... For private aggregation, we talked about this at TPAC last year. Two potential shapes, in rough sketch that we have of these APIs from privacy arena
- ... An API that is key-value based, key and measurement, which gets uploaded to an aggregation server
- ... Browser defines metrics based on some predefined keys (metric type and origin or URL hash) and values (the measurement itself, i.e. DNS times), a browser-internal reporting that defines the key as a DNS metric+host, the value is the time it takes for the browser to measure this DNS.
- ... Sent out to an aggregation server, and RUM providers or origins could ask the aggregation server for this data in aggregate
- ... Use in aggregated histogram form
- ... Other is some form of a worklet, where site can define code that has access to various kinds of ancillary data, and that code cannot talk to the page, but can output histogram distributions that would get sent to an aggregation server
- ... Gets site more control over what gets reported, output is a histogram
- ... Both would make it hard to present resource-specific data, unless we can do histogram per-resource per-metric
- ... For sure this will be a significant shift from what RUM providers are currently doing
- ... If we move to this kind of model for ancillary data, we could have access to cross-origin data we don't see due to existing security restrictions that we don't necessarily have in the aggregate
- ... Bring this discussion to the group's attention and gather thoughts on that
- ... I'm trying to represent what we think when talking to privacy task forces
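The key-value shape Yoav described (key = metric type + host, value = the measurement, reported only as a histogram) can be sketched roughly as below. This is a minimal illustration, not a proposed API: the `DnsSample` type, the `dns|host` key format, and the bucket bounds are all assumptions made for the example. The DNS time is derived the way Resource Timing exposes it, as `domainLookupEnd - domainLookupStart`.

```typescript
// Hypothetical input: one DNS measurement per fetched resource.
// The two timestamps mirror the PerformanceResourceTiming attributes.
interface DnsSample {
  host: string;
  domainLookupStart: number; // ms
  domainLookupEnd: number;   // ms
}

// Assumed bucket upper bounds in ms; one extra open-ended bucket follows.
const BUCKET_BOUNDS_MS: number[] = [10, 50, 100, 500];

// Index of the first bucket whose upper bound exceeds the value.
function bucketIndex(valueMs: number): number {
  for (let i = 0; i < BUCKET_BOUNDS_MS.length; i++) {
    if (valueMs < BUCKET_BOUNDS_MS[i]) return i;
  }
  return BUCKET_BOUNDS_MS.length;
}

// Aggregates samples into one histogram of counts per "metric|host" key.
// Only these counts would leave the browser, never individual samples.
function aggregateDns(samples: DnsSample[]): Record<string, number[]> {
  const histograms: Record<string, number[]> = {};
  for (const sample of samples) {
    const key = "dns|" + sample.host; // hypothetical key format
    if (!histograms[key]) {
      histograms[key] = BUCKET_BOUNDS_MS.map(() => 0).concat([0]);
    }
    const dnsMs = sample.domainLookupEnd - sample.domainLookupStart;
    histograms[key][bucketIndex(dnsMs)] += 1;
  }
  return histograms;
}
```

With per-host keys like this, per-resource detail is lost, which matches the concern raised above that resource-specific data would need a histogram per resource and per metric.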
- Jeffrey: Step back and share why we're looking at this at all
- ... Sense from privacy folks on the task force that this group produces a bunch of APIs that could be turned off, and users could continue doing what they do. Might sacrifice website's long term health
- ... Task force members want users to be able to better control what ancillary data is used
- ... User be able to turn that off and not have that data contribute to the site
- ... We can't turn off DOM APIs that let sites get this, APIs that summarize that data are a lost cause
- ... What should we do about APIs that expose new information?
- ... Question that the task force asks that I don't have an answer to, how does this group think those APIs should be constrained?
- ... Different set of constraints that this group wants to write into privacy principles
- ... You have some constraints you're operating under, maybe not a principle yet but you have a one-off for each API
- ... Set of principles for this group that you write under would be good
- ... APIs that expose new information should not expose new data, aggregate or ask permission
- Dan: Question, the discussion seems a bit theoretical to me
- ... Potential for harm, but I'm not aware of concrete specific examples of harm
- ... Discussion about cookies and privacy related to cookies
- ... Specific examples of harm are well-known and documented
- ... Do we have specific examples of harm being done using existing APIs?
- Alex: There are a lot of requests for accessing new data that we have a great use for to improve a website, and I believe you have use-cases, and am sympathetic
- ... But have to consider how data could be abused also, most of it is fingerprinting
- ... Websites want to know exactly who the user using this website is
- ... Is there high-background memory usage or CPU usage, then it's more likely it's the same user that saw this website with same characteristics
- ... Concrete example that's giving fuzzy but useful fingerprinting data
- ... User has not indicated that they want to be fingerprinted
- ... Happy to hear we're talking about aggregate anonymous data collection, we'd have much less objection to
- ... Want website to have data but just not have knowledge of users from whom that data came (without their consent)
- Yoav: Main risk is fingerprint-ability, every new piece of data adds a few bits of entropy that can be used to target the specific user
- Jeffrey: We can write a principle without saying we have to use it right away. Sets a long-term goal for the group. We haven't designed the aggregation APIs we need. We can still ship APIs without having to think if they contribute to fingerprinting bits one by one.
- Yoav: To build on Dan's question and Alex's answer, I think that it's good to look at this from the risk and mitigation perspective.
- ... Helps my case at looking at novel data vs. already-exposed data
- ... Already-exposed data is already available + active fingerprinting data
- ... Slightly more coarse doesn't, in most cases, enable any new kinds of attacks if we're looking at fingerprinting as the risk
- Dan: Obviously any bit of information contributes to fingerprinting, but that's just potentially. Concrete examples?
- ... Another question, aggregate collection of data into an aggregation server. My understanding is that it's opt-in, when do you envision the user opting in?
- Yoav: Two things, user-permission and aggregated reporting. In my head those are mutually exclusive
- Dan: When using aggregation, they're automatically opted-in?
- Yoav: Yes
- Dan: Critical. If it's opt-in, can we assume no data would be captured?
- ... If it's per-site, in addition to cookies/etc, they'd have to opt-in to performance reporting
- ... Less than ideal
- Jeffrey: Want principle to say if it's aggregated (de-personalized), it can be used by default
- ... Users should be able to opt-out
- ... Most aggregate systems have a threshold of identifiability, maybe the browser would be leaking something, so users could opt-out
- ... Second question for Alex, are you comfortable with an API that exposes information already shared from a DOM, is that OK?
- Alex: Agreed, but the standard of what's exposed from the DOM differs between browsers. E.g. what's exposed in Chrome isn't necessarily exposed by Safari
- Jeffrey: Question of how to phrase new information is hot topic on task force
- ... Wording suggestions there are welcome
- Alex: We already also ship on-by-default anonymous data gathering features that users can opt-out of, but on by default for 99%+ of users
- Sia: Step back, there's going to be harms on both sides of this, so there are tradeoffs
- ... Harm of fingerprinting, but also harm on lower-end usage
- ... Framing of that discussion, issue of equity and other issues as well
- Benjamin: Point out that the charter explicitly has out-of-scope performance analysis
- ... For e.g. compute pressure, I think these are out of scope, do we need to revisit this in the charter
- Yoav: This isn't about performance analysis, but is about gathering data
- ... Out of scope is how to analyze those bits afterwards
- Benjamin: Talking about data collection that is used for data analysis
- Yoav: sendBeacon(), fetchLater(), are about data collection that is used for analysis. Bits in charter are around how does one process that data after it's collected, not about how browser sends it out
- ... If we defined aggregate reporting in the WG, and there are issues with charter around that, we can revisit at that point
- Jeffrey: To Sia's point, it's been hard to get a privacy document to talk about tradeoffs with non-privacy goals. But it does say it doesn't trump other web principles.
- ... May need to trade off privacy principles with other goals
- ... API may not strictly adhere to privacy principles
- Yoav: We've been talking about maybe creating monitoring and deployment principles, or some other document that talks about the broader good that the APIs this working group is working on do, and gives us something to anchor other principles that one can trade off with privacy and others
- Katie: As someone who cares deeply about privacy but works at a company that cares about performance because of the measurable impact we can track (e.g. conversion rate, bounce rate).
- ... Frightening to me that we'd lose fidelity about ways users stayed on the site and converted
- ... Reality on the ground is we sell perf to companies because it will improve their bottom-line
- ... Without being able to get something approximating that user data, it would be hard to continue to make the case for web performance in a corporate setting
- Sia: Are you saying it'd be hard to justify for corporations, but maybe other organizations could?
- ... Buying things is a part of life
- Katie: There is a lot of moral ambiguity here
- ... As someone who navigates this question at work of why do we invest in performance, it drives the bottom line
- ... I wish it wasn't that way, but it is
- Jeffrey: I think we can get performance APIs that can get that connection
- ... Privacy Sandbox folks are tracking seeing an ad to buying something
- ... I think we can get APIs that are similarly private
- Yoav: I don't think we're talking about scrapping existing APIs and moving to aggregated data
- ... New and aggregate data that isn't available elsewhere
- ... We still can do the tracking for that same-origin traffic and report that data for everything web-exposed
- ... While decorating that with histograms of e.g. DNS times
- Michal: Wanted to follow-up with Benjamin's example of Compute Pressure
- ... Been consulting a few clients referencing this API
- ... Web platform feature that's not ancillary data
- ... Goal is for users that have compute-intensive features, ways to identify ways of backing off fidelity of experience
- ... Non-ancillary data API
- ... Demand for a feature like that, might be alternatives, whatever
- ... Might be a future proposal to summarize that data for ancillary purposes
- ... Does a site tend to be under compute pressure in the aggregate
- ... Maybe a large-scale change that the site would want to make
- ... In one earlier revision of privacy principles, users should have to pay for that ancillary use
- ... I think we're past that
- ... Ancillary data should only be through aggregate, privacy-preserving
- Yoav: MSFT folks talked about being able to use something like Compute Pressure as a dimension to be able to split data on, in order to distinguish NT vs. ones on idle machines
- Nic: as a RUM provider I want to help our customers. Always wanted to report on the entire page weight where aggregated reporting can help us get there.
- … but it would be hard to incorporate it. Want to be part of this conversation
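Yoav's point about using something like Compute Pressure as a dimension to split data on could look roughly like the sketch below. It assumes each RUM sample was tagged at collection time with a pressure state; the `RumSample` shape is hypothetical, and the state names follow the Compute Pressure states ("nominal", "fair", "serious", "critical").

```typescript
// Hypothetical RUM sample tagged with the pressure state observed
// when the metric was measured.
interface RumSample {
  pressureState: string; // e.g. "nominal" or "critical"
  valueMs: number;       // any timing metric, in ms
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Splits samples by pressure state and reports the median per group,
// so busy-machine sessions do not blur the idle-machine numbers.
function medianByPressure(samples: RumSample[]): Record<string, number> {
  const groups: Record<string, number[]> = {};
  for (const sample of samples) {
    if (!groups[sample.pressureState]) groups[sample.pressureState] = [];
    groups[sample.pressureState].push(sample.valueMs);
  }
  const result: Record<string, number> = {};
  for (const state of Object.keys(groups)) {
    result[state] = median(groups[state]);
  }
  return result;
}
```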
User timing and framework use counters - Annie
Recording
- Annie: No proposed API changes here, just convention on how we use UserTiming
- ... In the UserTiming L1 spec, there were standard mark names (which we have since pulled): mark_fully_loaded, mark_fully_visible
- ... For sites to report their own custom load times, if they don't want to use e.g. LCP
- ... New use-case for a convention
- ... Frameworks / CMS features
- ... Image directives, a feature to add fetch priority hints
- ... Font modules, fallback fonts
- ... 3P modules allowed to load 3Ps without putting scripts into the critical path
- ... Using them can improve performance
- ... We'd like to better understand if they're working well
- ... UserTiming mark might be a good way to know if the feature was being used
- ... Marks are already in traces, lab tools could say e.g. using a feature would improve
- ... RUM providers could note usage, and show A/B test results
- ... HTTP Archive could show usage
- ... Proposed syntax
- ... Wondering if others would find this convention useful
- Sia: Are you thinking about more here?
- ... Feature is NgOptimizedImage and the version is XYZ?
- Annie: Yeah if we use detail, it's type any, so you could add more information
- Sia: I think it's interesting
- ... Rick just added to HTTP Archive capturing the shopify data
- ... This could be cool
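As a rough illustration of the exchange above about carrying extra information in `detail`, the sketch below records a feature-usage mark through the User Timing L3 `mark(name, { detail })` call. The minutes do not capture the proposed syntax, so the mark name `example_feature_usage` and the `{ feature, version }` detail shape are placeholders invented for this example, not the proposal itself.

```typescript
// Minimal slice of the User Timing L3 interface used here; the real
// `performance` object in browsers satisfies it.
interface MarkTarget {
  mark(name: string, options?: { detail?: unknown }): unknown;
}

// Placeholder name: the actual convention is not recorded in these minutes.
const FEATURE_MARK_NAME = "example_feature_usage";

// Assumed detail payload; `detail` accepts any value, so more fields
// could be added.
interface FeatureDetail {
  feature: string;
  version?: string;
}

// Records that a framework or CMS feature was used on this page.
function markFeatureUsage(perf: MarkTarget, feature: string, version?: string): void {
  const detail: FeatureDetail =
    version === undefined ? { feature } : { feature, version };
  perf.mark(FEATURE_MARK_NAME, { detail });
}
```

In a page this would be called as `markFeatureUsage(performance, "NgOptimizedImage", "XYZ")`, and lab tools or RUM scripts could read the marks back with `performance.getEntriesByName(FEATURE_MARK_NAME)`.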
- Yoav: I think there are two different things here
- ... HTTP Archive can expose this detail, and then people could do complex queries on it
- ... Chrome and use-counters could also expose a predefined set
- ... If you would want a NgOptimizedImage5, could be exposed in HTTP Archive but not use counter
- ... People can contribute a one-line change to a library, and once it's approved, it becomes a use-counter you'd get stats on usage in the wild for
- Annie: For bigger orgs you can control what's in there?
- Katie: +1 to the use-case of organizations being able to define their own feature usage
- ... If we're running synthetic tests via cron on SpeedCurve, tying results back to a feature experiment that is running is proving difficult
- ... Automatically goes into a format that WPT could use
- ... Could open up a few doors for being able to tie performance back to an internal feature
- Annie: Being able to split experiments
- Katie: Being able to track when this new feature is appearing
- ... How to keep track of when everything is happening and is combined
- ... Trying to figure out better ways of tracking that without it being burdensome computationally
- ... Can see a lot of usage for this for us
- Nic: In L1 we had these fancy standard names, it’s not in L3?
- Annie: I should add it back
- Nic: And then it’d be one of those. For the features, would your team come up with a list of suggested features? Would there be a standard for feature names?
- Annie: Names could collide
- Nic: You mentioned existing features - angular specific
- Annie: and you could add more things that are org specific
- Pat: Would help to add guidance to consider when you log the mark
- ... e.g. use-counters mark the time they were first used on the page
- … may make sense for some features
- Yoav: I think we can benefit from standard names, if we want to reflect this in Chrome use-counters, and we want WPT or Lighthouse to say smart things about framework features, I think we should have a standardized or wiki'd list.
- ... Would have to be well-known in Chromium code
- ... e.g. there could be collisions
- ... Add some namespace, and a way of serializing namespace and feature so it avoids collisions
- Annie: Will add a GitHub issue and link here
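Yoav's suggestion of a namespace plus a way of serializing namespace and feature could be sketched as follows. No format was agreed in the discussion; the `:` separator and the function names are assumptions chosen only to show how two organizations using the same feature name would avoid colliding.

```typescript
// Assumed separator; the minutes do not specify a serialization format.
const SEPARATOR = ":";

// Joins namespace and feature into one string. The namespace may not
// contain the separator, so the first separator always marks the split.
function serializeFeature(namespace: string, feature: string): string {
  if (namespace === "" || feature === "") {
    throw new Error("namespace and feature must be non-empty");
  }
  if (namespace.indexOf(SEPARATOR) !== -1) {
    throw new Error("namespace must not contain the separator");
  }
  return namespace + SEPARATOR + feature;
}

// Splits a serialized name back into its parts at the first separator.
function parseFeature(serialized: string): { namespace: string; feature: string } {
  const i = serialized.indexOf(SEPARATOR);
  if (i <= 0 || i === serialized.length - 1) {
    throw new Error("not a namespaced feature: " + serialized);
  }
  return {
    namespace: serialized.slice(0, i),
    feature: serialized.slice(i + 1),
  };
}
```

A well-known list (standardized or wiki'd, as suggested above) could then hold the namespaces, while each namespace owner manages its own feature names.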