W3C Workshop on Web & Virtual Reality

October 19-20, 2016; San Jose, CA, USA

Workshop Report

Executive Summary

During the Web & Virtual Reality Workshop organized by W3C in October 2016, 120 participants representing browser vendors, headset and hardware manufacturers, VR content providers, designers and distributors analyzed the opportunities provided by making the Web a full-fledged platform for VR experiences.

They recognized the strong prospects already opened by existing and in-development Web APIs, in particular the WebVR API that was highlighted as an important target for near-term standardization, as well as the high priority of making the Web a primary platform for distributing 360° videos. They also identified new opportunities that would be brought by enabling traditional Web pages to be enhanced into immersive spaces, and in the longer term, by making 3D content a basic building block available to Web developers and content aggregators.

Summary

W3C held a two-day workshop on Web and Virtual Reality in San Jose on 19-20 October, 2016, hosted by Samsung. The primary goal of the workshop was to bring together practitioners of Web and Virtual Reality technologies to start the discussion on how to make the Web platform a better delivery mechanism for VR experiences. The goal of bringing together the right people was greatly exceeded: the workshop convened a diverse group of participants across the Web and VR industries to set the future direction and priorities for VR on the Web. Participants included all major browser makers and VR hardware vendors, as well as key VR platform providers, VR content producers and distributors, VR and 3D Web software developers, VR experience designers, users of VR, and VR accessibility experts, in total 120 registered participants.

The workshop’s secondary goals were to share experiences between practitioners in VR and related fields, discuss how to address VR use cases that are difficult or impossible on the Web today, and to identify potential future standards and establish timelines to enable the Web to be a successful VR platform. The secondary goals were also met and exceeded in productive discussions that took place at the workshop in ten hour-long focus sessions. These sessions started off with multiple short talks on the topic, followed by group discussion and a joint summary. The lightning talk sessions covered a lot of ground, from VR user interactions and multi-user VR experiences to authoring VR experiences, while the breakout sessions took deep dives into selected topics. Audio and video opportunities related to VR were discussed in the immersive audio panel and the 360° video on the Web lightning talks respectively. At the end of the action-packed first day, a demo session was organized in which participants got to show off VR technologies in the making relevant to the workshop, including the bleeding edge of Web-based VR authoring tools, the latest WebVR implementations, demos of shared VR spaces, and positional audio, among others.

At the end of the workshop, the VR standardization landscape was constructed considering input received during the workshop. The landscape analysis identified existing W3C standardization work that was seen as beneficial to VR, recommendations for new standardization work, as well as longer-term standardization targets to explore. Furthermore, closer W3C-Khronos coordination in the VR space was called for, and opportunities for collaboration across other standards organizations, including the Web3D Consortium, the Moving Picture Experts Group, and the Open Geospatial Consortium, were identified. The work on more mature proposals continues in the existing W3C and Khronos Working Groups, while more exploratory ideas will be incubated in appropriate W3C Community Groups and Khronos Communities.

Sessions

See also: Presentations and Minutes

The workshop kicked off with a plenary session and an introduction to W3C by Dominique Hazael-Massieux. The introduction was followed by a keynote by Sean White, SVP of Emerging Technologies at Mozilla, that set the tone for the workshop. A WebVR implementation status update from browser vendors Google, Mozilla, Samsung, Oculus VR, and Microsoft closed off the plenary session and oriented the audience on how much is possible already today, and where the gaps are.

The topics for the workshop lightning talk sessions and panel discussions that followed the plenary session were chosen based on suggestions from the workshop participants, and the talks themselves were selected from among the high-quality submissions received. The workshop sessions were the following:

VR user interactions in browsers

See also: Presentations and Minutes

The VR user interactions in browsers session identified needs for basic primitives to capture user intents in VR experiences and explored nascent UI patterns in VR interactions.

In this session, it was demonstrated how a traditional browser UI can be seen as a disruption in VR that needs to be rethought for WebVR use cases. Speed and frictionless access to content are at the core of the Web, and hyperlinks are the fundamental building block of the Web architecture. These defining characteristics of the Web have to be retained in WebVR to ensure navigating from web pages to VR worlds is as seamless and natural an experience as browsing the traditional Web.

“They watch some demos or play a game and walk away saying how impressive it is, but almost everyone makes a remark about how they wish they had hands.” - Brandon Jones

Interacting with hands in a virtual space is a natural interaction mechanism. Two types of high-level hand tracking mechanisms were demonstrated: skeleton tracking for avatar use cases, and cursor tracking using a hand gesture for accurate interface control. A laser-pointer-style default input mechanism that provides a fallback for non-VR platforms was also discussed and demonstrated.

Learnings from the Samsung VR Browser highlighted issues in porting advanced browser UI features such as tab management to VR. On the other hand, a feature in the Samsung VR Browser that allows the 2D web page to customize the VR space around the web content viewport was considered a good example of progressively enhanced 2D content for VR.

In the joint discussion, it was noted that common interaction patterns could be repurposed for WebVR, taking in learnings from 3D games and related libraries, as well as from research proceedings (e.g. IEEE VR). One challenge identified is the cultural variety and domain specificity of any standard set of gestures. The Web has established a set of standard primitives (e.g. the scrollbar) that work everywhere and behave in a predictable way. Similar primitives are required for VR in the long term. Furthermore, we need standard ways for WebVR content developers to make use of these primitives so that not every WebVR page behaves completely differently and confuses users.

From a standardization point of view, the following opportunities for further investigation were identified:

Accessibility of VR experiences

See also: Presentations and Minutes

The Accessibility of VR experiences session set out to determine what enables an accessible VR experience, and what basic hooks are needed to make VR approachable in a casual way.

In this session, it was discussed how accessibility crosses a number of domains and disabilities, including auditory, cognitive, neurological, physical, speech, and visual. Accessibility is inherently multimodal, i.e. it translates information from one modality to another, and the needs span across input and output. Lessons learned from making video and images accessible are applicable to VR.

In the broader context, the 2D web should be accessible to all in VR, including people without disabilities. The current web is mainly 2D content that assumes input mechanisms not optimal for VR. Opportunities to improve 2D content accessibility in VR include high-quality 2D content reprojection (text, cylinder surfaces, and custom projections), the use of big hit targets, and careful management of viewport characteristics. Rethinking user input models to work better with gaze-based input and voice commanding, and making use of predictive and context-aware keyboards, also need to be explored. Trust and understanding of security considerations to protect against attacks such as UI spoofing is important; mitigation strategies include using secure transport, securing inputs to the target context, integrating with password management and form filling, and integrating with web payments. Making interactive 360° photos and video accessible was seen as a gateway to making the broader 2D web accessible to all in VR.

Multi-user VR experiences

See also: Presentations and Minutes

The Multi-user VR experiences session discussed challenges and opportunities of building social WebVR experiences while being immersed in an isolated virtual world, and identified known technical solutions and gaps.

In this session, experiences in building both large-scale and smaller-scale experimental multi-user services were discussed. VR was seen as a disruptor like the Internet or the smartphone due to its degrees of input freedom and natural communications. General use of VR for communication and commerce is predicted to outstrip its use in entertainment. The client/server model was deemed better suited for multi-user VR as it scales to millions of VR servers, in contrast to the vertical model based on curated app stores. A platform used for building large-scale multi-user VR services has to adhere to the following requirements: low-latency, real-time data transmission (<100 ms end-to-end delay), ability to mix 3D audio and avatar motion data, support for compressed scene description, real-time scene and object updates, and distributed servers. Possible areas for standards include identity, content portability (appearance and scripting), certificates of authenticity for assets, and server discovery. On the other hand, the web is already a feasible platform for building small-scale copresent VR experiences, as demonstrated by the prototype built using WebVR, WebAudio, and WebRTC.
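As a rough illustration of the small-scale approach, the sketch below shows how avatar pose updates might be exchanged over a WebRTC data channel configured for low latency; the signaling step is omitted, and the channel name and helper functions are illustrative rather than part of the prototype discussed.

    // Minimal sketch of low-latency avatar pose sharing over WebRTC.
    // Signaling (offer/answer exchange) is omitted; 'avatar-pose' and the
    // sendPose helper are illustrative names, not a standard API.
    const pc = new RTCPeerConnection();

    // Unordered, no-retransmit channel: a late pose update is useless,
    // so dropping it is preferable to head-of-line blocking.
    const poseChannel = pc.createDataChannel('avatar-pose', {
      ordered: false,
      maxRetransmits: 0
    });

    function sendPose(pose) {
      // pose: { position: [x, y, z], orientation: [x, y, z, w] }
      if (poseChannel.readyState === 'open') {
        poseChannel.send(JSON.stringify(pose));
      }
    }

    poseChannel.onmessage = (event) => {
      const remotePose = JSON.parse(event.data);
      // Apply remotePose to the remote participant's avatar in the scene.
    };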

Immersion is key to presence and engagement in multi-user use cases, and we need to be able to show up as we are at every moment. However, today's avatars cannot yet convey the social skills we have developed. Being able to convey body language and nonverbal communication in multi-user VR requires eye contact and facial expression, dynamic posturing and positioning, spatial behaviours and strategies, and effective touch and manipulation. The challenge presented is to allow WebVR to convey this type of natural communication.

Mixed Reality Service is a proposal for providing a metadata layer critical for large-scale multi-user VR. It binds the real and virtual worlds, translating coordinates into URIs. With simple modifications, MRS provides mapping services for both mixed reality and virtual reality.

Authoring VR experiences on the Web

See also: Presentations and Minutes

(Work in progress)

High-performance VR on the Web

See also: Presentations and Minutes

The High-performance VR on the Web session shared WebVR implementation learnings and performance best practices, identified pitfalls to avoid when targeting low-latency VR, and identified the needs for other performance-enhancing technologies, such as WebAssembly, in the context of VR.

In the Chrome Android learnings & pitfalls to avoid discussion, a clear target for high performance was agreed upon: <20 ms motion-to-photon latency. The straightforward solution is fast rendering: render at the target frame rate and poll the pose as late as possible. Dropped frames with this approach are very unpleasant, so an adaptive renderer that can scale quality is beneficial. An alternative approach is time warp (or space warp), which adjusts the view in the distortion phase to work around late frames; this is smooth but can introduce artefacts.
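A minimal sketch of the fast-rendering approach using the WebVR 1.1 API as implemented at the time: the pose is polled as late as possible before drawing, and the frame is submitted immediately afterwards. drawScene() is a placeholder for the application's WebGL rendering.

    // Sketch of a WebVR 1.1 render loop that polls the pose as late as possible.
    const frameData = new VRFrameData();

    navigator.getVRDisplays().then(([vrDisplay]) => {
      if (!vrDisplay) return;

      function onFrame() {
        vrDisplay.requestAnimationFrame(onFrame);

        // Poll the pose immediately before rendering to minimize
        // motion-to-photon latency.
        vrDisplay.getFrameData(frameData);

        drawScene(frameData.leftViewMatrix, frameData.leftProjectionMatrix,
                  frameData.rightViewMatrix, frameData.rightProjectionMatrix);

        // Hand the rendered frame to the compositor as soon as possible.
        vrDisplay.submitFrame();
      }
      vrDisplay.requestAnimationFrame(onFrame);
    });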

The Gear VR Performance Tweaks and Pitfalls discussion outlined why mobile VR is hard. Current mobile platforms introduce system integration issues in contrast to PC, as well as WebVR-specific issues such as oversized textures and the overuse of performance-sensitive WebGL API calls. Development tools used for performance profiling on mobile are hard to use, especially for web developers.

The Building a WebVR Content Pipeline discussion touched upon how to prepare, stream, and optimize content for VR. VR content is highly complicated, as it includes meshes, textures, shaders, and skins, and the WebVR content pipeline has to deal with this complexity. Build systems and tooling need to evolve with learnings from the game industry. Existing build solutions for the Web (e.g. Webpack, Browserify) are optimized for large 2D web sites, and do not cater for the special requirements of WebVR content. Packaging is also not an optimal solution, as it requires a large download and hurts iterative development.

On the server side, Cross-Origin Resource Sharing enables hosting freely addressable content, and a smart CDN could be used to offload tasks such as texture size optimization depending on the device capabilities. On the client side, progressive texture loading (with vertex color fallback), service workers (predict, prefetch, cache), the web app manifest, and best practices for web developers for defined application types would help ensure WebVR remains a no-install, no-long-download, no-wait environment with progressive enhancement at its core.
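As a rough sketch of the service-worker idea, assuming an illustrative asset list and cache name, a worker could precache VR assets at install time and serve them cache-first:

    // sw.js - sketch of precaching WebVR assets with a service worker.
    // The asset list and cache name are illustrative.
    const CACHE = 'webvr-assets-v1';
    const ASSETS = ['/models/scene.gltf', '/textures/env.ktx', '/js/app.js'];

    self.addEventListener('install', (event) => {
      event.waitUntil(caches.open(CACHE).then((cache) => cache.addAll(ASSETS)));
    });

    self.addEventListener('fetch', (event) => {
      // Serve cached assets first, fall back to the network.
      event.respondWith(
        caches.match(event.request).then((hit) => hit || fetch(event.request))
      );
    });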

The WebVR next discussion outlined possible goals for a future, more performant version, such as adding more WebGL performance primitives, support for device-specific layers, and more interaction with other related specifications: Web App Manifest, Service Worker, WebGL, Web Imaging, Gamepad, Workers. Some of the proposed changes are incremental, and some may introduce major breaking changes. Performance-enhancing WebGL capabilities identified include multi-view instancing to reduce overhead, geometry tools (fonts, HTML to geometry), and compute shaders. Furthermore, more retained-mode capabilities were identified, such as a persistent static layer for e.g. loading screens, a cursor layer, and cylinder layers for high-quality 2D surface presentation. It was also noted that we should be cautious not to go too far and become a full retained scene graph. The execution model of browsers, including threading, garbage collection, and the HTML5 event loop, is currently optimized for 2D HTML content and could benefit from optimizations informed by WebVR learnings.

360° video on the Web

See also: Presentations and Minutes

The 360° video on the Web session identified needs for evolutions in streaming infrastructure (both on the server and the client side) to adapt to the heavy demands of 360° content streaming, and built understanding of what changes are needed to HTML media interfaces to make them suitable for 360° media content.

In this session, a proof-of-concept 360° video cloud streaming solution that enables a high-quality 360° video experience on low-capability devices such as hybrid TVs or mobile devices was presented. A natural 360° video experience requires the spectator to be able to freely change her individual perspective of view, which requires the full spherical image for any direction of view to be available at every moment. The options discussed are to render the view on the server side and stream only the selected 360° video content to the end device, or to do the 360° video processing on the client side using the existing W3C Media Source Extensions API, which makes it possible to implement the entire logic of the 360° player in the browser. The motion-to-photon latency, which depends on different factors like network latency, buffering strategies, and segment duration, is a known limitation.
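A minimal sketch of the client-side option, feeding 360° video segments to a <video> element through the Media Source Extensions API; the segment URLs and the codec string are placeholders:

    // Sketch of the client-side option: feed 360° video segments to a
    // <video> element through Media Source Extensions.
    const video = document.querySelector('video');
    const mediaSource = new MediaSource();
    video.src = URL.createObjectURL(mediaSource);

    mediaSource.addEventListener('sourceopen', async () => {
      const sb = mediaSource.addSourceBuffer('video/mp4; codecs="avc1.640028"');
      for (const url of ['/seg/init.mp4', '/seg/1.m4s', '/seg/2.m4s']) {
        const data = await fetch(url).then((r) => r.arrayBuffer());
        sb.appendBuffer(data);
        // Wait until the buffer is ready for the next append.
        await new Promise((resolve) =>
          sb.addEventListener('updateend', resolve, { once: true }));
      }
    });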

We identified a need for extensions to the Media Source Extensions API and HTMLVideoElement in order to provide a standards-based 360° experience, or alternatively a need to provide hooks in HTMLMediaElement for a native 360° video player to integrate with, including the metadata required to render the view correctly, an API to set and get the field of view with associated change events, and possibly a zoom factor.

In the path to native spherical video discussion, it was noted that non-rectangular media formats (including projections) are not standardized, and that the current solution pipeline consists of <video>, WebGL, and WebVR working together, where the app determines and applies the projection. In the near term, libraries such as VR View will abstract this out for developers. It was suggested that projection and 3D information should be specified in the container (e.g. proposals for ISO BMFF (MP4) and WebM). The base assumptions are that adaptive support for spherical rendering is desirable for latency, performance, DRM, or other reasons, and that the browser processes the metadata and handles projection. Also, in the spirit of progressive enhancement, it was agreed that support for all clients and users, including those with and without headsets, is a must. Finally, two approaches for native spherical video were discussed: the simplest approach is to let the browser provide UX to select spherical rendering, while the more complete and powerful approach is a spherical DOM presentation, where a spherical <video> is like any other element.
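As a rough illustration of the current <video> + WebGL pipeline, the sketch below maps an equirectangular video onto the inside of a sphere with three.js; the projection type is assumed, and the scene and render loop setup are omitted:

    // Sketch of the <video> + WebGL pipeline: map an equirectangular video
    // onto the inside of a sphere with three.js.
    const video = document.querySelector('video');
    const texture = new THREE.VideoTexture(video);

    const geometry = new THREE.SphereGeometry(500, 60, 40);
    // Flip the geometry inside out so the camera, placed at the center,
    // sees the video on the inner surface.
    geometry.scale(-1, 1, 1);

    const mesh = new THREE.Mesh(geometry,
      new THREE.MeshBasicMaterial({ map: texture }));
    scene.add(mesh);  // `scene` and the render loop are set up elsewhere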

Immersive audio

See also: Presentations and Minutes

The Immersive audio panel discussed the current status and plans for immersive audio on the Web, and solicited feedback from the participants to inform related standards work, e.g. in the W3C Audio Working Group.

In this session we discussed the ongoing work at the W3C needed to make sound a first-class citizen for VR on the web: a standard API for audio processing, how to build a high-quality spatial audio renderer on top of this API, and the importance of object-based audio and spatialisation techniques for great linear VR experiences.

The session saw three presentations: first we heard from Raymond Toy (Google, and Editor in the W3C Audio WG) about the Web Audio API specification, which aims to provide native capability for audio processing in the browser, including spatial effects. Raymond mentioned that the group was currently re-chartering and keen to get input on the next version of the API. This was followed by a few questions, mainly on implementation issues for spatialisation features. We then heard from Hongchan Choi about Google’s Omnitone project, a spatial audio renderer built exclusively with the native nodes of the Web Audio API. Finally, we heard from Mike Assenti (Dolby Labs) on how to create immersive linear VR experiences by using, positioning, and processing object-based audio. This was followed by a Q&A session with the panel which included discussions on the worker-friendliness of the audio API, questions about accessibility, and a number of follow-up questions on the talks themselves.
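For reference, a minimal sketch of the spatialization primitives the Web Audio API already offers, which renderers such as Omnitone build upon; the source setup and position values are illustrative:

    // Sketch of spatializing a sound source with the Web Audio API.
    const ctx = new AudioContext();
    const source = ctx.createBufferSource();  // buffer loaded elsewhere

    const panner = ctx.createPanner();
    panner.panningModel = 'HRTF';             // head-related transfer function
    panner.setPosition(2, 0, -1);             // place the source in 3D space

    source.connect(panner);
    panner.connect(ctx.destination);
    source.start();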

Breakouts

Breakout sessions provided the workshop participants with an opportunity to discuss specific topics identified in the course of the workshop in smaller groups, and to dive deeper into the specifics. See the breakout sessions (1) and (2) for the full list of sessions, and details from selected sessions below.

Halving the draw calls with WEBGL_multiview (Olli Etuaho, NVIDIA)

See also: Presentation, Strawman proposal, and Minutes

The goal of this breakout was to resolve open questions on what a WebGL version of OVR_multiview should look like and how to display a framebuffer generated using such an extension in an efficient way in WebVR, as well as to understand what kind of changes are required in the canvas or WebVR spec to enable more efficient stereo rendering.

The WebGL extension proposal based on OVR_multiview discussed in this breakout promises to halve draw calls in JS. The open questions discussed were the following:

The more shader restrictions are shared, the more room there is for accelerating the extension at the lower levels of the stack. With fewer shader restrictions, more flexibility is given to the application (such as the browser). A possible compromise would be to expose multiple levels of restrictions.

OVR_multiview renders into layers of an array texture, while WebVR requires left and right views to be rendered side by side to a canvas element specified as the VRLayer source. Three options were discussed for getting from two layers to WebVR without triggering extra copies: specify WEBGL_multiview to render side by side, use a new layered swap chain as the VRLayer source, or add a layered stereo option to the canvas element.
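For context, a sketch of the side-by-side stereo rendering pattern that WEBGL_multiview aims to optimize: the scene is drawn twice, once per eye, into each half of the WebVR canvas, doubling the draw calls. drawScene() is a placeholder for the application's draw calls:

    // Today's pattern that multiview would optimize: draw the scene twice,
    // once per eye, into the left and right halves of the WebVR canvas.
    vrDisplay.getFrameData(frameData);

    // Left eye
    gl.viewport(0, 0, canvas.width / 2, canvas.height);
    drawScene(frameData.leftProjectionMatrix, frameData.leftViewMatrix);

    // Right eye: a second full set of draw calls for the same scene.
    gl.viewport(canvas.width / 2, 0, canvas.width / 2, canvas.height);
    drawScene(frameData.rightProjectionMatrix, frameData.rightViewMatrix);

    vrDisplay.submitFrame();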

Accessibility for WebVR (Charles LaPierre, Benetech)

See also: Minutes

The goal of this breakout session was to find a path forward which would allow existing assistive technologies such as screen readers, braille displays, etc. to interact with VR and provide an alternative modality for information presented visually or auditorily.

Web standards can help direct developers towards accessibility best practices, so that accessibility is added from the beginning and not as an afterthought. How APIs could allow assistive technology to get descriptions of objects, scenes, etc. for visual impairments is an open question, as is how to obtain these descriptions and how detailed they need to be. Examples of accessibility APIs and features beneficial for assistive technology:

Depth Sensing on the Web (Ningxin Hu, Intel / Rob Manson, awe.media)

See also: Minutes, Media Capture Depth Stream Extensions spec, and Spatial mapping demo

Depth cameras are able to provide a depth map, which gives the distance between points on an object's surface and the camera. Some cameras are also able to provide a synchronized RGB image along with a depth map. Nowadays, depth cameras are being built into VR/MR/AR devices, like Leap Motion for Oculus, Project Alloy, HoloLens, and Tango, for usages like hand tracking, gesture recognition, collision avoidance, and spatial mapping, which are essential for immersive VR/MR/AR experiences.

The goal of this breakout session was to review the depth sensing capabilities exposed to the Web, identify the gaps for VR/MR/AR usages, and discuss the potential working areas and solutions.

The Media Capture Depth Stream Extensions spec focuses on exposing raw depth data to the Web. The spec also adds the depth camera intrinsics, like field of view, focal length, near/far, and color camera synchronization, to support depth-based computer vision as in the Point Cloud Library. When integrating a depth camera into an HMD, requirements around sensor location, synchronization with the IMU/HMD, frame rates, and coordinate conversion come up.
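A minimal sketch of requesting a depth stream, assuming the videoKind constraint used in drafts of the Media Capture Depth Stream Extensions spec (the exact constraint may differ):

    // Sketch of requesting a depth stream; the 'videoKind' constraint follows
    // a draft of the Media Capture Depth Stream Extensions and may differ
    // from the final spec.
    navigator.mediaDevices.getUserMedia({
      video: { videoKind: { exact: 'depth' } }
    }).then((stream) => {
      const depthVideo = document.querySelector('#depth');
      depthVideo.srcObject = stream;
      // The depth frames can then be uploaded to WebGL as textures for
      // further processing.
    }).catch((err) => console.error('Depth camera not available:', err));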

For some high-level usages, like real-time spatial mapping, it is quite challenging in terms of performance and power if processing of the raw depth data is done in JS/WebAssembly on the CPU. Native applications commonly leverage GPU parallelism or offload processing to a dedicated co-processor (e.g. the Holographic Processing Unit, HPU, of HoloLens). This calls for a higher-level API optimized for these usages. As an example, an experimental spatial mapping Web API demo was presented (https://youtu.be/pXyDiYJO0nA) that surfaces the environmental mesh data in an array buffer and uploads it to a WebGL shader.

For next steps, a combined interest group with other related initiatives (WebGL, getUserMedia, WebVR, Permissions) was proposed.

High-Performance Processing on the Web (Philipp Slusallek, DFKI)

See also: Minutes, shade.js, Xflow

The motivation for this session was the need for compute-intensive workloads in WebVR applications for which current hardware capabilities are not yet exposed and exposure might be difficult. The ability to take advantage of HW capabilities such as SIMD instructions, GPU compute, or other HW accelerators is not only interesting for better performance but will likely also reduce power consumption, which is of particular importance on mobile platforms. Examples of such workloads are non-trivial animation computations; physics computation; image and video processing (e.g. for AR); spatialized audio effects; particle, lens, and special effects; and other application-specific computations. The session focused mainly on data parallelism, as thread-level parallelism seems to be well covered by Web Workers.

The following observations were discussed during the session:

Current JS is not sufficient for this task as it cannot express the semantics of the parallelism needed for the above applications.

The use of high-level and well-known languages such as (restricted forms of) JS seems to be the best way to expose such features. This may involve a compiler (in JS or built-in) that can translate that language into the native code that gets executed. The compiler can also implement necessary security measures that may be required. An example for this approach is shade.js, which compiles a restricted form of JS into fragment and vertex shaders for the GPU.

An alternative to a fully programmable approach is based on a flow-graph approach, where some input data flows through a graph of separate “operators” that transform the data at each node, providing new data to the next nodes in the graph. As an add-on, the individual nodes may be programmable (see above). This combines a declarative approach (configuring the network) with a programmable approach (implementing the individual nodes). A runtime system may then be able to “compile” such a network description into high-performance code. An example is Xflow.
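A purely illustrative sketch of the flow-graph idea, not the actual Xflow API; Graph, addInput, addOperator, and connectOutput are hypothetical names:

    // Purely illustrative flow-graph sketch (not the actual Xflow API):
    // declarative operator nodes that a runtime could compile to optimized
    // (SIMD/GPU) code. Graph, addInput, addOperator and connectOutput are
    // hypothetical names.
    const graph = new Graph();

    const positions = graph.addInput('positions');  // Float32Array stream
    const time      = graph.addInput('time');

    // Each operator declares what it computes; the runtime decides how.
    const waved   = graph.addOperator('waveDeform', { in: [positions, time] });
    const normals = graph.addOperator('computeNormals', { in: [waved] });

    graph.connectOutput(normals, (result) => {
      // Feed the transformed geometry to WebGL each frame.
    });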

Low-level instructions exposed in JS (such as SIMD.js) may be an enabler for the above technologies but are in themselves too low-level for most Web developers.
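For illustration, a minimal sketch using the SIMD.js proposal as it stood at the time (it was later abandoned in favor of WebAssembly SIMD):

    // Sketch using the SIMD.js proposal: add two 4-lane float vectors at once.
    const a = SIMD.Float32x4(1.0, 2.0, 3.0, 4.0);
    const b = SIMD.Float32x4(5.0, 6.0, 7.0, 8.0);
    const sum = SIMD.Float32x4.add(a, b);             // (6, 8, 10, 12)

    console.log(SIMD.Float32x4.extractLane(sum, 0));  // 6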

It is interesting to see that web browsers already contain (JIT) compilers that could be used as backends for some of these ideas via a new frontend in the browser.

WebAssembly seems to be a promising candidate for combining “native” code from traditional compilers with the Web. However, it currently does not offer SIMD features or GPU offload. Interestingly, there are strong similarities to SPIR-V from Khronos, which starts exactly from a GPU point of view. It would be beneficial to bring these two groups together.

Next steps could be the development of (i) accelerated JS libraries (which will happen anyway) using enabling technology such as SIMD.js or GLSL; (ii) a flow-graph-based approach, which would allow optimizations beyond individual functions and with finer granularity; and (iii) a restricted JS language to express parallelism, which seems to offer the most value but requires significant, more long-term work.

It was agreed to stay in closer contact with Khronos (Neil Trevett) to discuss common interests and possible standardization efforts.

Link traversal, are we there yet? (Fabien Benetou, Freelance)

See also: Minutes, Presentation, Link traversal with A-Frame, and Teleportation in A-Frame

Hyperlinks are a defining characteristic of the Web, a concept that links together various pieces of content, and they are required to make the transition from the information age, the current paradigm of 2D information, into the experimental medium of VR.

The goals of this breakout were, first, to evaluate existing implementations, compare approaches with current specs, and define limitations and new needs such as deep-linking equivalents and avatar persistence across experiences; and second, to identify what constitutes the minimal set of features a browser must have in order to navigate from one VR experience to another VR experience, potentially hosted by a different content provider, without leaving VR.

To maintain trust on the Web while navigating across VR experiences, we need to preserve user control, security, and openness. User control means users maintain the freedom to choose where to go, and how to get there. Security considerations and best practices (HTTPS, Content Security Policy to prevent XSS, anti-phishing) apply equally to WebVR. Openness in the context of WebVR hyperlinking means that the various methods of navigating must be preserved (e.g. window.location, <a>, location headers, <meta http-equiv>).
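A minimal sketch of one mechanism available for this in WebVR 1.1: the destination page listens for the vrdisplayactivate event, fired on arrival when the user is already wearing a headset, and resumes VR presentation; webglCanvas and startRenderLoop are placeholders for the page's own setup:

    // Sketch of in-VR link traversal with WebVR 1.1: the destination page
    // listens for 'vrdisplayactivate', fired on arrival when the user is
    // already in a headset, and resumes presentation immediately.
    window.addEventListener('vrdisplayactivate', (event) => {
      const vrDisplay = event.display;
      vrDisplay.requestPresent([{ source: webglCanvas }])
        .then(() => startRenderLoop(vrDisplay))   // app-specific render loop
        .catch((err) => console.error('Could not re-enter VR:', err));
    });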

Next steps

See also: Presentations, Minutes, WebVR Community Group, and Declarative WebVR Community Group

The workshop closed off with a VR standardization landscape session that synthesized input from the workshop and presented a view on possible VR-related work items for standardization.

Existing relevant W3C standardization efforts identified include spatial audio, gamepad, web workers, media streaming, HTML media extensions, low-latency data & AV transfer, identity, depth camera, video worker, as well as color space management, performance metrics, UI security, and payments.

The workshop discussions also pointed out new standardization work that would be beneficial for the uptake of the VR ecosystem on the Web. Proposed new work for the near term included:

Based on feedback from the participants, longer-term standardization targets that warrant further exploration and incubation were also identified. These include 3D Object Model & eventing, declarative 3D in markup, navigation transitions, link traversal metadata, unified user input for VR, gesture recognition framework, handling fonts in 3D context, fine-grained scheduling, ARIA for VR, annotations for VR entities, and identity or avatar management. The incubation of these topics could happen, for example, in existing or new W3C Community Groups.

Furthermore, nascent topics brought forward in discussions that may or may not require standardization in the future included bridging the DOM to WebGL, 2D web browsing in VR, establishing UI patterns for VR, and real/virtual world bindings, including authenticity, ownership, and geographical control.

Thank You!

The organizers acknowledge with deep gratitude the efforts of those who helped with the organization and execution of this workshop. Special thanks go to the members of the Program Committee for their support and contributions, to the workshop host for providing us with top-notch meeting facilities, and to the sponsors for their support. Equally, thank you to the unsung heroes, our scribes, who helped document the various sessions, as well as to the media crew. And finally, all the workshop participants who collectively made the workshop such a productive, positive, inspiring, and also fun event – you all deserve a big thank you and a pat on the back!