W3C Workshop on Web and Machine Learning

Fast client-side ML with TensorFlow.js - by Ann Yuan (Google)


Hi, my name is Ann, and I'm a software engineer at Google.

I work on TensorFlow.js, a JavaScript library for machine learning.

Since we launched in March of 2018, we've seen tremendous adoption by the JavaScript community with over two million downloads on NPM.

Meanwhile, a distinctive class of machine learning applications has emerged that leverages the unique advantages of on-device computation, such as access to sensor data and preservation of user privacy.

In this talk, I'll discuss how TFJS brings high-performance machine learning to JavaScript through standard and emerging web technologies, including WebAssembly, WebGL, and WebGPU.

TFJS defines an API for neural network operations, such as matrix multiplication and exponentiation.

These operations call into kernels, which implement the mathematical functions comprising the operation for a particular execution environment.

For example, WebAssembly or WebGL.

A TFJS backend is a collection of such kernels that are defined for the same environment.
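To make the op/kernel split concrete, here's a minimal sketch of the public API: the same op call dispatches to whichever kernel is registered for the active backend. (The tensor values are illustrative.)

    import * as tf from '@tensorflow/tfjs';

    async function main() {
      // Select an execution environment; 'cpu' is the plain JS backend.
      await tf.setBackend('webgl');
      await tf.ready();

      const a = tf.tensor2d([[1, 2], [3, 4]]);
      const b = tf.tensor2d([[5, 6], [7, 8]]);

      // matMul routes to the matrix-multiplication kernel registered
      // for the active backend (fragment shaders in the WebGL case).
      const c = tf.matMul(a, b);
      c.print();
    }

    main();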

Before diving into the details of our different backends, I'd like to provide an overview of how they compare in terms of performance.

This table shows how our WebGL, WebAssembly, and plain JS backends compare when it comes to inference on MobileNet, a medium-sized model with a few hundred million multiply-add operations.

For this model, our WebAssembly backend is between 3 and 11 times faster than our plain JS backend.

Our WebAssembly backend is between 5 and 8 times slower than our WebGL backend.

This next table shows how our backends compare when it comes to inference on face detector, a much smaller model with only around 20 million multiply-add operations.

In this case, our WebAssembly backend is between 8 and 20 times faster than our plain JS backend.

And it's actually comparable with our WebGL backend.

For example, on a 2018 MacBook Pro, our WebAssembly backend is twice as fast as our WebGL backend.

With SIMD enabled, it's 3 times faster.

These benchmarks demonstrate that there is no one-size-fits-all technology for machine learning on the web.

The best choice of execution environment depends on many factors, including the model architecture and the device.

Technologies such as WebAssembly and WebGL address different use cases.

And we as TFJS developers must invest in a wide swath of technologies in order to meet our users' needs.

Now I'll go into the details of some of our backends, starting with WebAssembly.

Our WebAssembly backend kernels are written in C++ and compiled with Emscripten.

We use XNNPACK, a highly optimized library of neural network operators, for further acceleration.
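In practice, loading the backend looks roughly like this; setWasmPaths is the package's helper for locating the compiled binaries, and the hosting path is illustrative:

    import * as tf from '@tensorflow/tfjs';
    import {setWasmPaths} from '@tensorflow/tfjs-backend-wasm';

    // Point the backend at the hosted .wasm binaries
    // (the path is illustrative; serve the files yourself).
    setWasmPaths('/static/tfjs-wasm/');

    async function main() {
      await tf.setBackend('wasm');
      await tf.ready();
      // Ops now execute on the XNNPACK-accelerated WASM kernels.
    }

    main();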

As we've seen, our WebAssembly backend is ideally suited for lighter models.

And in the last year, we've seen a wave of such production-quality models designed for edge devices.

But WebAssembly is steadily closing the performance gap with WebGL for larger models as well.

A few weeks ago, we added support for SIMD instructions to our WebAssembly backend.

This led to a 3x performance boost for MobileNet, and a 2x performance boost for face detector.

We're also actively working on adding support for multithreading through SharedArrayBuffer.

According to our internal benchmarks, multithreading will provide an additional 3x performance boost for MobileNet, and a 2x performance boost for face detector.
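One deployment note: browsers gate SharedArrayBuffer behind cross-origin isolation, so a page that wants the multithreaded backend has to opt in with two response headers. Here's a minimal sketch using Express, chosen purely for illustration:

    const express = require('express');
    const app = express();

    // These two headers make the page cross-origin isolated,
    // which is what unlocks SharedArrayBuffer in the browser.
    app.use((req, res, next) => {
      res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
      res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
      next();
    });

    app.use(express.static('public'));
    app.listen(8080);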

For these reasons, we expect adoption of our WebAssembly backend to continue to grow.

We're eager for more users to enjoy the benefits of SIMD and multithreading.

We're also closely following the progress of several evolving WebAssembly specifications, including flexible vectors for wider SIMD, quasi-fused multiply-add, and pseudo-minimum and maximum instructions.

We're also looking forward to ES6 module support for WebAssembly modules.

TFJS also offers a GPU-accelerated backend built on top of the WebGL API. We repurpose this API for high-performance numerical computation by representing data in the form of GPU textures, and using fragment shaders to execute neural network operations.

As we've seen, our WebGL backend is still the fastest for larger models containing wide operations that justify the fixed overhead costs of shader execution.
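As a rough illustration, an elementwise WebGL kernel boils down to a fragment shader like the one below. This is a simplified sketch, not the shader code TFJS actually generates:

    // Tensor values live in a texture; each fragment computes one
    // output element. Here: an elementwise ReLU.
    const fragmentSrc = `
      precision highp float;
      uniform sampler2D inputTex;
      varying vec2 uv;

      void main() {
        float x = texture2D(inputTex, uv).r;
        gl_FragColor = vec4(max(x, 0.0), 0.0, 0.0, 1.0);
      }
    `;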

Our WebGL backend is complex.

This complexity comes from many sources.

Firstly, WebGL implementations vary significantly across platforms, often with implications for numerical precision.

Much of our code is devoted to hiding these inconsistencies from our users.

Another significant source of complexity in our WebGL backend is manual memory management.

Because GPU resources are not garbage collected, we must explicitly manage resource disposal through reference counting.

To help our users avoid leaking memory, we expose a utility called tf.tidy that takes a function, executes it, and disposes any intermediate resources created by that function.
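In code, that looks like this (the tensors are illustrative):

    import * as tf from '@tensorflow/tfjs';

    const result = tf.tidy(() => {
      const a = tf.tensor1d([1, 2, 3]);
      const b = a.square();   // intermediate
      return b.mean();        // the returned tensor is kept alive
    });

    // `a` and `b` were disposed when tidy returned; only `result`
    // still needs a manual dispose once we're done with it.
    result.print();
    result.dispose();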

Despite these measures, memory management remains a source of error in our WebGL backend.

Therefore, we're excited about new proposals for user-defined finalizers that would give us more security against memory leaks.
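As a sketch of the idea, a FinalizationRegistry could reclaim GPU resources for tensors the user forgot to dispose. The registry and the disposal helper below are hypothetical glue, not an API TFJS ships:

    // Hypothetical: back-stop GPU resource disposal with a
    // user-defined finalizer.
    const leakGuard = new FinalizationRegistry((texture) => {
      // Runs some time after the owning tensor object has been
      // garbage collected.
      disposeTexture(texture);  // assumed backend disposal helper
    });

    function trackTensor(tensor, texture) {
      leakGuard.register(tensor, texture);
    }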

Finally, the lack of callbacks for asynchronous WebGL texture downloads means we must poll for download completion.

This has implications for both code complexity and performance.
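Concretely, the polling looks something like the sketch below: we insert a fence after issuing GPU work, then re-check it on a timer because WebGL2 offers no completion callback. (Here `gl` is assumed to be a WebGL2 context.)

    function whenGpuWorkDone(gl) {
      const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
      gl.flush();
      return new Promise((resolve) => {
        const check = () => {
          const status = gl.clientWaitSync(sync, 0, 0);
          if (status === gl.ALREADY_SIGNALED ||
              status === gl.CONDITION_SATISFIED) {
            gl.deleteSync(sync);
            resolve();
          } else {
            setTimeout(check, 0);  // no callback API, so poll
          }
        };
        check();
      });
    }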

TFJS also offers an experimental backend built on top of the emerging WebGPU standard.

WebGPU represents an exciting opportunity for addressing the pain points of WebGL.

WebGPU promises better performance and a dedicated API for GPU compute.

As the successor to WebGL, WebGPU is designed to operate directly with low-level graphics APIs, such as D3D, Metal, and Vulkan.

The WebGPU shading language will be directly ingested, and will hopefully offer faster shader compilation, compared to WebGL.
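For flavor, here's a tiny compute shader in WGSL, the WebGPU shading language. The syntax shown follows a later draft of the spec than existed at the time of this talk, so treat it as illustrative only:

    // Illustrative WGSL: an elementwise ReLU over a storage buffer.
    const wgsl = `
      @group(0) @binding(0) var<storage, read_write> data: array<f32>;

      @compute @workgroup_size(64)
      fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
        data[gid.x] = max(data[gid.x], 0.0);
      }
    `;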

This table shows inference speeds for the PoseNet model, built on top of the ResNet-50 architecture, a large model with several billion multiply-add operations.

These benchmarks show that, so far, WebGPU has not delivered significant out-of-the-box performance gains.

However, the technology is rapidly evolving, and we're continuing to track progress closely.

We're excited about the potential for future machine learning web standards to address the recurring pain points we faced in developing TFJS, including lack of portability and manual memory management.

Such standards also represent an opportunity to address the distinctive needs of machine learning-powered web applications.

For example, TFJS users have increasingly asked for ways to obfuscate their models in order to protect intellectual property.

We also hope that future standards will preserve the features that have made our progress thus far possible, such as detailed profiling and access to low-level APIs that give us the ability to define and debug operations at a granular level.

Alright, that's it for me.

Thank you very much.

