W3C Workshop on Web and Machine Learning

Access purpose-built ML hardware with Web Neural Network API - by Ningxin Hu (Intel)



Hello everyone, welcome.

I'm Ningxin Hu, a principal engineer at Intel.

I'm participating in the W3C Machine Learning for the Web Community Group.

Today, my topic is a new web standard proposal, the Web Neural Network API, and how it can help web apps and frameworks access purpose-built machine learning hardware.

I'd like to thank everyone who contributes to this proposal.

As you know, over the last decade machine learning, and deep learning in particular, has become increasingly important and is widely applied in applications like computer vision, speech recognition, and noise cancellation.

Nowadays, thanks to emerging JavaScript machine learning frameworks, web apps can easily incorporate these innovative usages by running machine learning models in the web browser.

Under the hood, those frameworks usually leverage WebAssembly to run the machine learning computation on the CPU, and WebGL or WebGPU to run it on the GPU.

On the other hand, driven by the exponentially increasing computing demand of machine learning workloads, hardware architecture innovation is advancing very fast.

Machine learning extensions have been added to CPUs and GPUs.

A range of new dedicated machine learning accelerators is emerging, such as NPUs, VPUs, and DSPs.

These dedicated accelerators not only help optimize the performance but also help reduce the power consumption.

By taking advantage of these new hardware features, native apps achieve very good performance.

So, how about the web?

To compare the performance of web and native, we use MobileNet as the workload and measure inference latency.

For hardware devices, we use a laptop whose CPU supports Vector Neural Network Instructions, also known as VNNI, and a smartphone with a DSP.

According to the charts, there is a big performance gap between web and native.

For instance, on the laptop, native CPU inference is about 10 times faster than WebAssembly at float32 precision.

The reason behind that is that native code can use 256-bit vector instructions, whereas WebAssembly only has 128-bit ones.

Native GPU inference is about 9 times faster than WebGL at float16 precision.

That's because the optimized machine learning kernels within the GPU driver are not available to WebGL.

On the smartphone, we observed a similar result.

If we go to lower-precision inference, also known as quantization, which is a widely used technique to optimize inference performance, VNNI and the DSP are designed to accelerate exactly that.

So, using 8-bit precision on the laptop, native inference can be 24 times faster than the web.

And on the smartphone, the DSP inference can be 16 times faster than the web.

The JavaScript machine learning frameworks cannot take advantage of these hardware features, and that leads to the big performance gap.

It would be good to expose them on the web platform.

However, due to the architectural diversity of this new machine learning hardware, it is quite challenging to expose it through the general-purpose CPU and GPU compute web APIs.

Given that, we are proposing a new domain-specific web API to access hardware acceleration for machine learning.

The proposal is the Web Neural Network API, also known as WebNN.

In its first stage, it focuses on hardware acceleration for inference.

It introduces the primitives of the deep neural network to the web platform.

The primitives include the tensor operand, which represents a multi-dimensional array in different data types, including 32- and 16-bit floating point and 32- and 8-bit integer.

The primitives also include a set of tensor operations, such as convolution, matrix multiplication, pooling, element-wise operations, and activations.

These operations are either compute-intensive or memory-bandwidth bound.

Accelerating them is critical to inference performance.
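To make these primitives concrete, here is a minimal sketch of a tensor operand and an operation. It assumes the entry point and method names of the early WebNN proposal (getNeuralNetworkContext, input, constant, matmul); the exact API shape may differ:

```typescript
// A sketch of WebNN primitives, assuming the early proposal's entry
// point (navigator.ml.getNeuralNetworkContext) and method names
// (input, constant, matmul); the exact API shape may differ.
const nn = (navigator as any).ml.getNeuralNetworkContext();

// Tensor operands: multi-dimensional arrays with an explicit data type.
const desc = { type: 'tensor-float32', dimensions: [2, 2] };
const a = nn.input('a', desc);                               // graph input
const b = nn.constant(desc, new Float32Array([1, 2, 3, 4])); // weights

// A tensor operation that connects operands into a graph node.
const c = nn.matmul(a, b);
```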

By using these primitives, the JavaScript machine learning framework can define a computational graph.

The graph can represent part or all of a machine learning inference model.

Then, the framework can use WebNN API to compile and execute the graph for hardware acceleration.

The execution of the WebNN graph can interact with kernels written in WebAssembly or in WebGPU compute shaders.

With that, frameworks can flexibly use WebNN for hardware acceleration and WebAssembly or WebGPU for custom operator support.
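As an illustration of that flexibility, a framework might post-process a WebNN result with a custom operator compiled to WebAssembly. In this sketch, the WASM module and its alloc and softmax exports are purely hypothetical, not part of any spec:

```typescript
// Hypothetical sketch of mixing a WebNN result with a custom
// WebAssembly kernel. The `alloc` and `softmax` exports are
// illustrative assumptions about the custom module.
async function applyCustomOp(
  webnnOutput: Float32Array, // result of a WebNN graph execution
  wasmUrl: string,           // custom kernel compiled to WebAssembly
): Promise<Float32Array> {
  const { instance } = await WebAssembly.instantiateStreaming(fetch(wasmUrl));
  const { memory, alloc, softmax } = instance.exports as any;

  // Copy the WebNN result into the module's linear memory...
  const ptr = alloc(webnnOutput.length * 4);
  new Float32Array(memory.buffer, ptr, webnnOutput.length).set(webnnOutput);

  // ...run the custom operator there, and copy the result back out.
  softmax(ptr, webnnOutput.length);
  return new Float32Array(memory.buffer, ptr, webnnOutput.length).slice();
}
```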

The primitives of WebNN can be mapped to the native machine learning APIs available on different operating systems, such as the Android Neural Networks API, DirectML on Windows, Metal Performance Shaders on macOS and iOS, and OpenVINO.

These native APIs, in turn, talk to compilers and drivers to run the primitives on various machine learning hardware.

The WebNN proposal has four major interfaces: the neural network context, the model, the compilation, and the execution.

The main programming flow is like this: first, you can get a neural network context object from the global navigator object.

The neural network context object has methods for creating tensor operands and operations, so you can use it to define the computational graph, as the example shows.

Then, you can create a model object based on this graph.

The model object represents the hardware-independent form of the graph.

For hardware acceleration, you need to create a compilation object, which represents the hardware-specific form of the graph.

You can also specify compilation options, for example high performance or low power, so the browser and the native API can select appropriate hardware for you.

After the compilation is done, you can create the execution object.

It represents an inference request bound to specific input and output buffers.

The current spec supports CPU buffers.

GPU buffers and other input and output types will be supported in future versions of the spec.

Contributions are welcome.

The execution's startCompute is an asynchronous operation, which hides the latency of the hardware acceleration.

After the promise is resolved, the result is put into the output buffer.
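Putting the whole flow together, here is a hedged end-to-end sketch. Again, the interface and method names (createModel, createCompilation, createExecution, startCompute) follow the early proposal described in this talk and may differ from later versions of the spec:

```typescript
// An end-to-end sketch of the flow described above: context -> graph ->
// model -> compilation -> execution. Interface and method names follow
// the early WebNN proposal and may differ from the current spec.
async function runInference(): Promise<Float32Array> {
  // 1. Get the neural network context from the global navigator object.
  const nn = (navigator as any).ml.getNeuralNetworkContext();

  // 2. Define the computational graph with tensor operands and operations.
  const desc = { type: 'tensor-float32', dimensions: [2, 2] };
  const a = nn.input('a', desc);
  const b = nn.constant(desc, new Float32Array([1, 2, 3, 4]));
  const c = nn.matmul(a, b);

  // 3. Create a model: the hardware-independent form of the graph.
  const model = await nn.createModel([{ name: 'c', operand: c }]);

  // 4. Compile it into a hardware-specific form; the option lets the
  //    browser and the native API pick appropriate hardware.
  const compilation = await model.createCompilation({
    powerPreference: 'high-performance', // or 'low-power'
  });

  // 5. Create an execution: an inference request bound to specific
  //    input and output buffers (CPU buffers in the current spec).
  const execution = await compilation.createExecution();
  execution.setInput('a', new Float32Array([5, 6, 7, 8]));
  const output = new Float32Array(4);
  execution.setOutput('c', output);

  // 6. startCompute is asynchronous, hiding the latency of the hardware
  //    acceleration; when the promise resolves, `output` holds the result.
  await execution.startCompute();
  return output;
}
```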

To prove the concept, we experimentally implemented the WebNN API in a customized Chromium browser. Following Chromium's multi-process architecture, we implemented the JavaScript interface of WebNN in Blink and the native API backend in the GPU process.

The two components talk to each other through Chromium's IPC mechanism.

To test the cross-platform interoperability and the performance on different devices, this prototype supports four operating systems, including Windows, Android, macOS, and Linux, and can access the CPU, GPU, and accelerators on smartphones and PCs.

Let's see some demos of the WebNN prototype.

On the laptop, first, we test the inference performance with WebAssembly SIMD. Then we test the performance of the WebNN CPU backend for float32 inference.

Finally, we test the performance of WebNN for 8-bit integer inference accelerated by VNNI.

It has close-to-native performance.

On the smartphone, we first test the WebGL-based inference.

Then, we run inference on the same model with the WebNN GPU backend.

Finally, we use WebNN to access the DSP for quantized model inference.

It has near-native performance.

This slide and the next one show the performance numbers of the WebNN prototype that we collected on the laptop and the smartphone.

Feel free to check them out.

As the numbers show, by introducing domain-specific primitives and relying on the native machine learning APIs, WebNN can help web apps access purpose-built machine learning hardware and close the gap between web and native.

The Web Neural Network API is an incubation within the W3C Machine Learning for the Web Community Group.

Thanks for watching, and we look forward to your participation.


