Slide 1
Access purpose-built ML hardware with Web Neural Network API
Ningxin Hu, Intel Corporation, July 2020
Hello everyone, welcome.
I'm Ningxin Hu, a principal engineer at Intel.
I'm participating in the W3C Machine Learning for the Web Community Group.
Today, my topic is the new web standard proposal, the Web Neural Network API, and how it can help web apps and frameworks access purpose-built machine learning hardware.
I'd like to thank everyone who has contributed to this proposal.
Slide 2
The JS ML frameworks and AI Web apps [diagram: AI features of web apps (semantic segmentation, object detection, speech recognition, noise suppression) built on JS ML frameworks (TensorFlow.js, ONNX.js, Paddle.js, OpenCV.js) in the web browser, running over WebAssembly and WebGL/WebGPU on CPU and GPU hardware]
As you know, over the last decade machine learning, and in particular deep learning, has become increasingly important and widely applied in many applications, like computer vision, speech recognition, and noise cancellation.
Nowadays, thanks to emerging JavaScript machine learning frameworks, web apps can easily incorporate these innovative usages by running machine learning models in the web browser.
Under the hood, those frameworks usually leverage WebAssembly (WASM for short: a format for programs that can be executed very fast, much faster than JavaScript, in browsers, and that can be generated from existing code bases in non-JavaScript languages such as C, C++, and Rust), WebGL (a JavaScript API designed to run GPU-accelerated 3D graphics in browsers, which can also be used to take advantage of the parallel computing capabilities of GPUs in general, a much needed feature for running machine learning models), and WebGPU (an emerging JavaScript API to interact with GPU capabilities with more in-depth integration than is possible with WebGL, including fast parallel computing) to run the machine learning computation on the CPU and GPU, respectively.
Slide 3
The purpose-built ML hardware [diagram: the same stack, with ML extensions added to the CPU and GPU and new accelerators (NPU, VPU, DSP) in the hardware layer]
On the other hand, driven by the exponential increase in the computing demand of machine learning workloads, hardware architecture innovation is advancing very fast.
Machine learning extensions have been added to CPUs and GPUs.
A bunch of new dedicated machine learning accelerators are emerging, such as NPUs, VPUs, and DSPs (DSP can stand for Digital Signal Processing or for Digital Signal Processor, a hardware chip specialized in digital signal processing).
These dedicated accelerators not only help optimize the performance but also help reduce the power consumption.
By taking advantage of these new hardware features, native apps get very good performance.
So, how about the web?
Slide 4
The performance gap: Web and native [charts: MobileNet* inference latency in ms, smaller is better.
Laptop with VNNI**: Wasm/SIMD128/FP32 33, OpenVINO/CPU/FP32 3.4, WebGL/GPU/FP16 26.8, OpenVINO/GPU/FP16 3, OpenVINO/VNNI/INT8 1.1 (native vs. web speedups: 9.7X CPU, 8.9X GPU, 24X INT8).
Smartphone with DSP: Wasm/SIMD128/FP32 85, NNAPI/CPU/FP32 33, WebGL/GPU/FP16 64, NNAPI/GPU/FP16 12, NNAPI/DSP/INT8 4 (native vs. web speedups: 2.6X CPU, 5.8X GPU, 16X DSP).
* Batch size: 1, input size: 224x224, width multiplier: 1.0
** VNNI: Vector Neural Network Instruction]
To compare the performance of web and native, we use MobileNet as the workload and measure the inference latency.
For hardware devices, we use a laptop with Vector Neural Network Instructions (also known as VNNI) in its CPU, and a smartphone with a DSP.
According to the charts, there is a big performance gap between web and native.
For instance, on the laptop, the native CPU inference is about 10 times faster than WebAssembly at float32 precision.
The reason is that native code can use 256-bit vector instructions, whereas WebAssembly only has 128-bit ones.
The native GPU inference is about nine times faster than WebGL at float16 precision.
That's because optimized machine learning kernels within the GPU driver are not available to WebGL.
On the smartphone, we observed a similar result.
If we go to lower-precision inference, also known as quantization, a widely used technique to optimize inference performance, VNNI and the DSP are designed to accelerate that.
So, using 8-bit precision on the laptop, the native inference can be 24 times faster than the web.
And on the smartphone, the DSP inference can be 16 times faster than the web.
Slide 5
The Web is disconnected from ML hardware [diagram: the same stack, with a question mark between the web APIs and the ML extensions and accelerators in the hardware layer]
The JavaScript machine learning frameworks cannot take advantage of these hardware features, which leads to the big performance gap.
It would be good to expose them on the web platform.
However, due to the architectural diversity of this new machine learning hardware, it is quite challenging to expose them through the general-purpose CPU and GPU compute web APIs.
Given that, we are proposing a new domain-specific web API to access hardware acceleration for machine learning.
Slide 6
WebNN: the architecture view [diagram: web apps with TensorFlow, ONNX, and other models on JS ML frameworks (TensorFlow.js, ONNX.js, etc.); in the web browser, WebNN sits alongside WebAssembly and WebGL/WebGPU; WebNN maps to native ML APIs (NN API on Android, DirectML on Windows, BNNS/MPS on macOS/iOS, OpenVINO on Linux), which run on CPUs and GPUs with ML extensions and on ML accelerators]
The proposal is the Web Neural Network API, also known as WebNN.
In the first stage, it focuses on hardware acceleration for inference.
It introduces the primitives of the deep neural network to the web platform.
The primitives include the tensor operand (tensors are a mathematical construct used throughout machine learning algorithms).
A tensor operand represents a multi-dimensional array in different data types, including 32- and 16-bit floating point and 32- and 8-bit integer.
The primitives also include a set of tensor operations, such as convolution (a frequently used mathematical operation when running a machine learning model), matrix multiplication (matrices are a mathematical construct used throughout machine learning algorithms), pooling, element-wise operations, and activations.
These operations are either compute-intensive or memory-bandwidth bound.
Accelerating them is critical to inference performance.
By using these primitives, the JavaScript machine learning framework can define a computational graph.
The graph can represent part or whole of a machine learning inference model.
Then, the framework can use WebNN API to compile and execute the graph for hardware acceleration.
The execution of the WebNN graph can interact with kernels written in WebAssembly or WebGPU compute shader.
With that, frameworks can be flexible, using WebNN for hardware acceleration and WebAssembly or WebGPU for custom operation support.
The primitives of WebNN can be mapped to the native machine learning APIs available on different operating systems, such as the Android Neural Networks API, DirectML on Windows, Metal Performance Shaders on macOS and iOS, and OpenVINO (an Intel toolkit for optimizing machine learning models) on Linux.
Eventually, these native APIs will talk with compilers and drivers to run these primitives on various machine learning hardware.
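To make that concrete, here is a minimal sketch of how a framework might define such a graph with WebNN primitives; the operand descriptor format, dimensions, and weight buffers are illustrative assumptions rather than the exact spec API.

// Get the neural network context from the navigator object.
const nn = navigator.ml.getNeuralNetworkContext();
// Tensor operands: a graph input plus constant weights (filter and bias).
// The descriptor shape ({type, dimensions}) and the data buffers are assumed for illustration.
const input = nn.input('input', {type: 'tensor-float32', dimensions: [1, 3, 224, 224]});
const filter = nn.constant({type: 'tensor-float32', dimensions: [32, 3, 3, 3]}, filterData);
const bias = nn.constant({type: 'tensor-float32', dimensions: [32]}, biasData);
// Tensor operations composed into a conv2d -> add -> relu graph.
const conv = nn.conv2d(input, filter);
const output = nn.relu(nn.add(conv, bias));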
Slide 7
WebNN: the programming model [diagram: NeuralNetworkContext (nn = navigator.ml.getNeuralNetworkContext) exposing nn.input / nn.constant / nn.conv2d / nn.add / nn.relu / ...; a computational graph with input, filter, and bias operands feeding conv2d, add, and relu into an output; nn.createModel produces a Model, createCompilation (with options) a Compilation, and createExecution an Execution that takes setInput/setOutput buffers and runs startCompute] https://webmachinelearning.github.io/webnn/
The WebNN proposal has four major interfaces: the neural network context, model, compilation, and execution.
The main programming flow is like this: first, you can get a neural network context object from the global navigator object.
The neural network context object has methods for tensor operands and operations, so you can use it to define the computational graph, as the example on the slide shows.
Then, you can create a model object based on this graph.
The model object represents the hardware-independent form of the graph.
For hardware acceleration, you need to create a compilation object, which represents the hardware-specific form of the graph.
You can also specify compilation options, for example high performance or low power, so that the browser and native API can select appropriate hardware for you.
After the compilation is done, you can create the execution object.
It represents an inference request bound to specific input and output buffers.
The current spec supports CPU buffers.
GPU buffers and other types of input and output will be supported in future versions of the spec.
Contributions are welcome.
The execution's startCompute is an asynchronous operation, which hides the latency of the hardware acceleration.
After the promise is resolved, the result is put into the output buffer.
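Putting the whole flow together, a minimal sketch for the graph defined earlier might look like this; the compilation option name, buffer sizes, and exact method signatures are assumptions for illustration.

// Build a hardware-independent model from the graph's named output.
const model = await nn.createModel([{name: 'output', operand: output}]);
// Compile it into a hardware-specific form; the options let the browser
// and native API pick a suitable device (option name assumed).
const compilation = await model.createCompilation({powerPreference: 'low-power'});
// Bind CPU buffers for input and output to an inference request and run it asynchronously.
const execution = await compilation.createExecution();
const inputBuffer = new Float32Array(1 * 3 * 224 * 224);   // filled with the input image data
const outputBuffer = new Float32Array(1 * 32 * 222 * 222); // sized to the assumed output shape
execution.setInput('input', inputBuffer);
execution.setOutput('output', outputBuffer);
await execution.startCompute(); // when the promise resolves, the result is in outputBuffer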
Slide 8
WebNN: the proof-of-concept implementation [diagram: in a customized Chromium, the renderer process hosts the WebNN NeuralNetworkContext in Blink; the GPU process hosts the Model/Compilation/Execution service with Android (NNAPI), macOS (MPS/BNNS), Windows (DirectML), and Linux (OpenVINO) backends, connected over IPC and running on CPUs and GPUs with ML extensions and on ML accelerators] https://github.com/otcshare/chromium-src
To prove the concept, we experimentally implemented the WebNN API in a customized Chromium browser. Following Chromium's multi-process architecture, we implemented the JavaScript interface of WebNN in Blink and the native API backend in the GPU process.
The two components talk to each other through the IPC mechanism.
To test the cross-platform interoperability and performance on different devices, this prototype supports four OSs, Windows, Android, macOS, and Linux, and can access the CPU, GPU, and accelerators on smartphones and PCs.
Slide 9
WebNN: the demos [videos: WebNN image classification on a laptop with VNNI and on a smartphone with a DSP] https://intel.github.io/webml-polyfill/examples/image_classification
Let's see some demos of the WebNN prototype.
On the laptop, first we test the inference performance with WebAssembly SIMD (SIMD stands for Single Instruction, Multiple Data, an approach to accelerating parallel computing operations on CPUs, a particularly needed feature for running machine learning models). Then we test the performance of the WebNN CPU backend for 32-bit floating point inference.
Last, we test the performance of WebNN for 8-bit integer inference accelerated by VNNI.
It has close-to-native performance.
On the smartphone, we first test the WebGL-based inference.
Then, we run inference on the same model with the WebNN GPU backend.
Last, we use WebNN to access the DSP for quantized model inference.
It has near-native performance.
Slide 10
WebNN: the PoC performance [chart: MobileNet* inference latency on laptop with VNNI**, in ms, smaller is better: Wasm/SIMD128/FP32 33, WebNN/OpenVINO/CPU/FP32 4.1, OpenVINO/CPU/FP32 3.4, WebGL/GPU/FP16 26.8, WebNN/OpenVINO/GPU/FP16 5.4, OpenVINO/GPU/FP16 3, WebNN/OpenVINO/VNNI/INT8 1.6, OpenVINO/VNNI/INT8 1.1 (WebNN vs. web speedups: 8X CPU, 4.9X GPU, 16X VNNI). * Batch size: 1, input size: 224x224, width multiplier: 1.0 ** VNNI: Vector Neural Network Instruction]
This slide and the next one have the performance numbers of the WebNN prototype that we collected on the laptop and the smartphone.
Feel free to check them out.
Slide 11
WebNN: the PoC performance, cont'd [chart: MobileNet* inference latency on smartphone with DSP, in ms, smaller is better: Wasm/SIMD128/FP32 85, WebNN/NNAPI/CPU/FP32 35, NNAPI/CPU/FP32 33, WebGL/GPU/FP16 64, WebNN/NNAPI/GPU/FP16 14, NNAPI/GPU/FP16 12, WebNN/NNAPI/DSP/INT8 6, NNAPI/DSP/INT8 4 (WebNN vs. web speedups: 2.4X CPU, 4.5X GPU, 10X DSP). * Batch size: 1, input size: 224x224, width multiplier: 1.0]
As the numbers show, by introducing domain-specific primitives and relying on the native machine learning APIs, WebNN can help access purpose-built machine learning hardware and close the gap between the web and native.
Slide 12
Call for Participation
https://www.w3.org/community/webmachinelearning/
https://webmachinelearning.github.io/webnn/
The Web Neural Network API is an incubation within the W3C Machine Learning for the Web Community Group.
Thanks for watching, and we look forward to your participation.