Slide 1
A proposed web standard to Load and Run ML Models on the Web
Jonathan Bingham / 2020-07-31
Hello, I'm Jonathan Bingham, a product manager at Google.
I'm going to talk about how the web could provide native support for machine learning or ML.
Slide 2
This proposal is being incubated in the Machine Learning Community Group. Community Groups (CGs for short) are a type of group the W3C makes available for anyone to propose and join, and where much of the W3C's pre-standardization work happens.
You can read all about it on the website and in GitHub.
Slide 3
Draft Spec for the Model Loader API

const modelUrl = 'url/to/ml/model';
var exampleList = [{
  'Feature1': value1,
  'Feature2': value2
}];
var options = { maxResults: 5 };

const modelLoader = navigator.ml.createModelLoader();
const model = await modelLoader.load(modelUrl);
const compiledModel = await model.compile();

compiledModel.predict(exampleList, options)
  .then(inferences => inferences.forEach(result => console.log(result)))
  .catch(e => { console.error("Inference failed: " + e); });
The basic concept of the Model Loader API is that a web developer has a pre-trained ML model that's available by URL.
The model itself might've been created by a data scientist at the developer's company or organization or it could be shared as a public model available to anybody online.
With just the model URL, a few lines of JavaScript can load the model and compile it to the native hardware.
After that, the web app can perform predictions with it.
The particular ML model could be anything.
It could perform image classification on a selected photo.
It could detect an abusive comment that a user is typing into a text field, or it could transform a video feed to create augmented reality.
Really any idea you might have.
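As a concrete example of the image-classification case, here is a sketch that follows the draft API shown on the slide above; the model URL, the 'image' feature name, and the maxResults value are assumptions for illustration, not part of the draft spec.

```javascript
// Sketch only: follows the draft Model Loader API shape from the slide above.
// The model URL, the 'image' feature name, and maxResults are hypothetical.
const modelUrl = 'https://example.com/models/image-classifier.model';

async function classifyPhoto(pixelData) {
  const modelLoader = navigator.ml.createModelLoader();
  const model = await modelLoader.load(modelUrl);
  const compiledModel = await model.compile();

  // The draft API takes a list of examples; here, a single photo's pixels.
  const inferences = await compiledModel.predict(
    [{ 'image': pixelData }],
    { maxResults: 3 }
  );
  inferences.forEach(result => console.log(result));
}
```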
Slide 4
Why use machine learning in a web page as opposed to on the server?
According to the TensorFlow.js team, ML in the browser / client side means:
● lower latency
● higher privacy
● lower serving cost
Why would developers want to do this inside the browser on the client side?
Why not do it on the server?
According to the TensorFlow.js team (TensorFlow.js, or TFJS for short, is a JavaScript library for running machine learning models), there are three main reasons.
Lower latency, greater privacy, and lower serving cost.
Latency because no request and response needs to go to the server and come back.
Privacy because all of the data that's fed into the model can live on the device and never go to the servers.
And then lower serving cost because the cost of doing the prediction is not borne by the host website.
The computation happens on the user's device instead.
None of these benefits are specific to TensorFlow.js.
They apply to any JavaScript library for machine learning on the web.
Here are just a few of the other options.
There are many.
Slide 6
Why create a new web standard? Don't these awesome JavaScript libraries already address the need?
Now you might wonder if there are already so many JavaScript libraries out there for doing ML, why create a new web standard?
After all, the community has risen to the occasion and made these great libraries available.
Standards take a long time to get agreed on and then implemented in browsers.
Why not just use a JavaScript library today?
Slide 7
Speed matters for ML
New hardware enables new applications
TPU >> GPU >> CPU
[Image: Google TPU custom chip]
The short answer is speed.
It's all about performance.
ML is compute intensive; faster processors unlock new applications and make new experiences possible.
For certain kinds of computation, ML runs much faster on GPUs than it does on CPUs.
And new tensor processing units (tensors are a mathematical construct used throughout machine learning algorithms) and other ML-specific hardware run even faster than GPUs for some workloads.
That's why the major hardware vendors like Intel, Nvidia, Qualcomm, and yes, Google and Apple, are all working on new chips to make ML run faster.
We'd like web developers to be able to get access to the performance benefits too.
The web platform and web standards are how we can enable that.
Slide 8
The web provides APIs for acceleration today. They help!
Sample TensorFlow.js performance data: inference times for MobileNet, in ms.

                    WebGL   WASM    WASM+SIMD   Plain JS
iPhone XS           18.1    140     n/a         426.4
Pixel 3             77.3    266.2   n/a         2345.2
Desktop Linux       17.1    91.5    61.9        1049
Desktop Windows     41.6    123.1   37.2        1117
MacBook Pro 2018    19.6    98.4    30.2        893.5
We already have evidence that some of the recent web standards have improved performance for ML, sometimes dramatically.
The TensorFlow team (TensorFlow is a Python framework for building machine learning systems) ran some benchmarks comparing plain JavaScript to WebAssembly, WebAssembly with SIMD, and WebGL. WebAssembly (WASM for short) is a format for programs that execute much faster than JavaScript in browsers and that can be generated from existing code bases in non-JavaScript languages such as C, C++, and Rust. SIMD stands for Single Instruction Multiple Data, an approach to accelerating parallel computation on CPUs, a particularly needed feature for running machine learning models. WebGL is a JavaScript API designed to run GPU-accelerated 3D graphics in browsers; it can also be used to take advantage of the parallel computing capabilities of GPUs in general, another much-needed feature for running machine learning models.
The results show that MobileNet models run 10 to 20 times faster on WebGL compared to plain JavaScript.
That's a huge performance boost.
That's great.
But these standards are not specific to ML and they weren't created to make ML workloads specifically run faster.
They were created for graphics or for general purpose computing.
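To make the comparison concrete, here is a minimal sketch of how those backends are selected in TensorFlow.js today; the model and input are assumed to already exist, and the timing is simplified to a single run.

```javascript
// A minimal sketch: TensorFlow.js targets WebGL, WASM, or plain-JS ("cpu")
// backends explicitly. Assumes the @tensorflow/tfjs and
// @tensorflow/tfjs-backend-wasm packages are installed.
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm';

async function timeInference(backend, model, input) {
  await tf.setBackend(backend);   // 'webgl', 'wasm', or 'cpu'
  await tf.ready();
  const start = performance.now();
  const output = model.predict(input);
  await output.data();            // wait for the computation to finish
  return performance.now() - start;
}
```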
The question is: is there room for even more improvement if the browser can take advantage of native hardware that really is optimized for ML?
Slide 9
Running on native hardware can be even faster
Source: Ningxin Hu, March 14, 2019
Device: Pixel 3, Android 9, updated 12/2018, Chromium 70.0.3503
Device configuration: XPS 13 laptop, CPU: Intel i5-8250U, Ubuntu Linux 16.04, Chromium 70.0.3503
● Offloading heavy ops gives a significant speedup:
  ○ Conv2D (90% of computation): 5x faster on PC, 3x faster on smartphone
  ○ Conv2D + DepthwiseConv2D (99% of computation): 33x faster on PC, 7x faster on smartphone
● Creating a bigger graph by connecting more ops gives better performance:
  ○ Per-op graphs vs. one graph: 3.5x slower on PC, 1.5x slower on smartphone
The short answer is yes.
Ningxin at Intel has done probably more benchmarking on this than anyone.
Results he produced a year ago showed that accelerating even one or two of the very compute-intensive ML operations that are common in deep neural networks can lead to much faster performance.
Importantly, the performance gains are even larger than what's available with WebGL or WebAssembly alone.
There are some performance gains that can be unlocked by adding new standards beyond just general purpose computing APIs.
Slide 10
How could the web platform accelerate ML? (Lower level to higher level)
1. Operations: APIs for the most compute-intensive operations, like Conv2D and MatMul
2. Graph: an API similar to the Android Neural Networks API
3. Model loader: an API to load a model by URL and execute it with a native engine like CoreML, TFLite, WinML, etc.
4. Application-specific: APIs like Shape Detection for barcodes, faces, QR codes
The previous slides showed the benefit of accelerating just a few computational operations using some recently added web standards that are very low level.
They provide access to binary execution in the case of WebAssembly or GPUs in the case of WebGL.
There are other ways we could help web developers run ML faster, some a little higher level.
All these different approaches though have benefits and challenges.
Let's look at four of the alternatives.
Slide 11
Operation-level APIs for ML

Approach 1. Operations
Benefits:
✓ Small, simple API surface
✓ A few operations could provide a large performance gain
Challenges:
x No fusing, graph optimizations, or shared memory
x Too low-level for web developers to use directly
We've already seen that operations, this is slide 11 now, can provide a large performance boost.
They're small, simple APIs, which is great.
We could add more low-level APIs to accelerate deep learning, like convolution (a frequently used mathematical operation when running a machine learning model) and matrix multiplication (matrices are a mathematical construct used throughout machine learning algorithms).
There's a limit to how much performance gain can be achieved looking at operations individually though.
An ML model is a graph of many operations typically.
If some can happen on GPUs, but others happen on CPUs, it can be expensive to switch back and forth between the two contexts.
Memory buffers might need to be copied, execution handoffs can take time.
Also web developers aren't likely to directly use these low level operation APIs.
They'll typically rely on JavaScript libraries like TensorFlow.js to deal with all of the low level details for them.
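As a rough illustration of what "operation level" means, here is how a library like TensorFlow.js already exposes individual operations such as convolution and matrix multiplication; a browser-level operations API would sit underneath calls like these. The shapes and random weights are arbitrary placeholders.

```javascript
// Illustrative only: library-level operation calls in TensorFlow.js.
import * as tf from '@tensorflow/tfjs';

const image  = tf.randomNormal([1, 224, 224, 3]);  // NHWC input, placeholder data
const filter = tf.randomNormal([3, 3, 3, 16]);     // 3x3 kernel, 16 output channels

const convolved = tf.conv2d(image, filter, /* strides */ 1, /* pad */ 'same');
const flattened = convolved.reshape([1, -1]);
const weights   = tf.randomNormal([flattened.shape[1], 10]);
const logits    = tf.matMul(flattened, weights);   // matrix multiplication
```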
Slide 12
Graph APIs open up more performance gains

Approach 2. Graph
Benefits:
✓ Allows fusing, graph optimizations, and shared memory
Challenges:
x 100+ operations to standardize
x Large attack surface to secure
x ML frameworks need to be able to convert to the standard
x Large JavaScript API surface for browsers to implement
x Too low-level for web developers to use directly
Next, let's look at graph APIs.
One of the most popular examples of a graph API for ML is the Neural Networks API (NN API) for Android.
It supports models trained in multiple ML frameworks like TensorFlow and PyTorch (one of the most popular Python libraries for machine learning development, providing a low-level API).
And it can run on mobile devices with various chip sets.
These are all great attributes.
The NN API was the inspiration for the Web Neural Network (WebNN) proposal that's also being incubated in the Web ML Community Group.
By building up a graph of low level operations in JavaScript, and then handing off the whole graph to the browser for execution, it's possible to do smart things like run everything on GPUs or split up the graph and decide which parts to run on which chips.
One big challenge for a graph API is that it has JavaScript APIs for every different operation that's supported.
For perspective, the Android NN API supports over 120 operations.
TensorFlow supports more than 1,000.
That number has been growing by around 20% per year, which makes it really challenging to standardize.
You can learn more about the web neural network API proposal on the website.
There's a whole bunch of information and some active GitHub issues and threads and discussion that's been going on.
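For illustration, here is a rough sketch of the graph-building pattern described above, in the spirit of the WebNN proposal; the interface and method names here are illustrative assumptions, not the actual draft spec, which continues to evolve.

```javascript
// Illustrative sketch only: names approximate the WebNN proposal's builder
// pattern and are not the actual draft spec.
const filterData = new Float32Array(3 * 3 * 3 * 16);  // placeholder weights

const builder = navigator.ml.getNeuralNetworkContext().createModelBuilder();
const input   = builder.input('input', { type: 'float32', dimensions: [1, 224, 224, 3] });
const filter  = builder.constant({ type: 'float32', dimensions: [3, 3, 3, 16] }, filterData);
const conv    = builder.conv2d(input, filter);
const output  = builder.relu(conv);

// The whole graph is handed to the browser at once, so it can decide where
// to run it: GPU, CPU, or dedicated ML hardware.
const model = await builder.createModel({ output });
```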
Slide 14
Application-specific ML APIs are easiest for developers

Approach 4. Application-specific
Benefits:
✓ Small, simple API surface
✓ Easy for developers to use directly
Challenges:
x Customized models are impossible/hard
x Long delay before models are added to the web platform

Examples of Shape Detection APIs:
● Barcodes
● QR codes
● Faces
● Text in an image
● Features of an image

const faceDetector = new FaceDetector({
  maxDetectedFaces: 5,
  fastMode: false
});
try {
  const faces = await faceDetector.detect(image);
  faces.forEach(face => drawMustache(face));
} catch (e) {
  console.error('Face detection failed:', e);
}
Compared to operations APIs or a graph API, an application-specific API like the Shape Detection API is something that a web developer would use directly.
It's super simple.
No JavaScript ML library is required.
You can look at the code snippet on this slide.
To the extent that there are specific ML applications that are common, that developers want access to and that are likely to remain common for many years, it makes sense to provide an easy way to add them to a web app.
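For example, alongside the FaceDetector snippet on the slide, the Shape Detection family also covers barcodes and QR codes. A minimal sketch, noting that browser support for these APIs is still limited, so feature detection is advisable:

```javascript
// Detecting QR codes with the Shape Detection API, where available.
if ('BarcodeDetector' in window) {
  const barcodeDetector = new BarcodeDetector({ formats: ['qr_code'] });
  try {
    const barcodes = await barcodeDetector.detect(image);
    barcodes.forEach(code => console.log('Found QR code:', code.rawValue));
  } catch (e) {
    console.error('Barcode detection failed:', e);
  }
}
```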
Most of the innovation in ML, though, is happening with custom models.
A company might want to optimize their ML models based on their own locale or their product catalog or other features that a high level API would have a hard time accommodating.
You know, there are a small number of APIs that are extremely valuable, that are common, that many people would want to use.
It makes sense to have an API for each of those.
But there's a much larger number of potential models that people would want to run.
And you can't easily make an API for each one of those.
Slide 15
The Model Loader API balances flexibility and performance

Approach 3. Model loader
Benefits:
✓ Small, simple JavaScript API surface
✓ Easy for developers to use directly
✓ Allows fusing, graph optimizations, shared memory
✓ Existing ML model formats provide several full specs
✓ Unblocks experimentation and ML evolution
Challenges:
x 100+ operations to parse and validate
x Large attack surface to secure
x CoreML, PyTorch, TFLite, WinML formats are only partly convertible
x What format(s) to support? There are many...
The Model Loader API tries to strike the balance between flexibility and developer friendliness.
It provides a small, simple JavaScript API surface that could stand the test of time.
It supports all of the performance optimizations that you can get with a graph API.
The main difference from the graph API is that it moves the definitions of the 100 or more operations out of JavaScript and into the model file format, which is simply stored at a URL.
That file still needs to be parsed and validated by the browser and secured against potentially malicious code.
Machine learning execution engines like CoreML (the platform API for machine learning on Apple operating systems, including iOS and macOS), TensorFlow, and WinML (the platform API for machine learning on Windows) already know how to execute an ML model in their respective formats, and they can be optimized for the underlying operating system.
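One way to read "complementary" here: a web app could feature-detect the Model Loader API and fall back to a JavaScript library when it is absent. A minimal sketch, where loadWithTfjs is a hypothetical wrapper around TensorFlow.js:

```javascript
// Hedged sketch: prefer the (proposed) native Model Loader API, fall back to
// a JavaScript ML library when it is not available.
async function loadModel(modelUrl) {
  if (navigator.ml && navigator.ml.createModelLoader) {
    // Native path: the browser / OS engine parses, validates, and compiles.
    const loader = navigator.ml.createModelLoader();
    const model = await loader.load(modelUrl);
    return model.compile();
  }
  // Fallback path: run the model with a JavaScript ML library instead.
  return loadWithTfjs(modelUrl);   // hypothetical wrapper around TensorFlow.js
}
```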
Slide 16
Summarizing the options for ML APIs on the web
■ Building ML-specific APIs into the web can increase performance
■ There are multiple approaches, with tradeoffs
■ The Model Loader API is complementary to graph, operations, and application-specific APIs
■ We don't know yet which level(s) of API we should propose to a working group.
  ○ Let's get feedback from developers
Summarizing the options for ML APIs on the web then: the goal for all of them is to increase performance.
That's why we want to have ML specific APIs rather than just do everything in JavaScript using existing general purpose APIs.
There are multiple approaches and each one of them has pros and cons.
The Model Loader API is complementary to the others and we could choose to pursue it in addition to one or more of the other approaches.
We could do all of them, or we could pick one.
We don't know yet though which is really the best.
So I'd like to see us move ahead and get some feedback from developers who are going to actually use these APIs.
Slide 17
Caveat: it's early days and there are big challenges
■ ML is evolving rapidly.
  ○ New computational operations are being invented and published regularly.
  ○ E.g., TensorFlow has seen around 20% growth in operations every year.
■ Hardware is evolving too, with tensor processing units and more.
■ Backward compatibility guarantees are essential for web standards, and not yet common for ML libraries.
■ ML frameworks each have their own operation sets; overlap between them is only partial, and conversion is not always possible.
  ○ The ONNX project (onnx.ai) is trying to define a common subset.
There's some important caveats.
It's really important to call them out because these are potential showstoppers and we need to be aware of them upfront.
The first major caveat is just that ML is evolving super fast.
New operations and approaches are being invented by researchers and published literally every day.
Hardware is evolving as well.
That's on a slower cycle, of course, because of the cost of fabrication and setting up manufacturing at scale.
But hardware is changing quickly, too.
Meanwhile, the web is forever.
Backwards compatibility is essential.
Just as an example, the Android NN API hasn't really solved backwards compatibility, despite broad adoption.
The web is just getting started with ML standards.
So it's early.
Finally, there are multiple popular ML frameworks.
TensorFlow is one of them.
CoreML and WinML are really important too, because those are the frameworks that the operating system vendors, Apple and Microsoft, support natively.
Each of these frameworks chooses what set of operations to support and they make those decisions independently with their own communities.
That means they've all made different choices, and they're evolving at different rates.
There's only partial overlap from one ML framework to the next in terms of what operations they support.
And conversion is not possible in general.
Fortunately, a subset of those operations can be converted and standardized.
There's an initiative called ONNX that is working on exactly this problem.
But it's worth calling out that conversion is hard and is not going to be a hundred percent complete.
My personal perspective is that the Model Loader API gives us a way to explore ML on the web with full hardware acceleration while all of these uncertainties are being worked out.
Slide 18
The current plan
✓ Incubate in the Web ML Community Group
➢ Now: Chrome and Chrome OS are working on an experimental build with TFLite integration
  ○ Coordinate with WebNN API efforts
■ Next: shim the Model Loader API on top
  ○ Goal: alternate model formats and execution engines are possible
  ○ Run benchmarking to measure the performance gains
■ Make a custom build available to developers
■ Gather feedback
I want to conclude with a status report about where the Model Loader API is in the standards process.
Currently, it's incubating in a community group.
Engineers in Chrome and ChromeOS are working on an experimental browser build with TFLite integration as a backend.
It's a goal to be able to support other ML engines as well.
There are a bunch of technical things to work out, like process isolation, file validation, and security protections.
Once those have been addressed, the next step would be to implement the Model Loader API on top of it and to make a custom browser build available so that developers could take a look at it.
We'd really like to get feedback from some early developers to understand what we can do to make a better API and whether this is the right level of abstraction.
Slide 19
Thank You
binghamj@google.com
github.com/webmachinelearning/model-loader