Slide 1
A proposed web standard to Load and Run ML Models on the Web
Jonathan Bingham / 2020-07-31
Hello, I'm Jonathan Bingham, a product manager at Google.
I'm going to talk about how the web could provide native support for machine learning or ML.
Slide 2
This proposal is being incubated in the Machine Learning Community Group. Community Groups (CGs for short) are a type of group the W3C makes available for anyone to propose and join, and where much of the W3C's pre-standardization work happens.
You can read all about it on the website and in GitHub.
Slide 3
Draft Spec for the Model Loader API

const modelUrl = 'url/to/ml/model';
var exampleList = [{
  'Feature1': value1,
  'Feature2': value2
}];
var options = { maxResults: 5 };

const modelLoader = navigator.ml.createModelLoader();
const model = await modelLoader.load(modelUrl);
const compiledModel = await model.compile();

compiledModel.predict(exampleList, options)
  .then(inferences => inferences.forEach(result => console.log(result)))
  .catch(e => { console.error("Inference failed: " + e); });
The basic concept of the Model Loader API is that a web developer has a pre-trained ML model that's available by URL.
The model itself might've been created by a data scientist at the developer's company or organization or it could be shared as a public model available to anybody online.
With just the model URL, a few lines of JavaScript can load the model and compile it to the native hardware.
After that, the web app can perform predictions with it.
The particular ML model could be anything.
It could perform image classification on a selected photo.
It could detect an abusive comment that a user is typing into a text field, or it could transform a video feed to create augmented reality.
Really any idea you might have.
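As a concrete example of the image-classification case, here is a sketch that follows the draft API shown on the slide above; the model URL, the 'image' feature name, and the maxResults value are assumptions for illustration, not part of the draft spec.

```javascript
// Sketch only: follows the draft Model Loader API shape from the slide above.
// The model URL, the 'image' feature name, and maxResults are hypothetical.
const modelUrl = 'https://example.com/models/image-classifier.model';

async function classifyPhoto(pixelData) {
  const modelLoader = navigator.ml.createModelLoader();
  const model = await modelLoader.load(modelUrl);
  const compiledModel = await model.compile();

  // The draft API takes a list of examples; here, a single photo's pixels.
  const inferences = await compiledModel.predict(
    [{ 'image': pixelData }],
    { maxResults: 3 }
  );
  inferences.forEach(result => console.log(result));
}
```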
Slide 4
Why use machine learning in a web page as opposed to on the server?
According to the TensorFlow.js team, ML in the browser / client side means:
● lower latency
● higher privacy
● lower serving cost
Why would developers want to do this inside the browser on the client side?
Why not do it on the server?
According to the TensorFlow.js team (TensorFlow.js, or TFJS for short, is a JavaScript library for running machine learning models), there are three main reasons.
Lower latency, greater privacy, and lower serving cost.
Latency because no request and response needs to go to the server and come back.
Privacy because all of the data that's fed into the model can live on the device and never go to the servers.
And then lower serving cost because the cost of doing the prediction is not borne by the host website.
The computation happens on the user's device instead.
None of these benefits are specific to TensorFlow.js.
They apply to any JavaScript library for machine learning on the web.
Here are just a few of the other options.
There are many.
Slide 6
Why create a new web standard? Don't these awesome JavaScript libraries already address the need?
Now you might wonder if there are already so many JavaScript libraries out there for doing ML, why create a new web standard?
After all, the community has risen to the occasion and made these great libraries available.
Standards take a long time to get agreed on and then implemented in browsers.
Why not just use a JavaScript library today?
Slide 7
Speed matters for ML
New hardware enables new applications
TPU >> GPU >> CPU
[Image: Google TPU custom chip]
The short answer is speed.
It's all about performance.
ML is compute intensive; faster processors unlock new applications and make new experiences possible.
For certain kinds of computation, ML runs much faster on GPUs than it does on CPUs.
And new tensor processing units (tensors are a mathematical construct used throughout machine learning algorithms) and other ML-specific hardware run even faster than GPUs for some workloads.
That's why the major hardware vendors like Intel, Nvidia, Qualcomm, and yes, Google and Apple, are all working on new chips to make ML run faster.
We'd like web developers to be able to get access to the performance benefits too.
The web platform and web standards are how we can enable that.
Slide 8
The web provides APIs for acceleration today. They help!
Sample TensorFlow.js performance data: inference times for MobileNet, in ms.

                    WebGL   WASM    WASM+SIMD   Plain JS
iPhone XS           18.1    140     n/a         426.4
Pixel 3             77.3    266.2   n/a         2345.2
Desktop Linux       17.1    91.5    61.9        1049
Desktop Windows     41.6    123.1   37.2        1117
MacBook Pro 2018    19.6    98.4    30.2        893.5
We already have evidence that some of the recent web standards have improved performance for ML, sometimes dramatically.
The TensorFlow team (TensorFlow is a Python framework for building machine learning systems) ran some benchmarks comparing plain JavaScript to WebAssembly, WebAssembly with SIMD, and WebGL. WebAssembly (WASM for short) is a format for programs that execute much faster than JavaScript in browsers and that can be generated from existing code bases in non-JavaScript languages such as C, C++, and Rust. SIMD stands for Single Instruction Multiple Data, an approach to accelerating parallel computation on CPUs, a particularly needed feature for running machine learning models. WebGL is a JavaScript API designed to run GPU-accelerated 3D graphics in browsers; it can also be used to take advantage of the parallel computing capabilities of GPUs in general, another much-needed feature for running machine learning models.
The results show that MobileNet models run 10 to 20 times faster on WebGL compared to plain JavaScript.
That's a huge performance boost.
That's great.
But these standards are not specific to ML and they weren't created to make ML workloads specifically run faster.
They were created for graphics or for general purpose computing.
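To make the comparison concrete, here is a minimal sketch of how those backends are selected in TensorFlow.js today; the model and input are assumed to already exist, and the timing is simplified to a single run.

```javascript
// A minimal sketch: TensorFlow.js targets WebGL, WASM, or plain-JS ("cpu")
// backends explicitly. Assumes the @tensorflow/tfjs and
// @tensorflow/tfjs-backend-wasm packages are installed.
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm';

async function timeInference(backend, model, input) {
  await tf.setBackend(backend);   // 'webgl', 'wasm', or 'cpu'
  await tf.ready();
  const start = performance.now();
  const output = model.predict(input);
  await output.data();            // wait for the computation to finish
  return performance.now() - start;
}
```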
The question is: is there room for even more improvement if the browser can take advantage of native hardware that really is optimized for ML?
Slide 9
Running on native hardware can be even faster
Source: Ningxin Hu, March 14, 2019
Device: Pixel 3, Android 9, updated 12/2018, Chromium 70.0.3503
Device configuration: XPS 13 laptop, CPU: Intel i5-8250U, Ubuntu Linux 16.04, Chromium 70.0.3503
● Offloading heavy ops gives a significant speedup:
  ○ Conv2D (90% of computation): 5x faster on PC, 3x faster on smartphone
  ○ Conv2D + DepthwiseConv2D (99% of computation): 33x faster on PC, 7x faster on smartphone
● Creating a bigger graph by connecting more ops gives better performance:
  ○ Per-op graphs vs. one graph: 3.5x slower on PC, 1.5x slower on smartphone
The short answer is yes.
Ningxin at Intel has done probably more benchmarking on this than anyone.
Results he produced a year ago showed that accelerating even one or two of the very compute-intensive ML operations that are common in deep neural networks can lead to much faster performance.
Importantly, the performance gains are even larger than what's available with WebGL or WebAssembly alone.
There are some performance gains that can be unlocked by adding new standards beyond just general purpose computing APIs.
Slide 10
How could the web platform accelerate ML? (Lower level to higher level)
1. Operations: APIs for the most compute-intensive operations, like Conv2D and MatMul
2. Graph: an API similar to the Android Neural Networks API
3. Model loader: an API to load a model by URL and execute it with a native engine like CoreML, TFLite, WinML, etc.
4. Application-specific: APIs like Shape Detection for barcodes, faces, QR codes
The previous slides showed the benefit of accelerating just a few computational operations using some recently added web standards that are very low level.
They provide access to binary execution in the case of WebAssembly or GPUs in the case of WebGL.
There are other ways we could help web developers run ML faster, some a little higher level.
All these different approaches though have benefits and challenges.
Let's look at four of the alternatives.
Slide 11
Operation-level APIs for ML

Approach 1. Operations
Benefits:
✓ Small, simple API surface
✓ A few operations could provide a large performance gain
Challenges:
x No fusing, graph optimizations, or shared memory
x Too low-level for web developers to use directly
We've already seen that operations, this is slide 11 now, can provide a large performance boost.
They're small, simple APIs, which is great.
We could add more low-level APIs to accelerate deep learning, like convolution (a frequently used mathematical operation when running a machine learning model) and matrix multiplication (matrices are a mathematical construct used throughout machine learning algorithms).
There's a limit to how much performance gain can be achieved looking at operations individually though.
An ML model is a graph of many operations typically.
If some can happen on GPUs, but others happen on CPUs, it can be expensive to switch back and forth between the two contexts.
Memory buffers might need to be copied, execution handoffs can take time.
Also web developers aren't likely to directly use these low level operation APIs.
They'll typically rely on JavaScript libraries like TensorFlow.js to deal with all of the low level details for them.
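As a rough illustration of what "operation level" means, here is how a library like TensorFlow.js already exposes individual operations such as convolution and matrix multiplication; a browser-level operations API would sit underneath calls like these. The shapes and random weights are arbitrary placeholders.

```javascript
// Illustrative only: library-level operation calls in TensorFlow.js.
import * as tf from '@tensorflow/tfjs';

const image  = tf.randomNormal([1, 224, 224, 3]);  // NHWC input, placeholder data
const filter = tf.randomNormal([3, 3, 3, 16]);     // 3x3 kernel, 16 output channels

const convolved = tf.conv2d(image, filter, /* strides */ 1, /* pad */ 'same');
const flattened = convolved.reshape([1, -1]);
const weights   = tf.randomNormal([flattened.shape[1], 10]);
const logits    = tf.matMul(flattened, weights);   // matrix multiplication
```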
Slide 12
Graph APIs open up more performance gains

Approach 2. Graph
Benefits:
✓ Allows fusing, graph optimizations, and shared memory
Challenges:
x 100+ operations to standardize
x Large attack surface to secure
x ML frameworks need to be able to convert to the standard
x Large JavaScript API surface for browsers to implement
x Too low-level for web developers to use directly
Next, let's look at graph APIs.
One of the most popular examples of a graph API for ML is the Neural Networks API (NN API) for Android.
It supports models trained in multiple ML frameworks like TensorFlow and PyTorch (one of the most popular Python libraries for machine learning development, providing a low-level API).
And it can run on mobile devices with various chip sets.
These are all great attributes.
The NN API was the inspiration for the Web Neural Network (WebNN) proposal that's also being incubated in the Web ML Community Group.
By building up a graph of low level operations in JavaScript, and then handing off the whole graph to the browser for execution, it's possible to do smart things like run everything on GPUs or split up the graph and decide which parts to run on which chips.
One big challenge for a graph API is that it has JavaScript APIs for every different operation that's supported.
For perspective, the Android NN API supports over 120 operations.
TensorFlow supports more than 1,000.
That number has been growing by around 20% per year, which makes it really challenging to standardize.
You can learn more about the web neural network API proposal on the website.
There's a whole bunch of information and some active GitHub issues and threads and discussion that's been going on.
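For illustration, here is a rough sketch of the graph-building pattern described above, in the spirit of the WebNN proposal; the interface and method names here are illustrative assumptions, not the actual draft spec, which continues to evolve.

```javascript
// Illustrative sketch only: names approximate the WebNN proposal's builder
// pattern and are not the actual draft spec.
const filterData = new Float32Array(3 * 3 * 3 * 16);  // placeholder weights

const builder = navigator.ml.getNeuralNetworkContext().createModelBuilder();
const input   = builder.input('input', { type: 'float32', dimensions: [1, 224, 224, 3] });
const filter  = builder.constant({ type: 'float32', dimensions: [3, 3, 3, 16] }, filterData);
const conv    = builder.conv2d(input, filter);
const output  = builder.relu(conv);

// The whole graph is handed to the browser at once, so it can decide where
// to run it: GPU, CPU, or dedicated ML hardware.
const model = await builder.createModel({ output });
```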
Slide 14
Application-specific ML APIs are easiest for developers

Approach 4. Application-specific
Benefits:
✓ Small, simple API surface
✓ Easy for developers to use directly
Challenges:
x Customized models are impossible/hard
x Long delay before models are added to the web platform

Examples of Shape Detection APIs:
● Barcodes
● QR codes
● Faces
● Text in an image
● Features of an image

const faceDetector = new FaceDetector({
  maxDetectedFaces: 5,
  fastMode: false
});
try {
  const faces = await faceDetector.detect(image);
  faces.forEach(face => drawMustache(face));
} catch (e) {
  console.error('Face detection failed:', e);
}
Compared to operations APIs or a graph API, an application-specific API like the Shape Detection API is something that a web developer would use directly.
It's super simple.
No JavaScript ML library is required.
You can look at the code snippet on this slide.
To the extent that there are specific ML applications that are common, that developers want access to and that are likely to remain common for many years, it makes sense to provide an easy way to add them to a web app.
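For example, alongside the FaceDetector snippet on the slide, the Shape Detection family also covers barcodes and QR codes. A minimal sketch, noting that browser support for these APIs is still limited, so feature detection is advisable:

```javascript
// Detecting QR codes with the Shape Detection API, where available.
if ('BarcodeDetector' in window) {
  const barcodeDetector = new BarcodeDetector({ formats: ['qr_code'] });
  try {
    const barcodes = await barcodeDetector.detect(image);
    barcodes.forEach(code => console.log('Found QR code:', code.rawValue));
  } catch (e) {
    console.error('Barcode detection failed:', e);
  }
}
```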
Most of the innovation in ML, though, is happening with custom models.
A company might want to optimize their ML models based on their own locale or their product catalog or other features that a high level API would have a hard time accommodating.
You know, there are a small number of APIs that are extremely valuable, that are common, that many people would want to use.
It makes sense to have an API for each of those.
But there's a much larger number of potential models that people would want to run.
And you can't easily make an API for each one of those.
Slide 15
The Model Loader API balances flexibility and performance

Approach 3. Model loader
Benefits:
✓ Small, simple JavaScript API surface
✓ Easy for developers to use directly
✓ Allows fusing, graph optimizations, shared memory
✓ Existing ML model formats provide several full specs
✓ Unblocks experimentation and ML evolution
Challenges:
x 100+ operations to parse and validate
x Large attack surface to secure
x CoreML, PyTorch, TFLite, WinML formats are only partly convertible
x What format(s) to support? There are many...
The Model Loader API tries to strike the balance between flexibility and developer friendliness.
It provides a small, simple JavaScript API surface that could stand the test of time.
It supports all of the performance optimizations that you can get with a graph API.
The main difference from the graph API is that it moves the definitions of the 100 or more operations out of JavaScript and into the model file format, which is simply stored at a URL.
That file still needs to be parsed and validated by the browser and secured against potentially malicious code.
Machine learning execution engines like CoreML (the platform API for machine learning on Apple operating systems, including iOS and macOS), TensorFlow, and WinML (the platform API for machine learning on Windows) already know how to execute an ML model in their respective formats, and they can be optimized for the underlying operating system.
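One way to read "complementary" here: a web app could feature-detect the Model Loader API and fall back to a JavaScript library when it is absent. A minimal sketch, where loadWithTfjs is a hypothetical wrapper around TensorFlow.js:

```javascript
// Hedged sketch: prefer the (proposed) native Model Loader API, fall back to
// a JavaScript ML library when it is not available.
async function loadModel(modelUrl) {
  if (navigator.ml && navigator.ml.createModelLoader) {
    // Native path: the browser / OS engine parses, validates, and compiles.
    const loader = navigator.ml.createModelLoader();
    const model = await loader.load(modelUrl);
    return model.compile();
  }
  // Fallback path: run the model with a JavaScript ML library instead.
  return loadWithTfjs(modelUrl);   // hypothetical wrapper around TensorFlow.js
}
```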
Slide 16
Summarizing the options for ML APIs on the web
■ Building ML-specific APIs into the web can increase performance
■ There are multiple approaches, with tradeoffs
■ The Model Loader API is complementary to graph, operations, and application-specific APIs
■ We don't know yet which level(s) of API we should propose to a working group.
  ○ Let's get feedback from developers
Summarizing the options for ML APIs on the web then: the goal for all of them is to increase performance.
That's why we want to have ML specific APIs rather than just do everything in JavaScript using existing general purpose APIs.
There are multiple approaches and each one of them has pros and cons.
The Model Loader API is complementary to the others and we could choose to pursue it in addition to one or more of the other approaches.
We could do all of them, or we could pick one.
We don't know yet though which is really the best.
So I'd like to see us move ahead and get some feedback from developers who are going to actually use these APIs.
Slide 17
Caveat: it's early days and there are big challenges
■ ML is evolving rapidly.
  ○ New computational operations are being invented and published regularly.
  ○ E.g., TensorFlow has seen around 20% growth in operations every year.
■ Hardware is evolving too, with tensor processing units and more.
■ Backward compatibility guarantees are essential for web standards, and not yet common for ML libraries.
■ ML frameworks each have their own operation sets; overlap between them is only partial, and conversion is not always possible.
  ○ The ONNX project (onnx.ai) is trying to define a common subset.
There's some important caveats.
It's really important to call them out because these are potential showstoppers and we need to be aware of them upfront.
The first major caveat is just that ML is evolving super fast.
New operations and approaches are being invented by researchers and published literally every day.
Hardware is evolving as well.
That's on a slower cycle, of course, because of the cost of fabrication and setting up manufacturing at scale.
But hardware is changing quickly, too.
Meanwhile, the web is forever.
Backwards compatibility is essential.
Just as an example, the Android NN API hasn't really solved backwards compatibility, despite broad adoption.
The web is just getting started with ML standards.
So it's early.
Finally, there are multiple popular ML frameworks.
TensorFlow is one of them.
CoreML and WinML are really important too, because those are the frameworks that the operating system vendors, Apple and Microsoft, support natively.
Each of these frameworks chooses what set of operations to support and they make those decisions independently with their own communities.
That means they've all made different choices, and they're evolving at different rates.
There's only partial overlap from one ML framework to the next in terms of what operations they support.
And conversion is not possible in general.
Fortunately, a subset of those operations can be converted and standardized.
There's an initiative called ONNX that is working on exactly this problem.
But it's worth calling out that conversion is hard and is not going to be a hundred percent complete.
My personal perspective is that the Model Loader API gives us a way to explore ML on the web with full hardware acceleration while all of these uncertainties are being worked out.
Slide 18
The current plan
✓ Incubate in the Web ML Community Group
➢ Now: Chrome and Chrome OS are working on an experimental build with TFLite integration
  ○ Coordinate with WebNN API efforts
■ Next: shim the Model Loader API on top
  ○ Goal: alternate model formats and execution engines are possible
  ○ Run benchmarking to measure the performance gains
■ Make a custom build available to developers
■ Gather feedback
I want to conclude with a status report about where the Model Loader API is in the standards process.
Currently, it's incubating in a community group.
Engineers in Chrome and ChromeOS are working on an experimental browser build with TFLite integration as a backend.
It's a goal to be able to support other ML engines as well.
There are a bunch of technical things to work out, like process isolation, file validation, and security protections.
Once those have been addressed, the next step would be to implement the Model Loader API on top of it and to make a custom browser build available so that developers could take a look at it.
We'd really like to get feedback from some early developers to understand what we can do to make a better API and whether this is the right level of abstraction.
Slide 19
Thank You
binghamj@google.com
github.com/webmachinelearning/model-loader