W3C Workshop on Web and Machine Learning

Accelerate ML inference on mobile devices with Android NNAPI - by Miao Wang (Google)



This is Miao Wang. I'm a software engineer working on the Android Neural Networks API at Google. Today I'm happy to talk about how to accelerate ML inference on mobile devices with the help of the Android Neural Networks API.

This talk will cover the following topics: what NNAPI is, the current features of NNAPI, the performance and power impact of using NNAPI, and how to use NNAPI.

So first of all, what is NNAPI?

As the name suggests, NNAPI is intended to run neural network inference on hardware accelerators.

NNAPI is a C API. We chose a C API mainly because C APIs have stable interfaces and can easily be used from high-level programming languages like Java and by machine learning frameworks like TensorFlow Lite and PyTorch Mobile.

As we all know, the ML field is evolving fast: new concepts, operators, and data types are continuously coming out.

All of this requires that NNAPI also be able to evolve fast.

Additionally, since the closer you are to the metal, the harder it is to evolve, we want to make sure existing models and use cases can run well on both old and new hardware, so backwards compatibility is also important in the API.

Here is a brief history of NNAPI: NNAPI 1.0 was introduced with Android O-MR1; it had 29 operators and supported fp32 and asymmetric quantization.

And in Android P, we added a bunch of operators.

With Android Q, a lot more operators were added, and we started to support fp16 and signed per-channel quantization.

Additionally, developers can use the introspection API to query the accelerators available on the device and choose which of them to use to run inferences.
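For a rough illustration, enumerating devices with the introspection API looks something like the sketch below (assuming API level 29 and a C++ caller; the function name and logging are placeholders, not from the talk):

    #include <android/NeuralNetworks.h>
    #include <cstdint>
    #include <cstdio>

    // List the accelerators that NNAPI can see on this device (API 29+).
    void ListNnapiDevices() {
        uint32_t deviceCount = 0;
        ANeuralNetworks_getDeviceCount(&deviceCount);
        for (uint32_t i = 0; i < deviceCount; ++i) {
            ANeuralNetworksDevice* device = nullptr;
            ANeuralNetworks_getDevice(i, &device);
            const char* name = nullptr;
            ANeuralNetworksDevice_getName(device, &name);
            printf("Accelerator %u: %s\n", i, name);
        }
        // A compilation can then be pinned to chosen devices with
        // ANeuralNetworksCompilation_createForDevices().
    }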

Vendors can also use the vendor extension mechanism to add additional functionality to NNAPI. In the soon-to-be-released Android R, we added more operators and started to support signed asymmetric quantization.

There are also advanced features being supported, like control flow, quality of service, memory domains, and an asynchronous command queue. We also made the NNAPI runtime an updatable APEX module, which means we are able to update the runtime much faster than the normal Android update schedule.

The key objective of NNAPI is to make inferences run fast and efficiently on as many devices as possible.

In order to achieve that, we need to make sure that the inferences running through NNAPI can run on the accelerators available on the device. How do we achieve that?

Here is a high-level overview of the architecture of NNAPI. You can find two important interfaces defined by NNAPI in this architecture: the NDK API interface and the hardware abstraction layer, the HAL.

Application developers can use the NDK API to interact with the NNAPI runtime, most likely through ML frameworks like PyTorch Mobile and TensorFlow Lite.

I'll talk more about the NDK interface in the How to Use NNAPI section.

Hardware vendors implement the NNAPI HAL interface, which allows the runtime to discover available hardware accelerators and interact with them.

The HAL is versioned and backwards compatible, similar to the NDK interfaces.

Currently, many accelerators are already implemented behind the NNAPI HAL, including GPUs, DSPs, TPUs, NPUs, etc. from various hardware vendors and IP providers.

The NNAPI runtime is responsible for validating the requests from applications, managing memory, and distributing workloads to available accelerators, and it is in charge of interacting with other components in the Android OS. You can find more information about the architecture in the link on the slide.

So, let's talk about some performance and power numbers.

As I mentioned earlier, the key objective of NNAPI is to make inferences run fast and efficiently.

Both performance and power consumption are important for the user experience on mobile devices.

Here is a slide showing the numbers for running the Google Lens OCR model on Pixel 4, which shipped with Android Q.

We can see that the NNAPI path delivers 3x the performance of TFLite's optimized CPU kernels.

It also uses 3.7x less power, which is critical in this particular use case.

Additionally, the whole model runs on DSP instead of the CPU which frees up the CPU for other workloads if needed.

We can see similarly great improvements for models running on other SoCs.

Here is an example of the ML Kit face detection model on a device with a MediaTek P90 SoC. As we know, ML is evolving fast, and there are more and more models running on mobile devices.

And different models running on different devices exhibit different performance characteristics.

So we're continually working hard to optimize the software layers and introduce new features to make inferences faster.

Now in order to get all the performance gains, we need to know how to use NNAPI. Let's talk about that.

Well, due to the limited time of this talk, I can only briefly cover the different ways of using NNAPI. For detailed tutorials and documentation, especially for the advanced features, please refer to the links on the slides.

First of all, developers can use NNAPI directly.

All NNAPI functions and types start with ANeuralNetworks, and the general workflow of the code is as follows: first, we create and define a compute graph, the ANeuralNetworksModel.

Then we create a compilation object, the ANeuralNetworksCompilation, from the model, and after that we create execution objects from the compilation object to run and manage each individual inference.
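Here is a minimal sketch of that workflow, building a toy one-operation model that adds two float tensors (assuming API level 29; the function name and values are illustrative, and error checking is omitted for brevity):

    #include <android/NeuralNetworks.h>
    #include <cstdint>

    // Build, compile, and run a trivial model computing out = a + b.
    void RunAddModel() {
        // 1. Create and define the compute graph (ANeuralNetworksModel).
        ANeuralNetworksModel* model = nullptr;
        ANeuralNetworksModel_create(&model);

        uint32_t dims[1] = {4};
        ANeuralNetworksOperandType tensorType = {
            ANEURALNETWORKS_TENSOR_FLOAT32, 1, dims, 0.0f, 0};
        ANeuralNetworksOperandType scalarType = {
            ANEURALNETWORKS_INT32, 0, nullptr, 0.0f, 0};

        ANeuralNetworksModel_addOperand(model, &tensorType);  // operand 0: a
        ANeuralNetworksModel_addOperand(model, &tensorType);  // operand 1: b
        ANeuralNetworksModel_addOperand(model, &scalarType);  // operand 2: activation
        ANeuralNetworksModel_addOperand(model, &tensorType);  // operand 3: out

        int32_t fuseNone = ANEURALNETWORKS_FUSED_NONE;
        ANeuralNetworksModel_setOperandValue(model, 2, &fuseNone, sizeof(fuseNone));

        uint32_t addInputs[3] = {0, 1, 2}, addOutputs[1] = {3};
        ANeuralNetworksModel_addOperation(model, ANEURALNETWORKS_ADD,
                                          3, addInputs, 1, addOutputs);
        uint32_t modelInputs[2] = {0, 1}, modelOutputs[1] = {3};
        ANeuralNetworksModel_identifyInputsAndOutputs(model, 2, modelInputs,
                                                      1, modelOutputs);
        ANeuralNetworksModel_finish(model);

        // 2. Create the compilation object from the model.
        ANeuralNetworksCompilation* compilation = nullptr;
        ANeuralNetworksCompilation_create(model, &compilation);
        ANeuralNetworksCompilation_finish(compilation);

        // 3. Create an execution per inference, bind buffers, and run.
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, out[4] = {};
        ANeuralNetworksExecution* execution = nullptr;
        ANeuralNetworksExecution_create(compilation, &execution);
        ANeuralNetworksExecution_setInput(execution, 0, nullptr, a, sizeof(a));
        ANeuralNetworksExecution_setInput(execution, 1, nullptr, b, sizeof(b));
        ANeuralNetworksExecution_setOutput(execution, 0, nullptr, out, sizeof(out));
        ANeuralNetworksExecution_compute(execution);  // synchronous, API 29+

        ANeuralNetworksExecution_free(execution);
        ANeuralNetworksCompilation_free(compilation);
        ANeuralNetworksModel_free(model);
    }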

But all this involves lots of boilerplate code if you want to implement a whole model directly in NNAPI.

So there is an easier way to use NNAPI: via machine learning frameworks like PyTorch Mobile and TensorFlow Lite.

You can also use NNAPI with other high-level APIs like WebNN.

If you're using TensorFlow Lite, it only takes a couple of lines of change to enable the NnApiDelegate, which can automatically detect the supported operations and run them through NNAPI.
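For illustration, with the TensorFlow Lite C++ API that couple of lines looks roughly like the sketch below (the model path and function name are placeholders; error checking omitted):

    #include <memory>

    #include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    void RunWithNnapiDelegate() {
        // Load the model and build an interpreter as usual.
        auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
        tflite::ops::builtin::BuiltinOpResolver resolver;
        std::unique_ptr<tflite::Interpreter> interpreter;
        tflite::InterpreterBuilder(*model, resolver)(&interpreter);

        // The couple of lines of change: hand supported ops to NNAPI.
        tflite::StatefulNnApiDelegate nnapi_delegate;
        interpreter->ModifyGraphWithDelegate(&nnapi_delegate);

        interpreter->AllocateTensors();
        // ... fill input tensors ...
        interpreter->Invoke();
        // ... read output tensors ...
    }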

That's all I have for today, thanks a lot and let me know if you have any questions.


