W3C Workshop on Web and Machine Learning

Mobile-first web-based Machine Learning - by Josh Meyer & Lindy Rauchenstein (Artie)



Hello, and welcome to our talk.

Today we are going to introduce you to our approach to machine learning at Artie, Inc.

My name is Josh Meyer and today I'm co-presenting this talk with Lindy Rauchenstein.

We are lead scientists for speech and vision at Artie, Inc.

At Artie, we are working on web-based mobile-first instant games that rely heavily on machine intelligence.

That is, if you are a user, you can click on a link anywhere on the web, and the next instant you are in one of our experiences, playing one of our games, no download required.

When we talk about machine intelligence, we're specifically talking about both conversational intelligence and visual awareness, visual intelligence.

The user is able to interact with the digital character by voice, text, or vision.

They're looking to have fun.

They're looking to have an interesting conversation with some digital character.

In that light, they're going to have a very low tolerance for machine learning models that are either underperforming or have some kind of high latency.

Imagine you're having a conversation with your favorite character, and then all of a sudden that conversation becomes difficult because the character is not responding fast enough, or the character's not understanding.

So given our use case and our users' requirements, we started to think about how we're going to deploy our different machine learning models: for any given experience we might have a handful of models spanning speech, vision, and NLP.

And so the traditional approach goes, basically put the big models on the server and put the small models on device.

And taking that a little further, shrink the big models down so you can put those on device too.

So get everything on device.

Well, this approach of shrinking down larger models and putting as many as possible on device is interesting to us at Artie because it's inherently privacy-preserving, and it offers us a chance to make some gains in latency and data costs for customers.

It does cause some headaches.

If you have a large machine learning model and you try to shrink it down, you usually have to make a choice.

Are you going to make a smaller model that does the same thing as the original model, but just not as well?

Like, you're losing accuracy.

Or do you take that original model, constrain its domain or functionality, and put that smaller model on device?

No matter what, if you're shrinking down a model you have to choose.

Are you going to lose accuracy and performance or are you going to lose functionality?

So let's take a concrete example from Artie's machine learning stack and what our thought process is like when we're talking about shrinking down models.

For speech recognition, our interactions call for something that's pretty open domain, large vocabulary, because our users can say just about anything to our digital characters.

This isn't a typical command and control kind of scenario where you're saying up, down, left, and right.

We're trying to elicit conversations from our users.

So we need to recognize that.

Our model right now is made up of two parts.

It's made up of an acoustic model and a language model.

This is pretty standard for speech recognition.

The acoustic model converts raw audio into some kind of probability distribution over letters of the alphabet.

And then the language model converts that into a string of words that is hopefully the correct transcription of what was said.
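
To make that two-stage split concrete, here is a toy Python sketch (an illustration, not Artie's code): think of the acoustic model's output as a matrix of per-frame character probabilities, and of the simplest possible decoder as one that takes the most likely character at each frame and collapses repeats and blanks. A real system replaces that greedy step with a beam search that scores candidate word sequences with the language model.

    # Toy illustration of the two-stage pipeline described above (not Artie's code).
    import numpy as np

    ALPHABET = list("abcdefghijklmnopqrstuvwxyz '") + ["<blank>"]  # CTC-style blank symbol

    def greedy_decode(char_probs: np.ndarray) -> str:
        """char_probs: [num_frames, len(ALPHABET)] acoustic-model output."""
        best = char_probs.argmax(axis=1)                        # most likely symbol per frame
        out, prev = [], None
        for idx in best:
            if ALPHABET[idx] != "<blank>" and idx != prev:      # collapse blanks and repeats
                out.append(ALPHABET[idx])
            prev = idx
        return "".join(out)

    # Fake acoustic-model output for 50 audio frames:
    fake_probs = np.random.dirichlet(np.ones(len(ALPHABET)), size=50)
    print(greedy_decode(fake_probs))
    # A production decoder runs a beam search over these probabilities and
    # rescores hypotheses with the language model instead of decoding greedily.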

So if we want to shrink this system, these two models, down: the acoustic model, to start, is going to be about 180 megabytes.

And the language model is going to be something more in the gigs range.

Let's say about one gig.

If we want to get something that's as small as possible but still functional, something that still recognizes an open vocabulary, we might be able to get the acoustic model down to 40 megabytes if we're lucky, and the language model we could actually just throw out right away and make our lives easier.

So at the end of the day we have a large-vocabulary model that is just 40 megabytes.

This model is going to perform poorly.

And furthermore, it's not going to be small enough for our use case.

We need models that are so small that when the user clicks the link, they're already talking to the character instantaneously.

They're not going to wait for a 40 megabyte download.

So in light of this discussion of why we're not able to shrink a large-vocabulary speech recognizer down to something that's instantaneously downloadable (we can't get it on the order of kilobytes by any stretch), we decided to keep it on the backend.

We've decided to keep our whole language stack, including ASR and NLP, on the backend, while we can still put our vision models on the front end, because there we can have models on the order of kilobytes and still have good accuracy.

And this is actually really nice because, in the grand scheme of things, users are going to feel much more attached to their video and photo data, the pixel data from the camera, than they are to the voice data itself.

So we're able to make a win here for privacy by keeping the vision models, and the video data, on device.

Very briefly, our language stack, the ASR and NLP models that are on the backend server, are built on open source solutions.

For speech in particular, we have a WebSocket server solution that we call Artie DeepServe, which we built around Mozilla's DeepSpeech, where we've actually added to the core DeepSpeech code the ability to do batched inference efficiently, and also the ability to add hot-word hints, so we're able to run a smoother experience that way.
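
Artie DeepServe itself isn't open source, but a minimal sketch of this kind of server, assuming the open-source deepspeech and websockets Python packages (hot-word boosting only exists in DeepSpeech 0.9 and later, and batching, streaming, and error handling are omitted here), might look roughly like this:

    # Minimal sketch of a DeepSpeech-backed WebSocket speech-to-text server.
    # Illustrative only: this is not Artie DeepServe; it omits batching,
    # streaming, authentication, and error handling.
    import asyncio
    import numpy as np
    import websockets
    from deepspeech import Model

    model = Model("deepspeech-0.9.3-models.pbmm")                  # acoustic model
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")   # language model
    model.addHotWord("artie", 7.5)   # hot-word hint (DeepSpeech 0.9+ API)

    async def transcribe(websocket, path):  # handler signature varies by websockets version
        # Each binary message is expected to be 16 kHz, 16-bit mono PCM audio.
        async for message in websocket:
            audio = np.frombuffer(message, dtype=np.int16)
            await websocket.send(model.stt(audio))

    async def main():
        async with websockets.serve(transcribe, "0.0.0.0", 8765):
            await asyncio.Future()   # run forever

    if __name__ == "__main__":
        asyncio.run(main())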

So the language models run on a server, but all of the computer vision models run in the browser on the user's device, usually on a mobile device.

This was firstly motivated by privacy.

So we run facial expression recognition, user engagement, pose estimation, semantic segmentation, and object detection models for different game mechanics.

So the camera acquires video during gameplay inside of users' homes.

And for privacy reasons, we aren't interested in transmitting or saving any of that video data off of the device.

We want to process everything privately right there in the web browser.

We also wanted the lowest possible latency.

We use visual input for natural-feeling things, like having a character that smiles back at you, or that nods at appropriate times in a conversation to acknowledge that it's listening, and there any delay would feel unnatural.

But the most important constraint is that our product is built on top of Unity's Project Tiny, which is a beta release game engine for the browser.

The game engine is extremely efficient on the web, but one constraint it imposes on the machine learning side is that we cannot do any dynamic memory allocation. Avoiding dynamic memory allocation was the core constraint that led us to the particular machine learning framework we chose, which was TensorFlow Lite for Microcontrollers.

So TensorFlow Lite is a version of TensorFlow that's widely used in mobile development.

And TensorFlow Lite for Microcontrollers is, in turn, a subset and adaptation of TensorFlow Lite.

So TensorFlow Lite doesn't support training.

It only supports inference.

It doesn't have support for every data type; it doesn't include the double data type, for example.

It doesn't have every operation that would be available when you're building TensorFlow models to run on a server.

But in exchange it's much smaller.

It's optimized to run on the ARM Cortex CPUs used in mobile phones.

It uses OpenGL to work with GPUs.

It has great model conversion tools that you can use to quantize your networks into eight-bit neural networks.

So you can easily shrink your models by 75% from the 32-bit float versions.
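
As a concrete illustration of those conversion tools (standard tf.lite APIs, not Artie-specific code, with a stand-in model and calibration data), full-integer quantization looks roughly like this; going from 32-bit floats to 8-bit integers is where that roughly 75% size reduction comes from:

    # Sketch: convert a Keras model to an 8-bit quantized TFLite model.
    # The model and calibration data here are placeholders for illustration.
    import numpy as np
    import tensorflow as tf

    model = tf.keras.applications.MobileNetV2(weights=None)   # stand-in vision model

    def representative_data():
        # A few sample inputs let the converter calibrate int8 value ranges.
        for _ in range(100):
            yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())
    # Weights go from 32-bit floats to 8-bit integers, so the file ends up
    # roughly a quarter of the float model's size (about a 75% reduction).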

And the version of TensorFlow Lite that we use, TensorFlow Lite for Microcontrollers, is even more constrained in that it has a really tiny runtime, which for us was perfect.

We have a lot of parts of our product that need to download when you play an instant game: animations, game logic, and everything else.

So the runtime binary is only 20 kilobytes.

And because the framework is designed to run on microcontrollers, which oftentimes have to run for years on embedded devices, it doesn't continuously allocate and free memory.

So that's the reason we chose it for running models on the web over other solutions like TensorFlow.js, for example.

I built an interface from the native code into our game engine and it performs really well.

So to wrap things up, the web-based approach to machine learning that we use at Artie incorporates a client-side piece for vision and a server-side piece for voice, and that works really well for us.

And it gives us a lot of power and flexibility.

We're excited at the prospect of shrinking down as many models as possible, making them faster and more performant.

For natural language, we're a ways away from getting useful models that are in the range of kilobytes.

For vision, we're there already, and we can run MobileNet, for example, very quickly in the browser.
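
The in-browser runtime here is TensorFlow Lite for Microcontrollers compiled into the game engine, but as a rough offline analogue (using the standard tf.lite Python interpreter and a hypothetical quantized model file), the allocate-once-then-invoke pattern looks like this:

    # Sketch: time inference on a quantized .tflite vision model with the
    # standard tf.lite.Interpreter. The in-browser runtime is TFLite Micro,
    # but the allocate-once-then-invoke pattern is the same.
    import time
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")  # hypothetical file
    interpreter.allocate_tensors()             # all tensor buffers allocated up front

    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    frame = np.zeros(inp["shape"], dtype=inp["dtype"])   # stand-in camera frame

    start = time.perf_counter()
    for _ in range(100):
        interpreter.set_tensor(inp["index"], frame)
        interpreter.invoke()
        _ = interpreter.get_tensor(out["index"])
    print(f"avg inference: {(time.perf_counter() - start) / 100 * 1000:.1f} ms")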

One worthwhile thing to mention is that Unity has another project called Barracuda, which right now can't run in the web-based version of their game engine, but it's a framework for running neural networks through Unity Shaders.

And using shaders to run neural networks directly is a really efficient alternative that we're continuing to explore at Artie.

I'm looking forward to hearing about all of your experiences and approaches to doing machine learning on the web.

Thank you.

