W3C Workshop on Web and Machine Learning

A virtual character web meeting with expression enhance power by machine learning - by Zelun Chen (Netease)



First, let me introduce myself briefly.

I'm Zelun Chen from Netease and I'm a front-end and client development engineer.

Today, the topic of my presentation is "An Online Virtual Character Web Meeting with Expression Enhance Power by Machine Learning", where we use WebAssembly to run a machine learning algorithm in the browser.

Now I'll give my presentation.

First, let me explain the background of our project.

Since the pandemic has made holding large conferences very inconvenient, we have been thinking about building a meeting scene within our game scene, so that the lecturer and the audience can join the conference through virtual images and the online conference can be vivid and immersive.

As shown in this picture, this is the meeting scene we have built.

There are audience seats and there is a podium.

The lecturer can deliver a speech on the podium and display his or her presentation on the screen in the middle of it.

In the lower left corner is the virtual image of the lecturer.

So far, the faces of the audience and the lecturer have been stiff, without rich facial expressions.

That's not quite what we're looking for in an immersive experience. Now let me introduce our framework.

Our framework is as follows: First, the audience and the lecturer go to the web page in their browser and connect to a remote game client via WebRTC.

Then their voices can be transmitted through WebRTC from their local microphones to the game's audio module, to allow lectures and Q&A sessions.

After building the online meeting scene in this way, we've been thinking that a successful meeting takes more than just these basic elements.

We also need to enhance interactions among attendees to create a vivid and immersive meeting, for example, to allow the audience to turn around and see the look on other people's faces, or to say hello to each other, or even to see the face of the lecturer.

After thinking for some time, it occurred to us that we could use the browser to capture faces from their cameras and transmit them to the virtual images so that everyone can see them.

Our algorithm group happens to have a machine-learning-based expression-transfer library, so we were wondering if we could run this library in the browser.

So here's how we run this whole algorithmic pipeline.

First, we use the camera to capture video and send it to our algorithm to get the model's node parameters.

Then we render these node parameters onto the virtual image in the browser, and repeat this step to convert real-time facial data into the web model.

After that, we thought about how to run this pipeline on the Web.

As our machine learning algorithm is based on C++ and only uses the OpenCV library and a machine learning inference engine internally, all we need to do is run this C++ code on the Web.

So we came up with three solutions.

The first one is to rewrite our source code in JavaScript, using the OpenCV.js library and a JavaScript build of the machine learning inference engine.

But this requires knowing both C++ and JavaScript development and the essential differences between them, which is a lot of work.

So we didn't adopt this solution in the end.

The second solution is to deploy the algorithm on a server and invoke it through a back-end API.

But since we need real-time facial data to drive the model, every request would add network latency, so we didn't adopt this solution either.

The third solution is to compile the entire C++ library to WebAssembly and run it directly in the browser.

In this way, the problems of the first two solutions won't occur.

Eventually, we adopted the third solution.

So here is the routine way of compiling C++ to WebAssembly, as shown on this slide.

We pack it up as a WASM file and run it in the browser.
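As a generic sketch of that routine approach (the function and file names here are made up for illustration, not our actual library), a C++ entry point can be marked for export with Emscripten's `EMSCRIPTEN_KEEPALIVE`, with a no-op fallback so the same file also compiles natively:

```cpp
// Hypothetical entry point for a face-tracking library (names are illustrative).
// Assumed build sketch: emcc face_infer.cpp -O3 -sMODULARIZE -o face_infer.js
#ifdef __EMSCRIPTEN__
#include <emscripten/emscripten.h>
#define WASM_EXPORT EMSCRIPTEN_KEEPALIVE  // keep the symbol in the WASM module
#else
#define WASM_EXPORT  // no-op when compiled natively
#endif

#include <cstdint>

// Takes one RGBA camera frame, writes n model node parameters into `out`.
// The real inference call is stubbed out with a trivial placeholder here.
extern "C" WASM_EXPORT
int process_frame(const uint8_t* rgba, int width, int height,
                  float* out, int n) {
    if (!rgba || !out || width <= 0 || height <= 0) return -1;
    for (int i = 0; i < n; ++i) out[i] = 0.0f;  // placeholder for inference
    return 0;  // success
}
```

On the JavaScript side, the exported function would then be reachable through the generated module (e.g. via `ccall`/`cwrap`), with the frame copied into the WASM heap first.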

Here are the results we got.

The first is how it works in the browser.

As you can see, it doesn't work very well in the browser.

It takes about two seconds to convert facial data into model data, so it looks a bit laggy.

Then let's take a look at what we did in Unity, which is what we expect to see.

We can tell by comparing the two videos that our pipeline doesn't work very well on WebAssembly.

So here we'd like to raise a few questions about running our algorithm as WebAssembly on the Web, to discuss with you.

First of all, as a front-end development engineer, packaging an entire C++ project requires a lot of knowledge of CMake, Clang, and cross-compilation, as well as the source code of the libraries involved, because some methods from external libraries cannot run on WebAssembly.

And here we have an idea: maybe we could build a package registry like npm to version LLVM-compiled code libraries written by developers, so that WebAssembly developers can use these libraries easily whether or not they know C++.

Second, since most of our inference libraries support OpenMP rather than pthreads, we cannot use WebAssembly's multithreading to optimize them.
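As a generic workaround sketch (not our production code), a simple OpenMP-style parallel-for can be rewritten on top of `std::thread`, which Emscripten can compile to WebAssembly threads when built with `-pthread` (backed by Web Workers and SharedArrayBuffer):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Rough equivalent of "#pragma omp parallel for" over [0, n), sketched with
// std::thread so it can run under Emscripten's pthread support.
template <typename Fn>
void parallel_for(std::size_t n, unsigned nthreads, Fn body) {
    if (nthreads == 0) nthreads = 1;
    std::vector<std::thread> pool;
    std::size_t chunk = (n + nthreads - 1) / nthreads;  // ceil(n / nthreads)
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        if (lo >= hi) break;  // more threads than chunks
        pool.emplace_back([=] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& th : pool) th.join();
}
```

The caveat is that the enclosing page must send cross-origin-isolation headers for SharedArrayBuffer to be available, which is another deployment constraint on top of the code change.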

There are also threading-model differences to be dealt with.

Third, we have encountered poor performance in matrix calculations, and we would love to hear any suggestions for optimizing them.

Fourth, our WASM file can be retrieved from the browser and decompiled through reverse engineering; there is little support for encryption, so the safety of our algorithm cannot be guaranteed.

Do you have any possible solutions to this?

Those are the four problems that we're dealing with.

We have come up with some solutions to improve its performance.

For example, we first thought about using some efficient JavaScript libraries and using EM_JS to call JavaScript from WebAssembly.

This is certainly not a good solution, as it requires communication between the WASM and JS sides, and everything that calls this library needs to be modified.

Could a common Web algorithm library provide web development methods that automatically call the underlying EM_JS implementations when the environment is WebAssembly?
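For reference, calling out to JavaScript from the C++ side goes through Emscripten's `EM_JS` macro; a minimal sketch (the function name and JS body are illustrative), with a native fallback so it also compiles outside Emscripten, might look like:

```cpp
#ifdef __EMSCRIPTEN__
#include <emscripten/em_js.h>
// Under Emscripten, the body is JavaScript executed in the page; the pointer
// argument arrives as a byte offset into the WASM heap.
EM_JS(double, fast_sum, (const double* a, int n), {
    var s = 0;
    for (var i = 0; i < n; i++) s += HEAPF64[(a >> 3) + i];
    return s;
});
#else
// Native fallback so this sketch compiles and runs outside Emscripten too.
double fast_sum(const double* a, int n) {
    double s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}
#endif
```

The cost the talk mentions is visible here: every call crosses the WASM/JS boundary, and any data has to be read out of the WASM heap on the JS side, so this only pays off when the JS routine does substantial work per call.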

Second, we tried the experimental SIMD features in the latest browsers.

However, it didn't meet our expectations.

First of all, the audience and the lecturer may not be using the most up-to-date browsers.

Second, it didn't make much of a difference to our computational efficiency.
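As a generic illustration of what WebAssembly SIMD code looks like (not our actual kernel), here is a dot product using the `wasm_simd128.h` intrinsics enabled by `emcc -msimd128`, with a scalar fallback for builds and browsers without SIMD support:

```cpp
#ifdef __wasm_simd128__
#include <wasm_simd128.h>
// Process four floats per iteration with 128-bit WASM SIMD.
float dot(const float* a, const float* b, int n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4)
        acc = wasm_f32x4_add(acc, wasm_f32x4_mul(wasm_v128_load(a + i),
                                                 wasm_v128_load(b + i)));
    // Horizontal sum of the four accumulator lanes.
    float s = wasm_f32x4_extract_lane(acc, 0) + wasm_f32x4_extract_lane(acc, 1)
            + wasm_f32x4_extract_lane(acc, 2) + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; ++i) s += a[i] * b[i];  // leftover elements
    return s;
}
#else
// Scalar fallback for non-SIMD builds (also what older browsers get).
float dot(const float* a, const float* b, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}
#endif
```

Whether this helps depends heavily on the workload: kernels dominated by memory traffic or by boundary crossings see little benefit from 128-bit SIMD, which may explain the modest gains observed.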

Do you have any suggestions or opinions?

Please feel free to tell us, thank you.



Thanks to Futurice for sponsoring the workshop!


Video hosted by WebCastor on their StreamFizz platform.