W3C Workshop on Web and Machine Learning

RNNoise, Neural Speech Enhancement, and the Browser - by Jean-Marc Valin



Hi, I'm Jean-Marc Valin, I'm the author of RNNoise.

And I'll be talking about neural speech enhancement through RNNoise, and also about how it relates to the browser.

I'm currently employed by Amazon, but I'm giving this talk as an individual.

Speech enhancement isn't exactly a new topic.

It's been around since the 70s, and traditionally it's done using signal processing.

It uses complicated spectral estimators, usually combined with hand-tuned parameters.

And it works pretty decently on stationary noise, at mid to higher SNRs.

The complexity is very low, but the quality is limited.
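To make that traditional approach a bit more concrete, here is a minimal sketch of the kind of per-bin spectral gain such systems compute; the noise tracker and the constants below are hypothetical simplifications for illustration, not any particular production estimator.

```c
/* Minimal sketch of a conventional DSP noise suppressor.
 * Hypothetical simplification: a fixed smoothing factor tracks the noise
 * floor and a Wiener-style gain attenuates each frequency bin. Real
 * systems use much more elaborate, hand-tuned estimators. */
#include <math.h>

#define NUM_BINS 480

static float noise_psd[NUM_BINS];   /* running noise power estimate */

void dsp_suppress(float spectrum_power[NUM_BINS], float gain[NUM_BINS]) {
    const float alpha = 0.98f;      /* hand-tuned smoothing constant */
    for (int i = 0; i < NUM_BINS; i++) {
        /* crude noise tracking: follow drops quickly, rises slowly */
        if (spectrum_power[i] < noise_psd[i])
            noise_psd[i] = spectrum_power[i];
        else
            noise_psd[i] = alpha * noise_psd[i] + (1 - alpha) * spectrum_power[i];

        /* Wiener-style gain: SNR / (1 + SNR), floored to limit musical noise */
        float snr = spectrum_power[i] / (noise_psd[i] + 1e-9f);
        float g = snr / (1.f + snr);
        gain[i] = fmaxf(g, 0.1f);
    }
}
```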

On the other hand, there's a new approach, which is based on deep neural networks.

It's entirely data driven, so no need to tune all these parameters.

But it uses large models, typically in the tens of megabytes, and it handles non-stationary noise and works at low SNRs, so the quality is much higher than with the traditional approach.

But unfortunately, the complexity is quite high.

And RNNoise is a way of trying to get the best of both worlds.

So, trying to get to the same quality as the DNN approach with the complexity of the DSP approach.

RNNoise is really a hybrid solution.

It starts from a conventional DSP approach, and from there, it replaces these complicated estimators with a deep neural network that includes several fully connected layers as well as three GRU layers.

One of the key tricks to help bring the complexity down, is that the spectrum is divided into 22 critical bands, rather than processing every single frequency bin separately.

And each of these 22 bands is independently attenuated.

So we have a gain for each of these bands, and it controls how each band is modulated.
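As a rough sketch of that per-band attenuation, the idea is to turn the 22 gains into a smooth gain curve over the frequency bins and multiply the spectrum by it; the band edges and linear interpolation below are illustrative assumptions, not the exact band layout used in RNNoise.

```c
/* Sketch of applying 22 per-band gains to a magnitude spectrum.
 * The band edges (roughly Bark-like) and the interpolation here are
 * assumptions for illustration, not RNNoise's exact layout. */
#define NB_BANDS 22
#define FREQ_SIZE 480   /* assumed number of frequency bins */

static const int band_edge[NB_BANDS + 1] = {
    0, 2, 4, 6, 8, 10, 12, 14, 16, 20, 24, 28,
    34, 40, 48, 60, 78, 100, 136, 186, 260, 360, 480
};

void apply_band_gains(float spectrum[FREQ_SIZE], const float gain[NB_BANDS]) {
    for (int b = 0; b < NB_BANDS; b++) {
        int start = band_edge[b], stop = band_edge[b + 1];
        for (int i = start; i < stop; i++) {
            /* interpolate between neighbouring band gains so the
             * attenuation varies smoothly across frequency */
            float frac = (float)(i - start) / (stop - start);
            int next = (b + 1 < NB_BANDS) ? b + 1 : b;
            float g = (1.f - frac) * gain[b] + frac * gain[next];
            spectrum[i] *= g;
        }
    }
}
```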

This works pretty well except for one case: when we have voiced speech and there's noise between the pitch harmonics.

And to handle that case, we have a pitch filter that acts as a comb filter and removes the noise between the harmonics, to get actual clean speech.
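A comb filter of this kind can be sketched as adding a scaled copy of the signal delayed by one pitch period: the harmonics add up coherently while the noise between them is averaged out. The strength parameter and the way the pitch period is obtained below are assumptions for illustration; RNNoise's actual filter is controlled per band.

```c
/* Sketch of a pitch (comb) filter: reinforce the signal at the pitch
 * period so that noise between the harmonics is attenuated. The filter
 * strength and pitch-tracking details are illustrative assumptions. */
#define FRAME_SIZE 480   /* 10 ms at 48 kHz */

void comb_filter(float out[FRAME_SIZE], const float in[FRAME_SIZE],
                 const float history[/* last pitch_period samples */],
                 int pitch_period, float strength) {
    for (int i = 0; i < FRAME_SIZE; i++) {
        /* sample one pitch period in the past (from this frame or the
         * previous one); harmonics add coherently, noise does not */
        float delayed = (i >= pitch_period) ? in[i - pitch_period] : history[i];
        out[i] = (in[i] + strength * delayed) / (1.f + strength);
    }
}
```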

(fan humming) In terms of results, you can hear here the effect of RNNoise being toggled on and off while I'm speaking and typing at the same time, with a fan in the background.

You can see the results in this slide; they are based on an objective test evaluation.

And you can also go to this interactive demo, where you can listen to several samples, and also actually try RNNoise on your own voice using JavaScript.

Now let's look at the complexity of RNNoise, for a 48 kilohertz mono input signal.

RNNoise uses 215 neurons, which means 88,000 weights, and it processes audio in frames of 10 milliseconds, which means we have 100 frames per second.

The total complexity in RNNoise is around 40 megaflops.

And the most complex parts, are first the DNN, which is mostly made of matrix-vector products.

And the complexity of that, is around 17 and a half megaflops.

We have FFTs and IFFTs, and those cost around seven and a half megaflops.

And then we have the pitch search, which uses a correlation or convolution and costs around 10 megaflops.

But these are the main parts.

So, if we wanna optimize RNNoise, then these are the things we need to look at.
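As a sanity check on the figure for the network, a back-of-the-envelope estimate (assuming one multiply and one add per weight per frame) reproduces the roughly 17.5 megaflops quoted above:

```c
/* Back-of-the-envelope estimate of RNNoise's per-second DNN cost.
 * Each weight contributes roughly one multiply and one add per frame. */
#include <stdio.h>

int main(void) {
    const double weights        = 88000;  /* total weights in the network */
    const double ops_per_weight = 2;      /* multiply + accumulate */
    const double frames_per_sec = 100;    /* 10 ms frames */

    double dnn_mflops = weights * ops_per_weight * frames_per_sec / 1e6;
    printf("DNN: ~%.1f MFLOPS\n", dnn_mflops);  /* ~17.6, matching the talk */
    return 0;
}
```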

The current code base, which you can find on GitHub, is C code, completely unoptimized and not vectorized.

And it still runs with about 1% CPU on x86, about 40% on a Raspberry Pi 3, and we even have a version that runs in real time in the browser, through Emscripten and JavaScript.
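For reference, this is roughly how the C library is used; the exact signatures have changed a little between versions, so treat this as a sketch and check rnnoise.h on GitHub.

```c
/* Minimal sketch of denoising a stream with the RNNoise C library.
 * Signatures follow the public rnnoise.h at the time of writing; they
 * have changed slightly between versions, so check the header. */
#include <stdio.h>
#include "rnnoise.h"

#define FRAME_SIZE 480   /* 10 ms at 48 kHz */

int main(void) {
    DenoiseState *st = rnnoise_create(NULL);  /* NULL = built-in model */
    float frame[FRAME_SIZE];
    short pcm[FRAME_SIZE];

    /* Read 16-bit mono 48 kHz PCM from stdin, write denoised PCM to stdout. */
    while (fread(pcm, sizeof(short), FRAME_SIZE, stdin) == FRAME_SIZE) {
        for (int i = 0; i < FRAME_SIZE; i++) frame[i] = pcm[i];
        float vad_prob = rnnoise_process_frame(st, frame, frame);
        (void)vad_prob;  /* probability that the frame contains speech */
        for (int i = 0; i < FRAME_SIZE; i++) pcm[i] = (short)frame[i];
        fwrite(pcm, sizeof(short), FRAME_SIZE, stdout);
    }
    rnnoise_destroy(st);
    return 0;
}
```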

Looking forward a bit, RNNoise is really a minimalistic solution, its DNN is really quite small compared to other approaches.

But in the future, you could see systems where it would grow by a factor of 100 or even 1000.

Its complexity is mostly made of matrix-vector products, and especially if we grow the DNN, the FFT will become negligible.

And so if we want it to run in real time, we need low overhead because we need many of these matrix-vector products every second.
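The core operation being repeated is just a dense matrix-vector product; the sketch below shows why the cost scales with the number of weights, and why per-call overhead has to stay low when many of these run every 10 milliseconds.

```c
/* Sketch of the core operation: a dense matrix-vector product. With
 * n_out outputs and n_in inputs it costs about 2*n_out*n_in flops, and a
 * recurrent denoiser runs many of these per frame, so per-call overhead
 * (dispatch, memory traffic, JS/wasm boundary crossings) must stay low. */
void gemv(float *out, const float *weights, const float *in,
          int n_out, int n_in) {
    for (int r = 0; r < n_out; r++) {
        float acc = 0.f;
        for (int c = 0; c < n_in; c++)
            acc += weights[r * n_in + c] * in[c];   /* multiply-accumulate */
        out[r] = acc;
    }
}
```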

In terms of pure DNN approaches, some of them are using really large convolutional networks.

And that involves complexity sometimes up to the tens of gigaflops, which may even require GPUs in some cases if we want them to run in real time.

And there's also a new approach that is emerging, and it's not yet clear what will become of it, but these are vocoder-based re-synthesis approaches, where the idea is to denoise acoustic features, rather than audio, and then use a TTS-like vocoder to resynthesize clean speech from them.

So it could potentially produce far fewer artifacts in the denoised speech.

And if we want that to run in real time, the most promising approaches are through WaveRNN or even LPCNet.

Those involve around 3 to 10 gigaflops, so less than some of the pure DNN approaches, but at the same time they require processing at the sample level, which means that many GPUs will not be able to process that in real time, and we will actually need a CPU, because we need to compute the network for every single sample at 16 or 24, or 48 kHz in the future.

That concludes my talk. For those interested, the RNNoise source code is available on GitHub under a BSD license.

You can also have a look at the demo page for many samples as well as some high level explanation.

And you can have a look at some of the references here if you want to read more about RNNoise and some of the topics for this talk.

Thank you.


