RNNoise, Neural Speech Enhancement, and the Browser
Jean-Marc Valin
jmvalin@jmvalin.ca
W3C Workshop on Web and Machine Learning
September 2020
(the audio for this talk is processed with RNNoise)

Hi, I'm Jean-Marc Valin, I'm the author of RNNoise.
And I'll be talking about neural speech enhancement through RNNoise, and also about how it relates to the browser.
I'm currently employed by Amazon, but I'm giving this talk as an individual.
Speech Enhancement
● The signal processing (DSP) way
  – Spectral estimators, hand-tuned parameters
  – Works on stationary noise at mid to high SNR
● The new deep neural network (DNN) way
  – Data driven, often large models (tens of MBs)
  – Handles non-stationary noise, low SNR
● RNNoise: trying to get DNN quality with DSP complexity

Speech enhancement isn't exactly a new topic.
It's been around since the 70s, and traditionally it's done using signal processing.
It uses complicated spectral estimators, usually combined with hand-tuned parameters.
And it works pretty decently on stationary noise at mid to high SNRs.
The complexity is very low, but the quality is limited.
On the other hand, there's a new approach, which is based on deep neural networks.
It's entirely data driven, so no need to tune all these parameters.
But they use large models, typically in the tens of megabytes, and they handle non-stationary noise and work at low SNR (signal-to-noise ratio), so much higher quality than the traditional approach.
But unfortunately, the complexity is quite high.
And RNNoise is a way of trying to get the best of both worlds.
So, trying to get the same quality as the DNN approach with the complexity of the DSP approach.
RNNoise: A Hybrid Solution
● Start from conventional DSP approach
● Replace complicated estimators with an RNN
● Divide spectrum into 22 “critical bands”
  – Independently attenuate each band
● Use “pitch filter” to remove noise between harmonics

RNNoise is really a hybrid solution.
It starts from a conventional DSP approach, and from there, it replaces these complicated estimators with a deep neural network that includes several fully connected layers as well as three GRU (gated recurrent unit) layers.
One of the key tricks to help bring the complexity down, is that the spectrum is divided into 22 critical bands, rather than processing every single frequency bin separately.
And each of these 22 bands is independently attenuated.
So we have a gain for each of these bands, and it controls how each band is modulated.
This works pretty well except for one case: when we have voiced speech and we have noise between the pitch harmonics.
And to handle that case, we have a pitch filter that acts as a comb filter and removes the noise between the harmonics, to get actual clean speech.
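To make those two steps concrete, here is a rough sketch in C of per-band attenuation and a pitch comb filter. This is illustrative only, not the actual RNNoise code: the function names are made up, and the real implementation uses overlapping bands with interpolated gains and derives the filter strength from the pitch correlation.

```c
#define NB_BANDS 22

/* Hypothetical sketch (not the actual RNNoise code): scale each FFT bin
 * by the gain predicted for the critical band it belongs to.
 * band_of[] maps a frequency bin to one of the 22 bands; in RNNoise the
 * bands actually overlap and the gains are interpolated between them. */
void apply_band_gains(float *spec_re, float *spec_im,
                      const int *band_of, const float gain[NB_BANDS],
                      int n_bins)
{
    for (int i = 0; i < n_bins; i++) {
        float g = gain[band_of[i]];  /* one gain per band, between 0 and 1 */
        spec_re[i] *= g;
        spec_im[i] *= g;
    }
}

/* Comb (pitch) filter sketch: mix in the signal delayed by one pitch
 * period, which reinforces the harmonics and attenuates the noise
 * between them. 'strength' would be derived from the pitch correlation. */
void pitch_comb_filter(const float *in, float *out, int n,
                       int period, float strength)
{
    for (int i = 0; i < n; i++) {
        float delayed = (i >= period) ? in[i - period] : 0.0f;
        out[i] = (1.0f - strength) * in[i] + strength * delayed;
    }
}
```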
Results (Quality)
● Interactive Demo: https://people.xiph.org/~jm/demo/rnnoise/

(fan humming)

In terms of results, you can hear here the effect of RNNoise being toggled on and off while I'm speaking and typing at the same time, with a fan in the background.
You can observe the results on this slide; they are based on an objective evaluation.
And you can also go to this interactive demo, where you can listen to several samples, and also actually try RNNoise on your own voice using JavaScript.
Complexity (48 kHz)
● Requires 215 neurons, 88k weights
● Based on 10-ms frames
● Total complexity: ~40 MFLOPS
  – DNN (matrix-vector multiply): 17.5 MFLOPS
  – FFT/IFFT: 7.5 MFLOPS
  – Pitch search (convolution): 10 MFLOPS
● Unoptimized C code
  – 1.3% CPU on x86, 14% CPU on Raspberry Pi 3
  – Real-time with asm.js via Emscripten

Now let's look at the complexity of RNNoise, for a 48 kilohertz mono input signal.
RNNoise uses 215 neurons, which means 88,000 weights, and it processes audio in frames of 10 milliseconds, which means we have 100 frames per second.
The total complexity of RNNoise is around 40 megaflops (a megaflop being one million floating-point operations per second).
And the most complex part is the DNN, which is mostly made of matrix-vector products.
And the complexity of that is around 17 and a half megaflops.
We have FFTs and IFFTs, and those cost around 7 and a half megaflops.
And then we have a pitch search, which uses a correlation or convolution, and costs around 10 megaflops.
These are the main parts.
So, if we wanna optimize RNNoise, then these are the things we need to look at.
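As a back-of-the-envelope check on these numbers (a sketch derived from the figures above, not profiling output), the DNN cost follows directly from the weight count and the frame rate:

```c
#include <stdio.h>

int main(void)
{
    /* Rough FLOP budget from the numbers in the talk. */
    double weights = 88000.0;       /* total DNN weights */
    double frames_per_sec = 100.0;  /* 10-ms frames */

    /* Each weight costs roughly one multiply and one add per frame. */
    double dnn_mflops = weights * 2.0 * frames_per_sec / 1e6;  /* ~17.6 */

    double fft_mflops = 7.5;    /* FFT/IFFT */
    double pitch_mflops = 10.0; /* pitch search (correlation) */

    printf("DNN:        %.1f MFLOPS\n", dnn_mflops);
    printf("Main parts: %.1f MFLOPS (total is ~40 with overhead)\n",
           dnn_mflops + fft_mflops + pitch_mflops);
    return 0;
}
```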
The current code base, which you can find on GitHub, is C code, completely unoptimized and not vectorized.
And it still runs with about 1% CPU on x86 and about 14% on a Raspberry Pi 3, and we even have a version that runs in real time in the browser, through Emscripten and JavaScript.
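For anyone who wants to try the C library directly, a minimal usage sketch looks roughly like this; it is based on the public API in rnnoise.h (rnnoise_create, rnnoise_process_frame, rnnoise_destroy), but check the header in the repository since signatures can differ between versions:

```c
/* Minimal sketch of denoising a raw 48 kHz, 16-bit mono stream with the
 * public RNNoise API; see rnnoise.h in the repository for the exact
 * signatures, which may differ between versions. */
#include <stdio.h>
#include "rnnoise.h"

#define FRAME_SIZE 480  /* 10 ms at 48 kHz */

int main(void)
{
    short pcm[FRAME_SIZE];
    float frame[FRAME_SIZE];
    DenoiseState *st = rnnoise_create(NULL);  /* NULL selects the built-in model */
    FILE *fin = fopen("noisy.raw", "rb");
    FILE *fout = fopen("denoised.raw", "wb");
    if (!st || !fin || !fout) return 1;

    while (fread(pcm, sizeof(short), FRAME_SIZE, fin) == FRAME_SIZE) {
        /* RNNoise processes float samples kept in the 16-bit range */
        for (int i = 0; i < FRAME_SIZE; i++) frame[i] = pcm[i];
        rnnoise_process_frame(st, frame, frame);
        for (int i = 0; i < FRAME_SIZE; i++) pcm[i] = (short)frame[i];
        fwrite(pcm, sizeof(short), FRAME_SIZE, fout);
    }
    fclose(fin);
    fclose(fout);
    rnnoise_destroy(st);
    return 0;
}
```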
Looking Forward (And Bigger)
● RNNoise
  – DNN could still grow by 100x to 1000x
  – Need fast matrix-vector product, low overhead
● Pure-DNN approaches
  – Some approaches use large convolutional networks
  – Up to 10s of GFLOPS (may require GPU)
● Vocoder-based re-synthesis
  – TTS-like systems using denoised acoustic features
  – WaveRNN/LPCNet: 3-10 GFLOPS, sample latency

Looking forward a bit, RNNoise is really a minimalistic solution; its DNN is quite small compared to other approaches.
But in the future, you could see systems where it would grow by a factor of 100 or even 1000.
It is mostly made of matrix-vector products, and especially if we grow the DNN, the FFT (fast Fourier transform) will become negligible.
And so if we want it to run in real time, we need low overhead because we need many of these matrix-vector products every second.
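The kernel that has to be fast is essentially just this (a naive reference version for illustration; a real implementation would be vectorized and keep the weights in a cache-friendly layout):

```c
/* Naive reference matrix-vector product y = W*x, the operation that
 * dominates the DNN cost: about 2*rows*cols FLOPs per call, repeated
 * for every layer on every 10-ms frame. */
void matvec(float *y, const float *W, const float *x, int rows, int cols)
{
    for (int i = 0; i < rows; i++) {
        float sum = 0.0f;
        for (int j = 0; j < cols; j++)
            sum += W[i * cols + j] * x[j];
        y[i] = sum;
    }
}
```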
In terms of pure DNN approaches, some of them are using really large convolutional networks.
And that involves complexity sometimes up to tens of gigaflops (billions of floating-point operations per second), which may even require GPUs in some cases if we want to run in real time.
And there's also a new approach that is emerging; it's not yet clear what will become of it. These are vocoder-based re-synthesis approaches, where the idea is to denoise acoustic features rather than the audio itself, and then use a TTS-like (text-to-speech-like) vocoder to resynthesize clean speech from those features.
So it could potentially produce far fewer artifacts in the denoised speech.
And if we want that to run in real time, the most promising approaches are through WaveRNN or even LPCNet.
Those involve around 3 to 10 gigaflops, so less than some of the pure DNN approaches, but at the same time they require processing at the sample level, which means that many GPUs will not be able to process that in real time, and we will actually need a CPU, because we need to compute the network for every single sample at 16, 24, or 48 kHz in the future.
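To see why the sample-level constraint favors CPUs, here is a back-of-the-envelope sketch assuming the 3-10 gigaflop range above:

```c
#include <stdio.h>

int main(void)
{
    /* With sample-level processing, the whole network must finish within
     * one sample period, so there is no batching to amortize the kernel
     * launch and memory transfer overhead that GPUs rely on. */
    double sample_rate = 16000.0;  /* 16 kHz vocoder output */
    double gflops = 3.0;           /* low end of the 3-10 GFLOPS range */

    double flops_per_sample = gflops * 1e9 / sample_rate;  /* ~187,500 */
    double period_us = 1e6 / sample_rate;                  /* 62.5 us */

    printf("~%.0f kFLOPs per sample, one sample every %.1f us\n",
           flops_per_sample / 1e3, period_us);
    return 0;
}
```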
Resources
● RNNoise source code (BSD): https://github.com/xiph/rnnoise/
● Demo page: https://jmvalin.ca/demo/rnnoise/
● References
  – J.-M. Valin, A Hybrid DSP/Deep Learning Approach to Real-Time Full-Band Speech Enhancement, Proc. MMSP Workshop, arXiv:1709.08243, 2018.
  – S. Maiti, M.I. Mandel, Speaker Independence of Neural Vocoders and Their Effect on Parametric Resynthesis Speech Enhancement, Proc. ICASSP, pp. 206-210, 2020.
  – N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, K. Kavukcuoglu, Efficient Neural Audio Synthesis, arXiv:1802.08435, 2018.
  – J.-M. Valin, J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, Proc. ICASSP, arXiv:1810.11846, 2019.

That concludes my talk. For those interested, the RNNoise source code is available on GitHub under a BSD license.
You can also have a look at the demo page for many samples as well as some high-level explanations.
And you can have a look at some of the references here if you want to read more about RNNoise and some of the topics for this talk.
Thank you.