W3C Workshop on Web and Machine Learning

Machine Learning on the Web for content filtering applications - by Oleksandr Paraska (eyeo)


Hello, my name is Oleksandr Paraska.

I work at Eyeo GmbH, the company behind the popular browser extension AdBlock Plus.

I will present the use cases for machine learning that we have encountered while working on the problem of content filtering on the web.

One of our motivations for this presentation is to argue that our use case should be represented within the Web Neural Networks API. I hope this will be useful for the W3C group, and I hope we can have a discussion afterwards.

So before I start, I quickly want to give a short overview of how content filtering on the web works today, and where machine learning fits into that picture.

So right now, there is a community of filter list authors who craft filter lists: filter rules that define what needs to be blocked on the web.

And then there is separate software that downloads those filter rules.

And then when the user uses the software to browse the internet, only the filtered content is rendered.

Essentially, there are two ways you can block content on the web.

There is network level blocking, and there is what I would call DOM level blocking.

Network level blocking is essentially a URL classification problem, where you have a list of URLs, maybe with some metadata for each URL.

And then you want to understand if the specific URL should be blocked or not.

And the key point I wanted to bring up here is that network level blocking is not the use case we see for the Web Neural Networks API, so we would like to talk only about DOM level blocking.

We have experimented with network level blocking using machine learning, but what we found is that it's at least 50 times slower than our current implementations.

And it's also not clear whether there are any benefits to using it.

So that's why we want to focus only on DOM level blocking.

That being said, there is already an implementation of machine learning for a similar technique.

And Safari's Intelligent Tracking Prevention is probably the best and most widely used implementation of machine learning for content filtering.

So that's something that is definitely of interest for us.

So about content blocking on the web: DOM level ad blocking means that when a network request delivers both an ad and content, you cannot just block the network request, because you would also block the content.

Obviously, you then have to deal with the ad at a higher level, the DOM level, and there you have multiple options.

You can either apply a CSS selector to hide the specific element, or you can run some JavaScript code to identify which element to hide.
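As a rough, hypothetical illustration of those two options, here is a minimal TypeScript sketch; the ".ad-banner" selector and the "Sponsored" text check are made up for illustration and are not real filter rules:

```typescript
// Minimal sketch of the two DOM-level hiding options described above.
// The selector and the predicate are purely illustrative.

// Option 1: element hiding via an injected CSS rule.
function hideBySelector(selector: string): void {
  const style = document.createElement("style");
  style.textContent = `${selector} { display: none !important; }`;
  document.head.appendChild(style);
}

// Option 2: element hiding via JavaScript that inspects candidate
// elements and hides the ones a custom predicate flags as ads.
function hideByPredicate(isAd: (el: Element) => boolean): void {
  for (const el of Array.from(document.querySelectorAll("div, section, aside"))) {
    if (isAd(el)) {
      (el as HTMLElement).style.setProperty("display", "none", "important");
    }
  }
}

hideBySelector(".ad-banner");
hideByPredicate(el => el.textContent?.includes("Sponsored") ?? false);
```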

And as the escalation between advertisers and ad blockers continued, it became clear that CSS selectors alone are often not enough, and you have to rely on JavaScript code more and more.

So recently, we have deployed a machine learning model that runs on some websites to block advertising.

So that's why we wanted to talk about our use case for Web Neural Networks API. Essentially, at the DOM level, ad blocking means that you need to find relevant features on the website, and you need to classify which elements need to be hidden based on those features.

And then you have the regular features in the HTML world, which are class names, IDs, and so on.

But machine learning brings you a more powerful language to classify elements.

And I wanted to bring up at least two more ways we can classify elements with machine learning.

So the most obvious way to classify elements once you have machine learning is the perceptual way: recent advances in machine learning have given us the tools and understanding to target elements based on how they look, not on their metadata.

And so that means that you can just look at the image and maybe predict if there's something that is an ad or not an ad.

And we have implemented a similar algorithm in our extension.

We didn't use machine learning for that.

We only used perceptual hashing.

Perceptual hashing is essentially a technique that gives similar hashes to images that look similar.

And we have discovered that yes, it works, and on most websites where we want to apply this, the inventory of ad images is fairly limited.

So we can list all the images or all the hashes, and we can apply these filters.
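To make the idea concrete, here is a minimal sketch of one common perceptual hashing scheme, the 8x8 average hash; this is illustrative and not necessarily the exact algorithm used in the extension:

```typescript
// A minimal average-hash ("aHash") sketch: downscale the image to 8x8
// grayscale and emit one bit per pixel (1 if brighter than the mean).
// Visually similar images yield hashes with a small Hamming distance.
function averageHash(img: HTMLImageElement, size = 8): string {
  const canvas = document.createElement("canvas");
  canvas.width = size;
  canvas.height = size;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0, size, size);

  // getImageData throws a SecurityError if the canvas is tainted by a
  // cross-origin image without CORS approval (the issue discussed below).
  const { data } = ctx.getImageData(0, 0, size, size);

  const gray: number[] = [];
  for (let i = 0; i < data.length; i += 4) {
    gray.push(0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2]);
  }
  const mean = gray.reduce((a, b) => a + b, 0) / gray.length;
  return gray.map(v => (v > mean ? "1" : "0")).join("");
}

// A candidate hash can then be compared against a list of known ad-image
// hashes by Hamming distance.
function hammingDistance(a: string, b: string): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}
```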

However, we stumbled upon a problem: a canvas tainting issue, where we are not able to access the raw data of the images if the website owner does not allow us to.

That's only an issue in Chrome, but it's still a big issue.

In Firefox, we have a way around it.
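For context, here is a minimal, hypothetical check showing what the restriction looks like in practice:

```typescript
// Reading pixels back from a canvas that contains a cross-origin image
// served without CORS approval throws a SecurityError, so hashes like
// the one above cannot be computed for that image.
function canReadPixels(img: HTMLImageElement): boolean {
  const canvas = document.createElement("canvas");
  canvas.width = canvas.height = 1;
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(img, 0, 0, 1, 1);
  try {
    ctx.getImageData(0, 0, 1, 1); // throws if the canvas is tainted
    return true;
  } catch {
    return false; // tainted: the raw image data is not accessible
  }
}
```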

And so the question now is: should there be something in the Web Neural Networks API, or in general, that allows machine learning models to run on images that are tainted? If the model does not produce or leak any information to the user, then maybe that should be fine.

And that's just the question I wanted to raise, and maybe we can discuss it moving forward.

And then I would like to briefly go into the second kind of features that you could use for element hiding on the web, and that is structural features.

And the reason why we want to use structural features is shown by the image here; it's from the Percival paper, where they instrumented the Chromium browser to run inference on every image that goes through the rendering pipeline.

And they are essentially classifying every image in the rendering pipeline, and that seems to work.

But what you can clearly see from the paper is that it's not possible to classify based on image data alone, because the same image can be both an ad and not an ad depending on the context.

So you need to have some context, which means you have to have structural data around it.

And that's the second kind of features that you would want to use, the structural data.

Structural data is important because it can be used as a feature of its own.

For example, on one of the social networks, you can see how the sponsored label is rendered as a whole bunch of spans and obfuscations, which are essentially impossible to pinpoint using CSS selectors.

So you have to use some other clever techniques.

One technique you can use is to build a graph of this element, of the whole post I would say, and then classify that graph as an ad or not an ad.

And then you should be able to classify whether something is an ad purely based on the structure, because hopefully non-ads would not have this obfuscation built in, or there would be other features that are structurally different.
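As a hedged illustration of what purely structural matching could look like, here is a minimal sketch that serializes only the tag structure of a subtree and compares it against a hypothetical "sponsored label" pattern; the pattern is made up for illustration:

```typescript
// A minimal sketch of purely structural matching: serialize only the tag
// structure of a subtree (ignoring classes, IDs and text, which may be
// obfuscated) and compare it against a known "ad" pattern.
function structuralSignature(el: Element): string {
  const children = Array.from(el.children).map(structuralSignature).join(",");
  return children
    ? `${el.tagName.toLowerCase()}(${children})`
    : el.tagName.toLowerCase();
}

// e.g. a "Sponsored" label rendered as a <div> full of obfuscated <span>s
const sponsoredLabelPattern = /^div\(span\(span,span,span,span(,span)*\)\)$/;

function looksLikeSponsoredLabel(el: Element): boolean {
  return sponsoredLabelPattern.test(structuralSignature(el));
}
```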

And that's maybe the key insight we wanted to communicate here: structural information, and running machine learning models on it, is very useful in general.

And it's also very useful for our purposes of content filtering on the web.

The way we see structural information here is that each DOM is a tree, so we can represent it as an adjacency matrix.

So you can see here, for example, that a <div> is at the top of the tree, and then you have the leaves, such as <p> and <h1>.

And because you have represented the structure as an adjacency matrix, you can also represent the features of each node as a separate matrix, a feature matrix.

So for example, a <div> is number 29 in our use case.

So then we say that 29 is the feature value for the first element, and so on.

So you can imagine that the type of an element is only one kind of feature.

You can have as many features as you want.

And the more features you have, the slower the algorithm will probably be, but essentially that gives you a lot of power to target specific elements.
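Here is a minimal sketch of how a DOM subtree could be turned into these two matrices; the tag-to-number mapping (with <div> as 29, echoing the slide) is illustrative, and a real model would use a full tag vocabulary and richer per-node features:

```typescript
// Turn a DOM subtree into an adjacency matrix for the tree structure and
// a feature matrix with one row per node.
const TAG_IDS: Record<string, number> = { div: 29, p: 7, h1: 12, span: 3 };

function domToMatrices(root: Element): { adj: number[][]; features: number[][] } {
  // Collect nodes depth-first so the subtree root ends up at index 0.
  const nodes: Element[] = [];
  (function walk(el: Element) {
    nodes.push(el);
    for (const child of Array.from(el.children)) walk(child);
  })(root);

  const n = nodes.length;
  const adj = Array.from({ length: n }, () => new Array<number>(n).fill(0));
  const features = nodes.map(el => [TAG_IDS[el.tagName.toLowerCase()] ?? 0]);

  nodes.forEach((el, i) => {
    for (const child of Array.from(el.children)) {
      const j = nodes.indexOf(child);
      adj[i][j] = 1; // undirected edge between parent and child
      adj[j][i] = 1;
    }
  });

  return { adj, features };
}
```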

So once you are able to work with data like this, you are able to solve either graph isomorphism problems or node classification problems.

And those are the two things that graph convolution neural networks are particularly good at.

And those are the things that we are already experimenting with.
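To illustrate, here is a minimal sketch of a single graph convolution layer over those matrices, using the common propagation rule H' = ReLU(Â · H · W); the weights are random placeholders, not a trained model:

```typescript
// One graph convolution layer: Â is the adjacency matrix with self-loops
// added and each row normalized by node degree.
function matMul(a: number[][], b: number[][]): number[][] {
  return a.map(row =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

function graphConvLayer(adj: number[][], features: number[][], weights: number[][]): number[][] {
  // Â = D^-1 (A + I): add self-loops, then normalize each row by its degree.
  const aHat = adj.map((row, i) => {
    const withSelf = row.map((v, j) => (i === j ? v + 1 : v));
    const degree = withSelf.reduce((a, b) => a + b, 0);
    return withSelf.map(v => v / degree);
  });
  const propagated = matMul(aHat, features);      // aggregate neighbour features
  const projected = matMul(propagated, weights);  // linear projection
  return projected.map(row => row.map(v => Math.max(0, v))); // ReLU
}

// Hypothetical usage with the matrices from domToMatrices():
//   const { adj, features } = domToMatrices(somePostElement);
//   const weights = [[0.1, -0.2, 0.05, 0.3]];  // 1 input feature -> 4 hidden dims
//   const hidden = graphConvLayer(adj, features, weights);
```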

So we already have code running in our extension that uses a built-in machine learning model to classify ads on social networks, based purely on structural data.

And on slide seven, I wanted to also communicate that it's not just us who work on these structural problems.

There is a recent paper, AdGraph, that is looking at a very similar thing.

There they are using an instrumented Chromium browser to produce graphs.

And then they classify those graphs based on custom extracted features.

So maybe in this case they're not using graph convolutional neural networks, but the problem of running machine learning on graphs on the web is a very, very fruitful one.

And that's what we wanted to communicate in this presentation.

And on slide eight, I wanted to finish by saying that it's very fruitful to look at the DOM as a graph.

However, it's not a static graph, it's a dynamic graph because every mutation of an element changes that graph.

So it's also useful to look at filtering those mutations as a machine learning problem, or maybe to look at the whole thing as a dynamic graph classification problem.
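A minimal sketch of that dynamic view, with a hypothetical classifySubtree() standing in for whatever model actually scores a subtree: observe DOM mutations and re-classify only the affected subtree.

```typescript
// Placeholder: stand-in for the real model (e.g. the graph convolution
// sketch above). Always returns false here.
async function classifySubtree(root: Element): Promise<boolean> {
  return false;
}

const observer = new MutationObserver(async mutations => {
  for (const mutation of mutations) {
    const target =
      mutation.target instanceof Element ? mutation.target : mutation.target.parentElement;
    if (!target) continue;
    if (await classifySubtree(target)) {
      (target as HTMLElement).style.setProperty("display", "none", "important");
    }
  }
});

observer.observe(document.body, { childList: true, subtree: true, attributes: true });
```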

And in that case, we wanted to see if there are any other techniques the community would like to embed into the Web Neural Networks API. I also wanted to say that we already have this code running, and if you're interested in having a look, we have a pre-trained model there, and also the code that we're using.

And lastly, I wanted to say that, as I mentioned at the start, everything starts with the community of people who write the filter lists.

And we certainly want to ensure that this community, as always, is also able to maintain the models.

So for that, we are very interested in federated learning problems and how they relate to Web Neural Networks.

And we have experimented with TFJS a little bit, but we wanted to understand how the Web Neural Networks API would interact with federated learning.
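For illustration, here is a minimal sketch of what the client side of one federated learning round could look like with TensorFlow.js; the model shape, the local examples, and the "/federated/update" endpoint are all hypothetical:

```typescript
import * as tf from "@tensorflow/tfjs";

// Fine-tune a small model on local examples, then ship only the updated
// weights (never the raw data) to an aggregation server.
async function localFederatedRound(localXs: number[][], localYs: number[]): Promise<void> {
  const model = tf.sequential({
    layers: [
      tf.layers.dense({ units: 16, activation: "relu", inputShape: [localXs[0].length] }),
      tf.layers.dense({ units: 1, activation: "sigmoid" }),
    ],
  });
  model.compile({ optimizer: "adam", loss: "binaryCrossentropy" });

  // Train locally; the user's browsing data never leaves the device.
  await model.fit(tf.tensor2d(localXs), tf.tensor1d(localYs), { epochs: 1 });

  // Extract the updated weights and send them off for federated averaging.
  const weights = await Promise.all(model.getWeights().map(w => w.array()));
  await fetch("/federated/update", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(weights),
  });
}
```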

And on that, I will end.

Thank you very much.

