W3C Workshop on Web and Machine Learning

Heterogeneous parallel programming with open standards using oneAPI and Data Parallel C++ - by Jeff Hammond (Intel)



Hi, I'm Jeff Hammond from Intel, and I'll be talking about “Heterogeneous Parallel Programming With Open Standards Using oneAPI and Data Parallel C++”.

The motivation for what we're doing here is that we have an ever-increasing diversity and complexity in computer architecture.

This has been going on for 20 years or so with the introduction of multicore and SIMD units, and obviously GPUs and other forms of accelerators.

And as this audience knows, AI accelerators and special-purpose processors have just exploded over the last few years, and there's really no indication that there's gonna be any convergence or simplification in architecture.

We're gonna be dealing with this problem for a while. And furthermore, even within families of architectures like GPUs or FPGAs, there are obviously different vendors, different programming models, and different execution models.

So even within one architectural family, there are still things that one needs to worry about.

So what we'd really like to do to solve at least part of the problem here is come up with a base software platform that is capable of working everywhere.

It won't be perfectly optimal on every processor.

That's not really possible with a single source code, but what we'd like to have is a single software architecture that works in a whole bunch of places.

And you can start with working code and then you can go off and tune it when you need to.

And hopefully you can get a lot of code reuse and you can focus your effort on performance tuning when that's important.

So here's what Intel is doing with oneAPI; if you look at the left, there's a diagram.

Obviously we have applications.

There are millions of those.

And many of those applications share middleware or frameworks; obviously things like TensorFlow are really well-known frameworks.

And then there's OpenMP as middleware that people use in high-performance computing. What oneAPI is trying to be is an industry initiative, which of course Intel will productize, but it's an open standard specification for a set of things that live in the layer between the frameworks and middleware on one side and a variety of different architectures on the other.

So we, Intel, make CPUs, GPUs, FPGAs, and AI processors, so even within our company there's a big scope of architectures.

And obviously in the greater silicon ecosystem, there's just a tremendous amount of stuff that could be the bottom of this stack.

Now, if we look at the details, what the oneAPI industry specification is trying to do is address at least two problems, one of which is direct programming.

So for that, we have something called Data Parallel C++, which I'll explain in just a second.

And then we have API based programming.

Of course, that's the sort of libraries that people are used to.

So if you're used to AI programming, you're probably not writing a great deal of direct code.

You're using a framework; you might be using Python.

You might be calling a deep learning library or a data analytics library.

There's a lot of middleware that gets reused in this space.

And it's important to have open standards for both of these things, because if you only have direct programming standards, well, then fine: you either have to write it all yourself, or you have to write some code that might be standard and then couple it to a whole bunch of nonstandard libraries.

And it doesn't really address the problem.

You're not gonna get out-of-the-box working code on a new platform if none of the library APIs are standard.

So Khronos SYCL 2020 is the heterogeneous parallel programming standard that Data Parallel C++ is building off of.

So the Data Parallel C++ compiler is Clang-based.

It's open source.

It's obviously implementing ISO C++ 'cause that's what Clang does.

And then we're working on SYCL support.

So SYCL 2020 is a provisional specification that is being worked on this year.

So hopefully it'll be polished by the end of the year.

And it builds on SYCL 1.2.1, which is the prior standard that's out there already.

And Intel worked with the SYCL community to implement a number of new features that are going into SYCL 2020 that are important for usability.

These are things like unified shared memory, which is pointer-based memory management, which was requested by a lot of users.

Reductions are important for a lot of workloads.

Subgroups are something that help you do more performance tuning.

And then in-order queues were a convenience feature for people whose codes just mapped naturally onto those instead of out-of-order queues.
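To make those features concrete, here is a minimal sketch of what unified shared memory, an in-order queue, and a SYCL 2020 reduction look like in Data Parallel C++. This is my illustration rather than code from the talk; names like n and data are arbitrary, and the header is <sycl/sycl.hpp> in recent toolchains (older ones use <CL/sycl.hpp>).

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // In-order queue: commands execute in submission order, so no
      // explicit dependency management is needed.
      sycl::queue q{sycl::property::queue::in_order{}};

      const size_t n = 1024;
      // Unified shared memory: pointers usable on both host and device.
      float* data = sycl::malloc_shared<float>(n, q);
      float* sum  = sycl::malloc_shared<float>(1, q);
      for (size_t i = 0; i < n; ++i) data[i] = 1.0f;
      *sum = 0.0f;

      // SYCL 2020 reduction: the runtime combines per-work-item
      // contributions into *sum.
      q.parallel_for(sycl::range<1>{n},
                     sycl::reduction(sum, sycl::plus<float>()),
                     [=](sycl::id<1> i, auto& acc) { acc += data[i]; });
      q.wait();

      std::cout << *sum << std::endl;  // prints 1024
      sycl::free(data, q);
      sycl::free(sum, q);
      return 0;
    }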

So Intel is gonna continue to work with the SYCL community to bring additional features.

Obviously we want to see everybody else contribute as well, but we're working with the SYCL community as a way to develop a heterogeneous programming language that's standard and not proprietary and not closed source.

So you can go to GitHub and you can see the work in progress.

There's the upstreaming part of this, which we've talked about with the LLVM community, but our compiler is on GitHub, and the extensions we were building last year for SYCL, prior to standardization, were documented there.

And of course, we then helped contribute them to the Khronos SYCL provisional spec.

So why SYCL?

So OpenCL, which is sort of an ancestor to SYCL, has a well-defined portable execution model, which is really important.

Obviously you wanna start with something that has a portable concept in it, but a lot of application programmers find it too verbose.

And unfortunately that verbosity might've been addressed by middleware using OpenCL as a compiler target, but the fact is that didn't really happen, unfortunately.

And even if there were an OpenCL ecosystem, there are still people who wanna write directly in the language, and OpenCL just doesn't have the modern C++ support that a lot of programmers are looking for.

And so SYCL is based on modern C++.

It starts clean with a C++11 view of the world.

It uses things like lambdas as a very important central concept.

And it uses those features to have a single source heterogeneous programming model.

So you can read top to bottom even if your code might be going off to an accelerator.

It's really nice to not have that heterogeneous offloaded kernel be in a string or some separate file or some separate function.

You can actually read top to bottom; it's kinda nice.
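As an illustration of that single-source style (my sketch, not the speaker's code), here is a complete vector addition where the kernel is an ordinary C++ lambda inline in the host program:

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
      std::vector<float> a(256, 1.0f), b(256, 2.0f), c(256);
      sycl::queue q;  // default device

      {
        // Buffers manage data movement between host and device.
        sycl::buffer bufA{a}, bufB{b}, bufC{c};

        q.submit([&](sycl::handler& h) {
          sycl::accessor A{bufA, h, sycl::read_only};
          sycl::accessor B{bufB, h, sycl::read_only};
          sycl::accessor C{bufC, h, sycl::write_only};
          // The kernel is just a lambda -- no separate source string or file.
          h.parallel_for(sycl::range<1>{256}, [=](sycl::id<1> i) {
            C[i] = A[i] + B[i];
          });
        });
      }  // buffers go out of scope: results are copied back to the vectors

      return c[0] == 3.0f ? 0 : 1;
    }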

And SYCL parallelism, if you have never seen it before, looks like Intel's Threading Building Blocks or C++ Parallel STL. It looks like some other things too, but it's explicit in its control over the hardware resources.

So you say, "Hey, I wanna run on a CPU," or, "Hey, I wanna run on a GPU." You can obviously pick the default device, and that's gonna work well too, but in some of these complex heterogeneous systems, you really wanna have explicit control over all your hardware resources.

And SYCL is really the first standard programming model designed to address this problem.
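For instance, explicit device selection looks roughly like this, using SYCL 2020-style selectors (earlier SYCL versions spell these as classes such as sycl::cpu_selector); again, this is my sketch rather than code from the talk:

    #include <sycl/sycl.hpp>
    #include <iostream>

    int main() {
      // Ask for specific device types explicitly...
      sycl::queue cpu_q{sycl::cpu_selector_v};
      sycl::queue gpu_q{sycl::gpu_selector_v};  // throws if no GPU is present

      // ...or enumerate everything and choose yourself.
      for (const auto& dev : sycl::device::get_devices())
        std::cout << dev.get_info<sycl::info::device::name>() << "\n";

      return 0;
    }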

So there are things like Kokkos and RAJA from some of the Department of Energy labs, which are trying to address heterogeneous programming, and they're open source and they're portable and that's fantastic, but there's also some value in having a standard that can have a bunch of different implementations.

And so that's where SYCL is different.

It's an industry standard that is designed to be implemented a bunch of different times, as opposed to a single standard open source distribution, which also works for other projects.

So the SYCL ecosystem, as of a month or so ago: there are four well known compilers that are relatively complete.

So the Intel Data Parallel C++ family is the open source version.

Then we obviously productize that.

The open source version has contributions from Codeplay to support an Nvidia back-end.

And our compiler will support OpenCL SPIR-V back-ends, whether or not those are Intel devices.

I personally worked a little bit on some of that porting.

It's not done yet, but there's clearly the opportunity to run this across a wide range of devices, not just Intel's hardware.

So Codeplay has their own implementation called ComputeCpp.

They have a free community edition, which is great and which I use all the time, and then they have a commercial product, if that's something you need.

And they support a wide range of hardware, including OpenCL SPIR-V back-ends, as well as Nvidia.

So Xilinx Research has something called triSYCL, which is open source.

I use it on my laptop all the time.

It's not a compiler.

It's actually header-based; it uses Boost and a lot of modern C++, and therefore it'll pretty much work anywhere you can get a modern C++ compiler and Boost.

Now, I don't own any Xilinx FPGAs, but they obviously have some interest in that.

And I don't know how well it works there, so hopefully it does.

So hipSYCL is the last major implementation.

So this comes from the University of Heidelberg, and it's sort of in the name that it supports HIP, which is the AMD GPU compiler back-end; it also supports CUDA, and they have OpenMP support.

So if you look across all of these four different compiler implementations, what you see is CPU support, GPU support, and FPGA support across a lot of different vendors.

And so there's a really rich ecosystem out there that people can use.

So it's natural to ask how well does SYCL work?

Is it actually fulfilling any of this promise of working well on multiple devices?

So Tom Deakin and Simon McIntosh-Smith from the University of Bristol have written a paper.

There's a video of them on YouTube and the code is on GitHub, so you can check it all out.

I'm not gonna go into any details other than, for a bandwidth-limited workload, they showed that they can get performance portability with SYCL across GPUs from all three of Intel, Nvidia, and AMD.

There's a little bit of a bug in the Xeon OpenCL implementation that'll be fixed.

So that's really an artifact of the measurement, but this shows that you can get performance portability with SYCL relative to either CUDA or HIP.

So this is another measurement of performance portability in SYCL, by Argonne's Brian Homerding and John Tramm.

And again, you can find all the details online.

This is, I think, really interesting, 'cause it shows, if you look at the axis here, you've got winners on both sides of the axis.

So obviously in some limit you can always make a nonportable proprietary language work better than a portable one, because you can always use all the features plus some nonportable ones, but at least in the code that is in the RAJAPerf test suite, there are examples of code that works better with SYCL and examples that work better with CUDA.

So this shows that depending on how you write the code and how much you tune it, et cetera, you might find that you can get better performance with a standard; you're not necessarily compromising.

The other thing about oneAPI that's important is it's not just about the code that you're writing.

You also wanna have libraries and frameworks supported.

So on the left is an actual news story that I found amusing.

And the thing on the right I, of course, made up, but it makes the point that it's really fantastic to outsource the programming of high-performance code to libraries and focus on what you actually wanna do, whether it's AI or science or who knows what.

So among the oneAPI libraries that exist, there's one for C++ data parallelism through SYCL, which is really the standard library support.

There's a math library, a deep learning library, and some other stuff for AI, data analytics, and video.

You can find all the details online at the link you'll see on the next slide.
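As a hedged example of the API-based side, here is roughly what using the oneAPI DPC++ Library (oneDPL) looks like: standard C++ algorithms dispatched to the default SYCL device. This is my sketch, and exact header locations vary by toolchain version.

    #include <oneapi/dpl/execution>
    #include <oneapi/dpl/algorithm>
    #include <oneapi/dpl/iterator>
    #include <sycl/sycl.hpp>

    int main() {
      sycl::buffer<int> buf{sycl::range<1>{1024}};

      // oneapi::dpl::begin/end adapt a SYCL buffer for standard algorithms;
      // dpcpp_default runs them on the default SYCL device.
      auto first = oneapi::dpl::begin(buf);
      auto last  = oneapi::dpl::end(buf);
      std::fill(oneapi::dpl::execution::dpcpp_default, first, last, 42);
      std::sort(oneapi::dpl::execution::dpcpp_default, first, last);

      return 0;
    }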

So if you wanna learn more about oneAPI, there's oneapi.com.

It has the specifications.

Then there's the product implementation.

If you're using Linux, you can get it in a whole bunch of different ways: package managers, Docker, binaries, installers.

It also supports Windows.

If you don't wanna do anything to your computer, you can sign up for a DevCloud account and try CPU, GPU, and FPGA hardware there.

So you don't even have to buy or set up an FPGA if you've always wanted to try high-level synthesis with SYCL.

So there's some links down below.

The first one is a super simple tutorial I wrote, working all the way up to a complex reverse time migration application.

And you can check that stuff out, 'cause it's all on GitHub.

