Hey, I am Oguz, and in this presentation I will talk about SIMD operations in WebGPU for machine learning.
This is a high-level presentation that doesn't go into too much technical detail.
Subgroups are subdivisions of threadgroups (workgroups, in WebGPU terms); they are also known as SIMDgroups, warps, and waves. Their operations can make sharing and reducing data across the threads in a subgroup measurably faster, and we can have these operations in WebGPU Shading Language.
It is fast to share data between threads in a threadgroup thanks to shared memory, but it is even faster to share data between threads in a subgroup, because they don't have to go through shared memory at all.
The goal is to contain as much sharing and computation as possible within these SIMD32 blocks, and subgroup operations are a great way to do that.
Subgroup operations reduce runtime and power consumption, which can have a critical impact on exploratory data analysis, model fine-tuning, and edge inference applications. Beyond that, subgroups also give algorithms an intuitive mapping to the hardware, because GPUs have no atomics or advisable locking mechanism for floating-point numbers, at least not one that will get exposed in WebGPU.
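To make this concrete, here is a minimal WGSL sketch of a workgroup sum reduction that keeps most of the work inside subgroups instead of shared memory or float atomics. It assumes the `enable subgroups;` directive and built-in names as they appear in the WGSL subgroups extension; the buffer names, the workgroup size of 128, and the assumption that lanes map linearly onto local invocation indices are illustrative, not guaranteed:

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> input : array<f32>;
@group(0) @binding(1) var<storage, read_write> partial_sums : array<f32>;

// One scratch slot per subgroup; 32 covers a workgroup of 128 even if
// the subgroup size is as small as 4.
var<workgroup> scratch : array<f32, 32>;

@compute @workgroup_size(128)
fn reduce(@builtin(global_invocation_id) gid : vec3<u32>,
          @builtin(local_invocation_index) lid : u32,
          @builtin(subgroup_invocation_id) sg_id : u32,
          @builtin(subgroup_size) sg_size : u32,
          @builtin(workgroup_id) wg : vec3<u32>) {
  // Each subgroup reduces its lanes without touching shared memory.
  let sum = subgroupAdd(input[gid.x]);
  // Only one lane per subgroup spills to shared memory.
  // ASSUMPTION: lanes are packed linearly, so lid / sg_size identifies
  // the subgroup; WGSL does not guarantee this mapping.
  if (sg_id == 0u) {
    scratch[lid / sg_size] = sum;
  }
  workgroupBarrier();
  // The first few lanes combine the per-subgroup partials, assuming
  // the number of subgroups fits within a single subgroup.
  let num_subgroups = 128u / sg_size;
  if (lid < num_subgroups) {
    let total = subgroupAdd(scratch[lid]);
    if (lid == 0u) {
      partial_sums[wg.x] = total;
    }
  }
}
```

Because there are no float atomics to accumulate into, each workgroup writes one partial sum and a second, smaller pass would finish the reduction.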
Subgroup operations in WebGPU Shading Language are compute-stage only, operate across active threads only, and have a non-uniform execution model.
Active threads means that, in case of divergence inside a subgroup, these operations will only execute across the threads that make it to them together.
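Here is a small sketch of what that means in practice; the buffer name is illustrative:

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_invocation_id) sg_id : u32) {
  if (sg_id % 2u == 0u) {
    // Only the even lanes reach this call, so the reduction runs over
    // those active lanes only: with 1.0 contributed per lane and a
    // subgroup size of 32, this stores 16.0, not 32.0.
    out[gid.x] = subgroupAdd(1.0);
  }
}
```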
Let's get started with the basic operations: subgroup_size gives us the number of threads in a subgroup, subgroup_invocation_id gives us the index of our thread inside the subgroup, and subgroupIsFirst gives us whether we are the first invocation among the active threads.
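A minimal sketch of these three basics; note that the name for the "is first" check in the shipped WGSL spec is subgroupElect, and the buffer names are illustrative:

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read_write> lane_ids : array<u32>;
@group(0) @binding(1) var<storage, read_write> widths : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>,
        @builtin(subgroup_size) sg_size : u32,
        @builtin(subgroup_invocation_id) sg_id : u32) {
  lane_ids[gid.x] = sg_id;   // this thread's index within its subgroup
  widths[gid.x] = sg_size;   // how many threads the subgroup holds
  // subgroupElect() (the shipped name for what the talk calls
  // subgroupIsFirst) returns true on the first active invocation only.
  if (subgroupElect()) {
    // e.g. do once-per-subgroup work here
  }
}
```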
subgroupAll returns true to all invocations if every active invocation provided true, and subgroupAny returns true if at least one active invocation provided true.
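For example, a subgroup-wide range check might look like this sketch (buffer names illustrative):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> values : array<f32>;
@group(0) @binding(1) var<storage, read_write> flags : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let in_range = values[gid.x] < 1.0;
  // True in every active lane only if every active lane passed true.
  let all_in_range = subgroupAll(in_range);
  // True in every active lane if at least one active lane passed true.
  let any_in_range = subgroupAny(in_range);
  flags[gid.x] = u32(all_in_range) | (u32(any_in_range) << 1u);
}
```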
Arithmetic operations essentially provide you with reductions across invocations: addition, multiplication, minimum, maximum, bitwise "and", "or", and "exclusive or". It is important to note that these operations take place across the active threads, and besides scalar numerical values, they can take vectors too.
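A sketch of the reduction functions with vector operands, assuming the shipped names subgroupAdd, subgroupMin, and subgroupMax (buffer names illustrative):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> data : array<vec4<f32>>;
@group(0) @binding(1) var<storage, read_write> results : array<vec4<f32>>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let v = data[gid.x];
  // Component-wise reductions across the active lanes of the subgroup;
  // vectors are accepted alongside scalars. The bitwise variants
  // (subgroupAnd, subgroupOr, subgroupXor) take integer types instead.
  let sums = subgroupAdd(v);
  let mins = subgroupMin(v);
  let maxs = subgroupMax(v);
  // Every active lane sees the same reduced values.
  results[gid.x] = (maxs - mins) + sums;
}
```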
Prefix operations provide each invocation with the sum or product of the values from invocations with an index less than its own.
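A classic use is computing per-lane output offsets; here is a minimal sketch assuming the shipped names subgroupExclusiveAdd and subgroupInclusiveAdd (buffer names illustrative):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> counts : array<u32>;
@group(0) @binding(1) var<storage, read_write> offsets : array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let n = counts[gid.x];
  // Exclusive prefix sum: each lane gets the sum of the values from
  // lanes with a lower subgroup_invocation_id; lane 0 gets 0.
  offsets[gid.x] = subgroupExclusiveAdd(n);
  // subgroupInclusiveAdd(n) would include this lane's own value too.
}
```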
subgroupBallot returns a bitfield in which a bit is one if the corresponding invocation provided true to the ballot, and subgroupBroadcastFirst broadcasts the value in the first active lane to the rest of the invocations.
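A minimal sketch of both (buffer names illustrative):

```wgsl
enable subgroups;

@group(0) @binding(0) var<storage, read> values : array<f32>;
@group(0) @binding(1) var<storage, read_write> out : array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  // subgroupBallot packs one bit per lane (up to 128 lanes) into a
  // vec4<u32>; bit i is 1 if lane i passed true.
  let mask : vec4<u32> = subgroupBallot(values[gid.x] > 0.0);
  // subgroupBroadcastFirst copies the first active lane's value to
  // every active lane.
  let pivot = subgroupBroadcastFirst(values[gid.x]);
  // Count the positive lanes among the first 32 as an example use.
  out[gid.x] = pivot + f32(countOneBits(mask.x));
}
```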
On desktop, subgroup operations are available on essentially all of the hardware WebGPU targets, and on mobile, most of the next-generation chips support them.
Technically, subgroup operations could make it into the MVP as a good addition to the WebGPU Shading Language standard library, because the concerns raised mostly fall out of scope for this PR and are not blockers for adoption as is.
Thanks! You can check the PR itself to see more, and reach out to me for anything related to subgroups.
Have a nice and healthy day!