Could you briefly explain this source code, please? I have some doubts.
int v = vin[threadIdx.x];
v = max(v, __shfl_xor(v, 16)); // pair with lane laneId ^ 16
v = max(v, __shfl_xor(v, 8));  // then laneId ^ 8
v = max(v, __shfl_xor(v, 4));  // then laneId ^ 4
v = max(v, __shfl_xor(v, 2));  // then laneId ^ 2
v = max(v, __shfl_xor(v, 1));  // after this, every lane holds the warp max
The attached diagram illustrates the lane “pairings” performed by the sequence of five SHFLs. The top row is lane 0 and the bottom is lane 31.
Each lane performs a MAX against a lane that is +/-16 away from itself, then 8, 4, 2, 1.
The first MAX(laneId, laneId^16) leaves each pair of lanes 16 apart holding identical values, so lanes 0-15 and 16-31 mirror each other. SHFL-MAX steps 2-5 keep halving the number of distinct candidates until there can be only one.
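The butterfly pattern is easy to check on the host with a small simulation (plain C++, not actual CUDA; reading the partner lane from an array stands in for __shfl_xor):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Host-side simulation of the 5-step SHFL-XOR max butterfly.
// v[lane ^ mask] plays the role of __shfl_xor(v, mask).
int warp_max_butterfly(std::array<int, 32> v) {
    for (int mask = 16; mask >= 1; mask >>= 1) {
        std::array<int, 32> next = v;
        for (int lane = 0; lane < 32; ++lane)
            next[lane] = std::max(v[lane], v[lane ^ mask]); // partner = lane XOR mask
        v = next;
    }
    // After the 5 steps, every lane holds the warp-wide maximum.
    return v[0];
}
```

Whatever lane the maximum starts in, the five XOR steps flip every address bit once, so the value reaches all 32 lanes.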
Is it the most energy-efficient solution? Probably not. Does it use the fewest instructions? Probably.
One of the “wins” is that you’re executing fewer instructions.
As an example, I wrote some warp scan routines a while ago: an inclusive plus-scan implemented with shared memory is ~32 SASS instructions, vs. only ~10 SASS instructions for the SHFL implementation (YMMV).
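For reference, the SHFL-based inclusive plus-scan uses the same log2(32) = 5-step idea (a Kogge-Stone scan). A host-side sketch, with the partner read standing in for __shfl_up (real device code would use the intrinsic plus a lane-validity check):

```cpp
#include <array>

// Host simulation of a warp inclusive plus-scan in 5 doubling steps.
// v[lane - delta] plays the role of __shfl_up(v, delta).
std::array<int, 32> warp_inclusive_scan(std::array<int, 32> v) {
    for (int delta = 1; delta < 32; delta <<= 1) {
        std::array<int, 32> next = v;
        for (int lane = 0; lane < 32; ++lane)
            if (lane >= delta)                          // lanes below delta keep their value
                next[lane] = v[lane] + v[lane - delta]; // add the value delta lanes up
        v = next;
    }
    return v; // v[i] = sum of inputs 0..i
}
```

Five shuffle/add pairs is where the ~10-instruction count comes from.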
I never benchmarked these implementations but I’m guessing the sm_3x implementation wins. :)
And then store each warp’s max in shared memory (it doesn’t need to be volatile), which can be reduced in a final step (after every thread has called the device function, the per-warp results have been stored in shared memory, and a single __syncthreads() has executed).
So if threads == 64, there are two values in shared memory: the two threads with threadIdx.x % 32 == 0 each store their warp’s max.
The last step, performed when threadIdx.x == 0, is to take the max of those two values and store it in global memory.
This seems to work, but maybe I am overlooking something.
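A host-side model of that 64-thread scheme (assumed sketch; block_max_64 is a hypothetical name, and std::max_element stands in for each warp’s SHFL reduction):

```cpp
#include <algorithm>
#include <array>

// Host model of the 64-thread block reduction described above:
// each of the two warps reduces to a single max, lane 0 of each warp
// writes it to "shared memory", and thread 0 combines the two values.
int block_max_64(const std::array<int, 64>& vin) {
    std::array<int, 2> smem{};                 // one slot per warp
    for (int warp = 0; warp < 2; ++warp)       // stands in for the SHFL reduction
        smem[warp] = *std::max_element(vin.begin() + 32 * warp,
                                       vin.begin() + 32 * (warp + 1));
    return std::max(smem[0], smem[1]);         // final step done by thread 0
}
```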
I’m not sure I claim that SHFL is much faster than shared memory. The last slide simply says that it’s faster to implement the algorithms presented during the session using SHFL than safe shared memory code and never slower than unsafe shared memory (warp-synchronous code).
SHFL is a single instruction which can do the same work as two separate instructions: LDS and STS (Load/Store from/to shared memory). But SHFL also has limitations. For example, for 64-bit values you need two SHFLs to do the same work.
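A host-side illustration of the 64-bit case: the value is exchanged as two 32-bit halves, mirroring the two SHFL instructions (sketch only; shfl_xor_64 is a hypothetical helper, and reading the partner lane from an array stands in for the real intrinsic):

```cpp
#include <array>
#include <cstdint>

// Host model: a 64-bit "shuffle" done as two 32-bit exchanges.
// v[lane ^ mask] plays the role of the partner lane's register.
uint64_t shfl_xor_64(const std::array<uint64_t, 32>& v, int lane, int mask) {
    uint64_t partner = v[lane ^ mask];
    uint32_t lo = static_cast<uint32_t>(partner);       // first 32-bit SHFL
    uint32_t hi = static_cast<uint32_t>(partner >> 32); // second 32-bit SHFL
    return (static_cast<uint64_t>(hi) << 32) | lo;      // reassemble the halves
}
```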
However, there are cases where SHFL can make a big difference. For example, when you need more occupancy and you’re limited by the amount of shared memory.
Regarding the implementation of a warp-level reduction to find the max, I strongly recommend using
CUB: http://nvlabs.github.io/cub/. In particular, take a look at the WarpReduce class.