SHFL function in Kepler

I need to extract the maximum value from each warp; which function is the most suitable for this case?

For example, in the following fragment of code, why does the loop progress in powers of 2 with __shfl_up? (It computes the sum of the warp's contents.)

for (int i = 1; i <= warp_size; i *= 2)
{
    int n = __shfl_up(value, i, warp_size);

    if (lane_id >= i) value += n;
}

Many thanks

Something like this should work:

v = max(v,__shfl_xor(v,16));
v = max(v,__shfl_xor(v, 8));
v = max(v,__shfl_xor(v, 4));
v = max(v,__shfl_xor(v, 2));
v = max(v,__shfl_xor(v, 1));

This will leave the max value for the warp in every lane.

(Gist here)

Could you briefly explain the source code, please? I have some doubts.

int v = vin[threadIdx.x];

v = max(v, __shfl_xor(v, 16));
v = max(v, __shfl_xor(v, 8));
v = max(v, __shfl_xor(v, 4));
v = max(v, __shfl_xor(v, 2));
v = max(v, __shfl_xor(v, 1));

Many thanks

Why is laneMask in __shfl_xor a static value? Thanks.

The attached diagram illustrates the lane “pairings” performed by the five-SHFL sequence. The top row is lane 0 and the bottom is lane 31.

Each lane performs a MAX against a lane that is +/-16 away from itself, then 8, 4, 2, 1.

The first MAX(laneId,laneId^16) results in lanes 0-15 and 16-31 respectively holding identical values. SHFL-MAX steps 2-5 narrow down the number of MAX candidates until there can be only one.

Is it the most energy-efficient solution? Probably not. Is it the least number of instructions? Probably.

This is an interesting thread, and I wanted to BUMP it since I was testing the __shfl() in some reduction code.

Overall I have not noticed a large speed increase using __shfl() when compared to the standard shared memory techniques.

Nvidia here claims it is much faster:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3174-Kepler-Shuffle-Tips-Tricks.pdf

but provided few examples other than the often-seen sum reduction.

The only info available via Google is allanmac’s code, which is the most useful I have found.

Does anyone out there have more snippets of code where they use some of these newer Kepler features?

One of the “wins” is that you’re executing fewer instructions.

As an example, I wrote some warp scan routines a while ago and an inclusive-plus-scan implemented using shared memory is ~32 SASS instructions vs. only ~10 SASS instructions in the SHFL implementation (YMMV).

I never benchmarked these implementations but I’m guessing the sm_3x implementation wins. :)

Some more links on SHFL:

  • Experiments with SHFL — describes SHFL "rotations"
  • Github code for a basic warp scan
  • A C macro for declaring optimal inclusive/exclusive warp scans
  • I ❤ SHFL.

    Thanks Allanmac!

    While I am not yet 100% sure what is going on in the asm portions of the code, I am going to use it as a reference.

    The only way to really learn how to use these newer features is by trial and error.

    Assuming you have threads>32 and need to get a max for a thread block, is there any reason not to do this:

    __device__ float warp_max_registers(float myVal) {

        int warpIndex = threadIdx.x % 32;

        myVal = max(myVal, __shfl(myVal, warpIndex + 16));
        myVal = max(myVal, __shfl(myVal, warpIndex + 8));
        myVal = max(myVal, __shfl(myVal, warpIndex + 4));
        myVal = max(myVal, __shfl(myVal, warpIndex + 2));
        myVal = max(myVal, __shfl(myVal, warpIndex + 1));

        return myVal;
    }
    

    And then store each warp’s max in shared (not volatile) memory, which can then be scanned as a last step (after each thread has called the device function, stored the warp’s value in shared memory, and hit a single __syncthreads() statement).

    So if threads == 64, then there are two values in shared memory. When threadIdx.x % 32 == 0, those two threads store the warp max in shared memory.
    The last step, when threadIdx.x == 0, would be to take the max of those two values and then cache it in global memory.

    This seems to work, but maybe I am overlooking something.

    Sounds good to me.

    Ditto…

    Thanks

    I’m not sure I claim that SHFL is much faster than shared memory. The last slide simply says that it’s faster to implement the algorithms presented during the session using SHFL than safe shared memory code and never slower than unsafe shared memory (warp-synchronous code).

    SHFL is a single instruction that can do the same work as two separate instructions: LDS and STS (load/store from/to shared memory). But SHFL also has limitations. For example, for 64-bit values you need two SHFLs to do the same work.

    However, there are cases where SHFL can make a big difference. For example, when you need more occupancy and you’re limited by the amount of shared memory.

    Regarding the implementation of a warp-level reduction to find the max, I strongly recommend using
    CUB: http://nvlabs.github.io/cub/. In particular, take a look at the WarpReduce class.