Emulator Mode and Warps

I’ve been writing a reduction sum kernel for my code, using the SDK as a guide. In debugging other portions of the code, I’ve encountered an odd problem. Per the advice of the SDK, the core of my reduction sum is

[codebox]for( int d=blockDim.x/2; d>32; d>>=1 ) {
    if( tx < d ) {
        sumSharedMem[tx] += sumSharedMem[d+tx];
    }
    __syncthreads();
}

if( tx < 32 ) {
    sumSharedMem[tx] += sumSharedMem[tx+32];
    sumSharedMem[tx] += sumSharedMem[tx+16];
    sumSharedMem[tx] += sumSharedMem[tx+8];
    sumSharedMem[tx] += sumSharedMem[tx+4];
    sumSharedMem[tx] += sumSharedMem[tx+2];
    sumSharedMem[tx] += sumSharedMem[tx+1];
}[/codebox]

This works fine on the GPU. However, on the CPU in emulation mode, it gives incorrect answers. If I re-roll the final loop iterations, reverting to

[codebox]for( int d=blockDim.x/2; d>0; d>>=1 ) {
    if( tx < d ) {
        sumSharedMem[tx] += sumSharedMem[d+tx];
    }
    __syncthreads();
}[/codebox]

the code runs fine in emulation mode too.

The unrolled loop has better performance (I know, I could just drop in the SDK code for even more optimisations…), but I’m going to need to run in emulation sometimes to debug other parts of the code. Is the emulator accurately reproducing the behaviour of the GPU, or is this one of those little corners where it fails? My temporary workaround is an [font=“Courier New”]#ifdef __DEVICE_EMULATION__[/font], but I’d like to know what’s going on here. This is on CUDA 2.0, under Fedora 8.
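
In case it matters, this is a sketch of the kind of guard I mean (only the tail changes; the d>32 part of the loop above it is shared by both builds):

[codebox]#ifdef __DEVICE_EMULATION__
    // -deviceemu build: keep a barrier after every step so the emulator gets the right answer
    for( int d=32; d>0; d>>=1 ) {
        if( tx < d )
            sumSharedMem[tx] += sumSharedMem[d+tx];
        __syncthreads();
    }
#else
    // hardware build: unrolled tail with no barriers, as above
    if( tx < 32 ) {
        sumSharedMem[tx] += sumSharedMem[tx+32];
        sumSharedMem[tx] += sumSharedMem[tx+16];
        sumSharedMem[tx] += sumSharedMem[tx+8];
        sumSharedMem[tx] += sumSharedMem[tx+4];
        sumSharedMem[tx] += sumSharedMem[tx+2];
        sumSharedMem[tx] += sumSharedMem[tx+1];
    }
#endif[/codebox]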

Your code works on hardware because the threads of a warp are forced to run in lockstep, so the final 32 threads stay implicitly synchronised. To get it to work in emulation, you need a whole bunch more __syncthreads().
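
Concretely, to make the unrolled tail give the right answer in the emulator you would need a barrier after every step, with the guard shrinking each time, which is really just the rolled loop written out by hand (a sketch using the names from your post):

[codebox]if( tx < 32 ) sumSharedMem[tx] += sumSharedMem[tx+32];
__syncthreads();
if( tx < 16 ) sumSharedMem[tx] += sumSharedMem[tx+16];
__syncthreads();
if( tx < 8 )  sumSharedMem[tx] += sumSharedMem[tx+8];
__syncthreads();
if( tx < 4 )  sumSharedMem[tx] += sumSharedMem[tx+4];
__syncthreads();
if( tx < 2 )  sumSharedMem[tx] += sumSharedMem[tx+2];
__syncthreads();
if( tx < 1 )  sumSharedMem[tx] += sumSharedMem[tx+1];[/codebox]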

Try using the ‘warpSize’ built-in variable, which will == 1 in emulation (and will let your prog work right on any future arch where warp size is different).

Is there a way of doing that loop-unroll based on warpSize? I like the idea of using it, but how would one write something to cope with a general (power-of-two) warp size?

Thanks for the info about the emulator warpSize… I was assuming that some magic was being done to reproduce the GPU hardware more closely.

No. There is “#pragma unroll”, but since warpSize is not known at compile time, it won’t work here. That’s OK, though, since the overhead of a for() loop isn’t as bad as that of the __syncthreads().
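
To illustrate the point (a hypothetical snippet, with guards and syncs left out, since this is only about the pragma): the unroller needs a trip count it can see at compile time, so the first loop below can be unrolled but the second stays a loop:

[codebox]#pragma unroll
for( int d=32; d>0; d>>=1 )        // literal start value, trip count known at compile time
    sumSharedMem[tx] += sumSharedMem[tx+d];

#pragma unroll
for( int d=warpSize; d>0; d>>=1 )  // warpSize is only known at run time, so no full unroll
    sumSharedMem[tx] += sumSharedMem[tx+d];[/codebox]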

Easy. Just split it into two for loops, as you have done.

[codebox]int d;

for( d=blockDim.x/2; d>warpSize; d>>=1 )
{
    if( tx < d )
        sumSharedMem[tx] += sumSharedMem[tx+d];
    __syncthreads();
}

if( tx < warpSize )
    for( /*d=warpSize*/; d>0; d>>=1 )
    {
        sumSharedMem[tx] += sumSharedMem[tx+d];
    }[/codebox]

Good catch moving the if() out, btw. Also, I have a hunch this will be faster:

[codebox]int d;

for( d=blockDim.x/2; d>warpSize; d>>=1 )
{
    if( tx >= d )
        return; // or break or goto?

    sumSharedMem[tx] += sumSharedMem[tx+d];
    __syncthreads();
}

if( tx < warpSize )
    for( /*d=warpSize*/; d>0; d>>=1 )
    {
        sumSharedMem[tx] += sumSharedMem[tx+d];
    }[/codebox]

Ha. Nvidia seems to hardly care about that.

You don’t know the warp size at compile time, BUT like the reduction example in the SDK, you can use a #if test to check for the emulator. Just conditionally compile the extra __syncthreads() (or disable the unrolling) based on whether it’s the emulator or not.

Again, the reduction code example shows this so you can just copy its method.
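
From memory, the SDK sample does it with a macro so the extra syncs disappear in the hardware build; something along these lines (check reduction_kernel.cu for the exact name and layout, this is only a sketch in terms of your variable names):

[codebox]#ifdef __DEVICE_EMULATION__
#define EMUSYNC __syncthreads()
#else
#define EMUSYNC
#endif

// the syncs sit outside the guards, so every thread reaches them in emulation
if( tx < 32 ) sumSharedMem[tx] += sumSharedMem[tx+32];
EMUSYNC;
if( tx < 16 ) sumSharedMem[tx] += sumSharedMem[tx+16];
EMUSYNC;
if( tx < 8 )  sumSharedMem[tx] += sumSharedMem[tx+8];
EMUSYNC;
if( tx < 4 )  sumSharedMem[tx] += sumSharedMem[tx+4];
EMUSYNC;
if( tx < 2 )  sumSharedMem[tx] += sumSharedMem[tx+2];
EMUSYNC;
if( tx < 1 )  sumSharedMem[tx] += sumSharedMem[tx+1];[/codebox]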