parallel 'red'uction: magic & mystery (evaluating the performance of the 'red' instruction)

Hi!

After cbuchner1 posted the inline assembly topic: http://forums.nvidia.com/index.php?showtopic=151666

I could not resist the temptation to check what the ‘red’ instruction actually does.

Note that it is not available through intrinsics, so this is where inline assembly comes in handy ))
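For comparison, the closest intrinsic alternative I can think of is atomicAdd on a shared-memory location (available on sm_12 and later); it maps to ‘atom.shared.add’ and returns the old value, whereas ‘red’ is a fire-and-forget reduction with no return value. A minimal sketch of that alternative (my own illustration, not part of the benchmark below):

// atomicAdd-based block sum: the intrinsic route, compiles to 'atom.shared.add.u32'
__global__ void atomic_block_sum(unsigned *g_out, const unsigned *g_in) {
    __shared__ unsigned total;
    if (threadIdx.x == 0) total = 0;
    __syncthreads();
    // every thread atomically adds its element into one shared counter
    atomicAdd(&total, g_in[blockIdx.x * blockDim.x + threadIdx.x]);
    __syncthreads();
    if (threadIdx.x == 0) g_out[blockIdx.x] = total;
}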

I wrote a program that performs the fastest prefix sum I know of and compared it with a parallel reduction using ‘red’; the code is below:

#define WS 32   // warp size
#define HF 16   // half-warp

// # of threads per block: 128
__global__ void test_red_kernel(limb *g_R, const limb *g_U) {

	extern __shared__ limb shared[];

	unsigned thid = threadIdx.x, bidx_x = blockIdx.x;
	limb *r = shared;
	unsigned ofs = bidx_x << 7, thid_in_warp = thid & (WS-1); // 128 elements per block
	limb a = (g_U + ofs)[thid];

#if 0 // prefix sum
	// each warp gets a padded section of WS + HF + 1 elements; the HF leading
	// entries hold identity elements so the scan needs no bounds checks
	volatile limb *t = (volatile limb *)r + HF +
			__umul24(thid >> 5, WS + HF + 1) + thid_in_warp;

	t[-HF] = 0, t[0] = a; // step 1: 4 warp-level scans (warp-synchronous, no barriers)
	a = a + t[-HF], t[0] = a;
	a = a + t[-8],  t[0] = a;
	a = a + t[-4],  t[0] = a;
	a = a + t[-2],  t[0] = a;
	a = a + t[-1],  t[0] = a;

	CU_SYNC

	// step 2: scan the 4 per-warp totals in a separate section
	volatile limb *t2 = r + HF + __umul24(WS*4 >> 5, WS + HF + 1);
	if(thid < 4) {
		// last element of each warp's section holds that warp's total
		unsigned loc_ofs = HF + WS-1 + __umul24(thid, WS + HF + 1);
		volatile limb *ps = t2 + thid;
		ps[-2] = 0; // put identity elements
		limb a2 = r[loc_ofs]; ps[0] = a2;
		a2 = a2 + ps[-2], ps[0] = a2;
		a2 = a2 + ps[-1], ps[0] = a2;
	}

	CU_SYNC

	a = a + t2[(thid >> 5) - 1]; // step 3: add the sum of all preceding warps

#else // using the native 'red' instruction

	// well, this is a hack but it compiles correctly: hard-code PTX register %r11
	// to hold the base address of the 'shared' array, then let every thread
	// atomically add its value into shared[0]
	asm volatile("mov.u32 %r11, shared;" : );
	asm volatile("red.shared.add.u32 [%r11], %0;" : "+r"(a) : );

	CU_SYNC

	a = r[0]; // read back the accumulated value (shared[0] is not zeroed here,
	          // which is fine for a pure timing test)
#endif

	(g_R + ofs)[thid] = a;
}

Note that ‘red.shared’ does not compute the full prefix sum, only the final total (the last element of the scan), but this is not the point…
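To make the difference concrete, here is a tiny CPU reference (just an illustration I am adding for clarity, using plain unsigned instead of limb; not part of the benchmark):

// a prefix sum produces one running total per element, while the 'red' path
// accumulates a single total at one shared-memory address
void cpu_reference(const unsigned *in, unsigned *scan, unsigned *total, int n) {
    unsigned acc = 0;
    for (int i = 0; i < n; ++i) {
        acc += in[i];
        scan[i] = acc;   // inclusive prefix sum: what the #if 0 path computes per element
    }
    *total = acc;        // single reduced value: what 'red.shared.add' leaves in r[0]
}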

I ran the program on GTX280 with a grid of size 65535x1 and 128 threads per block.
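For reference, the host side looks roughly like this (a sketch only: the exact host code is not in this post, I assume limb is a 32-bit unsigned type and CU_SYNC is a plain barrier, and the shared-memory size is my generous estimate from the kernel's indexing):

// these two definitions go before the kernel in the actual program (assumptions)
typedef unsigned int limb;
#define CU_SYNC __syncthreads();

const unsigned n_blocks = 65535, n_threads = 128;
const size_t n = (size_t)n_blocks * n_threads;

limb *d_U, *d_R;
cudaMalloc((void**)&d_U, n * sizeof(limb));
cudaMalloc((void**)&d_R, n * sizeof(limb));
// ... fill d_U with input data ...

// shared memory: 4 padded warp sections plus the section for the warp totals
// (an estimate; the exact amount is not given in the post)
const size_t smem = (HF + 5 * (WS + HF + 1)) * sizeof(limb);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
test_red_kernel<<<dim3(n_blocks, 1), n_threads, smem>>>(d_R, d_U);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float ms = 0.f;
cudaEventElapsedTime(&ms, start, stop);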

Surprisingly, the ‘red’ instruction seems to be slower than the usual prefix sum:

my code: 12.6 ms

using ‘red’: 20.3 ms

So either I am using ‘red’ in the wrong way, or NVIDIA reserved it for future GPUs and it is currently not “hardware-accelerated”.

I wonder, has anyone ever tried to use ‘red’ for parallel reductions?