Hi!
After cbuchner1 posted the inline assembly topic: http://forums.nvidia.com/index.php?showtopic=151666
I could not resist the temptation to check what the ‘red’ instruction actually does.
Note that it is not available through intrinsics, so this is where inline assembly comes in handy ))
I wrote a program that performs the fastest prefix sum I know of and compared it with
a parallel reduction using ‘red’; the code is below:
#define WS 32
#define HF 16
// not shown in the original: 'limb' is assumed to be a 32-bit word
// and CU_SYNC a barrier macro
typedef unsigned int limb;
#define CU_SYNC __syncthreads();
// # of threads: 128
__global__ void test_red_kernel(limb *g_R, const limb *g_U) {
extern __shared__ limb shared[];
unsigned thid = threadIdx.x, bidx_x = blockIdx.x;
limb *r = shared;
    unsigned ofs = bidx_x << 7, thid_in_warp = thid & (WS-1); // 128 elements per block
limb a = (g_U + ofs)[thid];
#if 0 // prefix sum
volatile limb *t = (volatile limb *)r + HF + __umul24(thid >> 5,
WS + HF + 1) + thid_in_warp;
t[-HF] = 0, t[0] = a; // first run 4 warp-sized reductions
a = a + t[-HF], t[0] = a;
a = a + t[-8], t[0] = a;
a = a + t[-4], t[0] = a;
a = a + t[-2], t[0] = a;
a = a + t[-1], t[0] = a;
CU_SYNC
volatile limb *t2 = r + HF + __umul24(WS*4 >> 5, WS + HF + 1);
// scan the four per-warp totals
if(thid < 4) {
unsigned loc_ofs = HF + WS-1 + __umul24(thid, WS + HF + 1);
volatile limb *ps = t2 + thid;
ps[-2] = 0; // put identity elements
limb a2 = r[loc_ofs]; ps[0] = a2;
a2 = a2 + ps[-2], ps[0] = a2;
a2 = a2 + ps[-1], ps[0] = a2;
}
CU_SYNC
a = a + t2[(thid >> 5) - 1]; // update prefix sum
#else // using native 'red' instruction
if(thid == 0)
    r[0] = 0; // 'red' accumulates into r[0], so clear it first
CU_SYNC
// well, this is a hack (the register %r11 is hard-coded) but it compiles correctly..
asm volatile("mov.u32 %r11, shared;" : );
// 'red' only reads 'a', so it goes in the input operand list
asm volatile("red.shared.add.u32 [%r11], %0;" : : "r"(a));
CU_SYNC
a = r[0];
#endif
(g_R + ofs)[thid] = a;
}
note that ‘red.shared’ does not compute the full prefix sum, only the total of all elements,
but this is not the point…
I ran the program on a GTX280 with a grid of size 65535x1 and 128 threads per block.
Surprisingly, the ‘red’ instruction turns out to be slower than the usual prefix sum:
my code: 12.6 ms
using ‘red’: 20.3 ms
so, either I am using ‘red’ in the wrong way, or NVIDIA reserved it for future GPUs
while currently it is not “hardware-accelerated”.
I wonder, has anyone ever tried to use ‘red’ for parallel reductions?