 # parallel 'red'uction: magic & mystery evaluating performance of 'red' instructio

Hi!

after cbuchner1 has posted the inline assembly topic: http://forums.nvidia.com/index.php?showtopic=151666

I cannot resist the temptation to check what the ‘red’ instruction actually does.

note that, it is not available through intrinsics, so this is where inline assembly comes at hand ))

I wrote a program that performs the fastest prefix sum I know and compared it with

parallel reduction using ‘red’, the code is below:

``````#define WS 32

#define HF 16

__global__ void test_red_kernel(limb *g_R, const limb *g_U) {

extern __shared__ limb shared[];

unsigned thid = threadIdx.x, bidx_x = blockIdx.x;

limb *r = shared;

unsigned ofs = bidx_x << 7, thid_in_warp = thid & WS-1; // 128 elements per block

limb a = (g_U + ofs)[thid];

#if 0 // prefix sum

volatile limb *t = (volatile limb *)r + HF + __umul24(thid >> 5,

WS + HF + 1) + thid_in_warp;

t[-HF] = 0,  t = a; // first run 4 warp-sized reductions

a = a + t[-HF], t = a;

a = a + t[-8], t = a;

a = a + t[-4], t = a;

a = a + t[-2], t = a;

a = a + t[-1], t = a;

CU_SYNC

volatile limb *t2 = r + HF + __umul24(WS*4 >> 5, WS + HF + 1);

// reduce the rest elements

if(thid < 4) {

unsigned loc_ofs = HF + WS-1 + __umul24(thid, WS + HF + 1);

volatile limb *ps = t2 + thid;

ps[-2] = 0; // put identity elements

limb a2 = r[loc_ofs]; ps = a2;

a2 = a2 + ps[-2], ps = a2;

a2 = a2 + ps[-1], ps = a2;

}

CU_SYNC

a = a + t2[(thid >> 5) - 1]; // update prefix sum

#else // using native 'red' instruction

// well, this is a hack but compiles correctly..

asm volatile("mov.u32 %r11, shared;" : );

"+r"(a) : );

CU_SYNC

a = r;

#endif

(g_R + ofs)[thid] = a;

}
``````

note that, ‘red.shared’ does not compute the full prefix sum, only the highest value,

but this is not the point…

I ran the program on GTX280 with a grid of size 65535x1 and 128 threads per block.

Surprisingly, the ‘red’ instruction seems to be slower than the usual prefix sum:

my code: 12.6 ms

using ‘red’: 20.3 ms

so, either I use ‘red’ in a wrong way or NVIDIA reserved it for future GPUs

while currently it is not “hardware-accelerated”

I wonder, has anyone ever tried to use ‘red’ for parallel reductions ?