Parallel reduction not as fast as nVidia's no idea why - can anyone figure this one out?

Hi there,

I am here in the office with an intern at my company who is doing some CUDA studies. As a starter tutorial we’ve implemented the parallel reduction.

Just today we’ve applied the complete loop unrolling (as in kernel 6 of the nVidia SDK’s reduction sample).

On our nVidia 8500GT we should have a theoretical memory bandwidth of 12 GB/s. And the nVidia reduction code peaks at 8 point something Gigabytes in their kernel #7 (two thirds of the theoretical peak).

But our reduction code barely reaches 1.8 GB/sec - not even close to nVidia’s benchmark!

Would anyone spot our mistake? Why do we see a slowdown factor of 4-5 in a kernel that nearly exactly looks like nVidia’s reduction code? We’ve run this through the visual profiler also. Profiler output is attached (renamed from csv to txt) - we’re puzzled by the number of uncoalesced memory writes, we would expect the number to be identical with the grid x size, but it’s not.

We’re calling the code with N_orig of 16777216 elements (a power of two)

It’s probably something stupid that we’ve overlooked.

Help!?! ;-)


__global__ void

testKernel(float* g_idata, unsigned int N)


	// shared memory

	// the size is determined by the host application

	extern  __shared__  float sdata[];

	SDATA(threadIdx.x) = g_idata[2*blockDim.x*blockIdx.x + threadIdx.x] + g_idata[2*blockDim.x*blockIdx.x + threadIdx.x + N/2];


	N = N/2;

	if (threadIdx.x < 256)

		SDATA(threadIdx.x) += SDATA(threadIdx.x+256);


	if (threadIdx.x < 128)

		SDATA(threadIdx.x) += SDATA(threadIdx.x+128);


	if (threadIdx.x < 64)

		SDATA(threadIdx.x) += SDATA(threadIdx.x+64);


	if (threadIdx.x < 32)


		SDATA(threadIdx.x) += SDATA(threadIdx.x+32);

		SDATA(threadIdx.x) += SDATA(threadIdx.x+16);

		SDATA(threadIdx.x) += SDATA(threadIdx.x+ 8);

		SDATA(threadIdx.x) += SDATA(threadIdx.x+ 4);

		SDATA(threadIdx.x) += SDATA(threadIdx.x+ 2);

		SDATA(threadIdx.x) += SDATA(threadIdx.x+ 1);


	if(threadIdx.x == 0) {

//		 printf("gtid:%d.%d block_max=%f\n", blockIdx.x, threadIdx.x, SDATA(0));

		g_idata[blockIdx.x] = SDATA(0);



Here is the critical part of the host side code.




		// setup kernel execution parameters

		num_threads = ceil(N/2.0);

		num_blocks = ceil(1.0*num_threads/max_threads_per_block);

		threads_per_block = ceil(1.0*num_threads/num_blocks);

		grid = dim3(num_blocks, 1, 1);

		threads = dim3(threads_per_block, 1, 1);

		printf("N=%d, num_threads=%d, num_blocks=%d, threads_per_block=%d, gridDim=%d, blockDim=%d\n",

				N, num_threads, num_blocks, threads_per_block, grid.x, threads.x);

		// execute the kernel

		testKernel<<<grid, threads, sizeof(float)*threads_per_block>>>(d_idata, 2*threads_per_block);

		// check if kernel execution generated and error

		cutilCheckMsg("Kernel execution failed");

		N = num_blocks;

	} while(N != 1);



	float bw_kernel = N_orig*sizeof(float)/time_kernel*1000;

	printf("Parallel CUDA found sum to be %f in %f ms (BW=%f MB/s)\n", cuda_max, time_kernel, bw_kernel/1024/1024);

profiler_output.txt (952 Bytes)

You didn’t post all your code so it’s hard to tell what’s going on.

But your kernel is computing the reduction of only 1024 values per block. (You read 2 global values per thread, and reduce the 512 sums down to one.)

This is computationally inefficient… it means you’re computing 2^24/2/512 = 2^14 reductions with 2^14 blocks.

If you changed your first few lines of your kernel to do a linear accumulation over a (much) larger swath of global memory you could reduce the kernels work.
Right now each thread loads and sums two values from global memory. Change that to perhaps 256. Then you’ll only need 2^24/256/512 = 128 blocks and 128 reductions. Global memory bandwidth will be the same, but FLOPS will be cut down by two orders of magnitude.

If you use too few blocks, you’ll start getting idle SM losses, so 128 blocks is probably a decent value. 256 or 512 blocks might be better for a beefier card like a GTX285.

I belive this is what Mark Harris discusses in the final optimization step in his “Reduction” white waper (between kernel #6 and #7). However it only adds 42% more performance. So this step would not explain the huge gap we’re seing.

Also I find it a little hard to understand why adding that much “serialism” to an otherwhise extremely parallel algorithm can add that much benefit.

You’re right - we should have added the full host code, so a few people could have tried to run this puppy.