Hi there,
I am here in the office with an intern at my company who is doing some CUDA studies. As a starter tutorial we’ve implemented the parallel reduction.
Just today we’ve applied the complete loop unrolling (as in kernel 6 of the nVidia SDK’s reduction sample).
On our nVidia 8500GT we should have a theoretical memory bandwidth of 12 GB/s. And the nVidia reduction code peaks at 8 point something gigabytes per second in their kernel #7 (two thirds of the theoretical peak).
But our reduction code barely reaches 1.8 GB/sec - not even close to nVidia’s benchmark!
Would anyone spot our mistake? Why do we see a slowdown factor of 4-5 in a kernel that nearly exactly looks like nVidia’s reduction code? We’ve run this through the visual profiler also. Profiler output is attached (renamed from csv to txt) - we’re puzzled by the number of uncoalesced memory writes, we would expect the number to be identical with the grid x size, but it’s not.
We’re calling the code with N_orig = 16777216 elements (a power of two).
It’s probably something stupid that we’ve overlooked.
Help!?! ;-)
Christian
__global__ void
testKernel(float* g_idata, unsigned int N)
{
// Block-wide sum reduction.
//
// Each block reduces N consecutive floats of g_idata (the host passes
// N == 2*blockDim.x) and writes its partial sum to g_idata[blockIdx.x],
// so the host can iterate until one element remains.
//
// Launch requirements:
//   - dynamic shared memory: blockDim.x * sizeof(float)
//   - blockDim.x must be a power of two (the host's ceil-division
//     scheme guarantees this for power-of-two N_orig)
//
// shared memory
// the size is determined by the host application
extern __shared__ float sdata[];

const unsigned int tid  = threadIdx.x;
const unsigned int base = 2 * blockDim.x * blockIdx.x;

// First add during global load: each thread sums one element from each
// half of this block's segment, halving the number of reduction steps.
// (N/2 == blockDim.x for the host's launch configuration.)
sdata[tid] = g_idata[base + tid] + g_idata[base + tid + N / 2];
__syncthreads();

// Tree reduction in shared memory.
//
// NOTE(review): the original hard-coded the strides 256/128/64/...,
// which assumes blockDim.x == 512. The host loop launches smaller
// blocks on later passes (down to blockDim.x == 8 for N_orig = 2^24),
// so SDATA(tid+256) read past the end of the shared allocation --
// undefined behavior and a wrong sum. The original's sub-warp phase
// also relied on implicit warp synchrony without `volatile`, which is
// a data race on current compilers. A stride-halving loop with a
// barrier at every step is correct for any power-of-two block size.
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}

// Thread 0 publishes this block's partial sum (in place, as the host
// expects: pass k+1 reduces the first gridDim.x elements of g_idata).
if (tid == 0) {
    g_idata[blockIdx.x] = sdata[0];
}
}
Here is the critical part of the host side code.
// Time the full multi-pass reduction with a cutil host timer.
cutilCheckError(cutCreateTimer(&timer_kernel));
cutilCheckError(cutStartTimer(timer_kernel));
// Repeatedly reduce in place: each pass shrinks N to the number of
// blocks launched, until a single element (the total sum) remains.
do{
// setup kernel execution parameters
// Each thread handles two input elements, hence ceil(N/2).
num_threads = ceil(N/2.0);
num_blocks = ceil(1.0*num_threads/max_threads_per_block);
threads_per_block = ceil(1.0*num_threads/num_blocks);
// NOTE(review): threads_per_block shrinks on later passes (e.g. down
// to 8 for N_orig = 2^24), but the kernel's unrolled reduction
// hard-codes offsets 256/128/64 -- it reads past the shared-memory
// allocation sized below whenever threads_per_block < 512. Verify the
// kernel handles arbitrary power-of-two block sizes.
grid = dim3(num_blocks, 1, 1);
threads = dim3(threads_per_block, 1, 1);
printf("N=%d, num_threads=%d, num_blocks=%d, threads_per_block=%d, gridDim=%d, blockDim=%d\n",
N, num_threads, num_blocks, threads_per_block, grid.x, threads.x);
// execute the kernel
// Dynamic shared memory: one float per thread; the kernel's N argument
// is the per-block element count (2 elements per thread).
testKernel<<<grid, threads, sizeof(float)*threads_per_block>>>(d_idata, 2*threads_per_block);
// check if kernel execution generated an error
cutilCheckMsg("Kernel execution failed");
// The num_blocks partial sums become the input of the next pass.
N = num_blocks;
} while(N != 1);
// Block until all queued kernels finish so the timer measures real GPU
// work. NOTE(review): the return code is unchecked here, and
// cudaThreadSynchronize() is the deprecated spelling of
// cudaDeviceSynchronize() -- consider wrapping it in an error check.
cudaThreadSynchronize();
cutilCheckError(cutStopTimer(timer_kernel));
// Effective bandwidth: counts only the first pass's input volume
// (N_orig floats read once), the usual "useful bandwidth" convention
// for reductions; time_kernel is in ms, hence the *1000.
float bw_kernel = N_orig*sizeof(float)/time_kernel*1000;
printf("Parallel CUDA found sum to be %f in %f ms (BW=%f MB/s)\n", cuda_max, time_kernel, bw_kernel/1024/1024);
Christian
profiler_output.txt (952 Bytes)