Summing threads

Hi everyone,

I am trying to optimize my CUDA code. I timed each portion of my kernel and found that the main bottleneck is the per-block sum across threads (the reduction).

The code takes the values in the shared-memory array temp_X_plus and sums them into temp_X_plus[0]. It is the standard reduction code found in CUDA by Example.

Are there faster algorithms or better configurations out there? Right now this section accounts for 70% of my running time, so I’d appreciate any tips.

////////////////////////////////////////////////////////////////
// Sum Threads
////////////////////////////////////////////////////////////////
__shared__ double temp[threadsPerBlock];
// (temp[] is filled and __syncthreads() is called before this point)
int i = threadsPerBlock / 2;  // start with half the threads in the block
while (i != 0) {
	if (tid < i) {
		temp[tid] += temp[tid + i];
	}
	__syncthreads();
	i /= 2;
}

If this is taking up 70% of the execution time of your kernel, the rest must be pretty minimal. In that case, have each thread of the kernel perform several of those calculations and sum them trivially in a register before doing the final reduction through shared memory.
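Something along these lines (a minimal sketch of the idea; calculateValue() and ITEMS_PER_THREAD are hypothetical placeholders, not names from your code):

// Launch with 64 threads per block instead of 512: each thread computes
// 8 of the 512 values and sums them in a register first, so the shared-
// memory reduction starts from 64 partial sums instead of 512.
// calculateValue() and ITEMS_PER_THREAD are hypothetical placeholders.
const int ITEMS_PER_THREAD = 8;

double mySum = 0.0;
for (int j = 0; j < ITEMS_PER_THREAD; j++)
	mySum += calculateValue(tid + j * blockDim.x);  // register accumulation

__shared__ double temp[64];
temp[tid] = mySum;
__syncthreads();
// ...then the usual tree reduction over temp[0..63]...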

How did you measure execution time to conclude that this code takes up 70% of it? If you just commented the reduction out, it is quite likely that the compiler also removed the code that calculates the contents of temp, since it is no longer needed. That would make the reduction appear to take all the time, while in reality it is spent calculating temp.
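As a side note, a reliable way to get whole-kernel timings is CUDA events (a minimal sketch; the kernel name and launch configuration are placeholders). Per-section numbers are safer taken from a profiler, since commenting code out invites exactly the dead-code elimination described above.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
KernelCalculateSomething<<<80, 512>>>(/* parameters */);  // placeholder launch
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);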

Hi Tera,

I left the calculation of temp in place but commented out the sum. You are right that the rest of the code is minimal; I have cut the latency step by step by pre-computing all my calculations into lookup tables.

Thanks for the suggestion on the speed-up. I tried removing the last few iterations (i = 2, i = 4, etc.) and there is a noticeable difference, as expected.
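For reference, the usual way to drop those last iterations entirely is to unroll the final warp once the stride fits within 32 threads (a sketch, not my exact code; on pre-Volta GPUs the classic version relies on warps running in lockstep and omits the __syncwarp() calls, which Volta and later require):

int i = blockDim.x / 2;
while (i > 32) {                      // tree reduction down to one warp
	if (tid < i)
		temp[tid] += temp[tid + i];
	__syncthreads();
	i /= 2;
}
if (tid < 32) {                       // final warp, fully unrolled
	volatile double *v = temp;        // volatile keeps values out of registers
	v[tid] += v[tid + 32]; __syncwarp();
	v[tid] += v[tid + 16]; __syncwarp();
	v[tid] += v[tid +  8]; __syncwarp();
	v[tid] += v[tid +  4]; __syncwarp();
	v[tid] += v[tid +  2]; __syncwarp();
	v[tid] += v[tid +  1];
}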

I want to follow up with my results in case anyone is interested.

Here’s more detail about what I’m doing:

My pseudocode looks like:

KernelCalculateSomething(Parameters) {

    Declarations;

    __shared__ double temp1[threadsPerBlock], temp2[threadsPerBlock];

    for (i = 0 to 500) {
        assign values to temp1, temp2 based on tid, block id, and i;
        sum temp1[], temp2[];
        assign sumTemp1, sumTemp2 to global arrays for further processing
            (global array length is # of blocks * 500);
    }
}

Currently I launch 80 blocks with 512 threads in each block.
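To make that concrete, here is a minimal compilable sketch of the skeleton (everything in it is illustrative: computeValue1/computeValue2 stand in for my lookup-table reads, and the output indexing is simplified):

#define threadsPerBlock 512
#define numSteps        500

// Placeholders for the real per-thread calculation (lookup-table reads)
__device__ double computeValue1(int tid, int block, int step) { return 0.0; }
__device__ double computeValue2(int tid, int block, int step) { return 0.0; }

__global__ void KernelCalculateSomething(double *dev_out1, double *dev_out2)
{
	__shared__ double temp1[threadsPerBlock], temp2[threadsPerBlock];
	int tid = threadIdx.x;

	for (int i = 0; i < numSteps; i++) {
		temp1[tid] = computeValue1(tid, blockIdx.x, i);
		temp2[tid] = computeValue2(tid, blockIdx.x, i);
		__syncthreads();

		// tree reduction of temp1 and temp2 into element 0
		for (int s = threadsPerBlock / 2; s > 0; s /= 2) {
			if (tid < s) {
				temp1[tid] += temp1[tid + s];
				temp2[tid] += temp2[tid + s];
			}
			__syncthreads();
		}

		// one output per block per loop iteration
		if (tid == 0) {
			dev_out1[blockIdx.x * numSteps + i] = temp1[0];
			dev_out2[blockIdx.x * numSteps + i] = temp2[0];
		}
		__syncthreads();  // keep the block in step before the next pass
	}
}

// launched as: KernelCalculateSomething<<<80, threadsPerBlock>>>(d_out1, d_out2);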

As I mentioned before, I started from the reduction code in CUDA by Example, shown at the top of the thread.


Much of my latency is in the sum: without the sum the kernel averages 7 ms; with the sum it averages 25 ms.

First I replaced the while loop with explicit iterations to see how much each one added to the time:

//////////////////////////////////////////////////////////////////////////
// One iteration of the code takes: 14 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 256) {
	temp_X_minus[tid] += temp_X_minus[tid + 256];
	temp_X_plus[tid]  += temp_X_plus[tid + 256];
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// Two iterations of the code take: 18 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 128) {
	temp_X_minus[tid] += temp_X_minus[tid + 128];
	temp_X_plus[tid]  += temp_X_plus[tid + 128];
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// Three iterations of the code take: 20 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 64) {
	temp_X_minus[tid] += temp_X_minus[tid + 64];
	temp_X_plus[tid]  += temp_X_plus[tid + 64];
}
__syncthreads();

Just one iteration, summing from 512 values down to 256, brought the time to 14 ms.

Next I tried a different approach:

// Different way of summing:
// 1. Declare two more shared arrays for the partial sums
// 2. Each array has 8 elements
// 3. Sum 64 elements into each array element

The code is:

__shared__ double tempSumXplus [8];
__shared__ double tempSumXminus[8];

if (tid < 8) {
	int offset = tid * 64;
	double sumPlus  = 0.0;
	double sumMinus = 0.0;
	// In my original this was one fully unrolled 64-term expression;
	// the #pragma unroll asks the compiler to generate the same thing.
	#pragma unroll
	for (int j = 0; j < 64; j++) {
		sumPlus  += temp_X_plus [offset + j];
		sumMinus += temp_X_minus[offset + j];
	}
	tempSumXplus [tid] = sumPlus;
	tempSumXminus[tid] = sumMinus;
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// One thread sums the 8 partial results
//////////////////////////////////////////////////////////////////////////
if (tid == 0) {
	dev_X_plus [k*sizeAlpha + alphaIndex] = tempSumXplus[0] + tempSumXplus[1]
		+ tempSumXplus[2] + tempSumXplus[3] + tempSumXplus[4]
		+ tempSumXplus[5] + tempSumXplus[6] + tempSumXplus[7];
	dev_X_minus[k*sizeAlpha + alphaIndex] = tempSumXminus[0] + tempSumXminus[1]
		+ tempSumXminus[2] + tempSumXminus[3] + tempSumXminus[4]
		+ tempSumXminus[5] + tempSumXminus[6] + tempSumXminus[7];
}

This code took a total of 14 ms.
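To try different splits without rewriting the unrolled sums each time, the chunked sum can be parameterized (a sketch with a hypothetical helper; chunkedSum<8, 512> reproduces the 8 x 64 version above):

// NSUMS partial sums over an N-element shared array: the first NSUMS
// threads each sum a contiguous chunk of N/NSUMS elements.
template <int NSUMS, int N>
__device__ void chunkedSum(const double *in, double *partial, int tid)
{
	const int chunk = N / NSUMS;
	if (tid < NSUMS) {
		double s = 0.0;
		#pragma unroll
		for (int j = 0; j < chunk; j++)
			s += in[tid * chunk + j];
		partial[tid] = s;
	}
	__syncthreads();
}

// usage inside the kernel, equivalent to the 8 x 64 code above:
//   chunkedSum<8, 512>(temp_X_plus,  tempSumXplus,  tid);
//   chunkedSum<8, 512>(temp_X_minus, tempSumXminus, tid);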

I tried various combinations of array lengths for the shared partial-sum arrays (partial sums x elements per sum). The one shown above, 8 x 64, took 14.2 ms:

4 x 128 took 15.6 ms

8 x 64 took 14.2 ms

16 x 32 took 15.1 ms

32 x 16 took 16.0 ms for 1 thread final sum, 15.5 ms for 2 thread final sum

64 x 8 took 16.7 ms for 2 thread final sum.

Each of these results beats the code from CUDA by Example, with 8 x 64 being the optimum.

I suspect the optimum differs for each code base and even for each input, but this was an interesting enough result that I wanted to share it.