Summing threads

Hi everyone,

I am trying to optimize my CUDA code. I timed each portion of my kernel and found that the main bottleneck is the per-block sum across threads (the reduction).

The code takes the values in the shared-memory array temp_X_plus and sums them into temp_X_plus[0]. It is the standard reduction code found in CUDA by Example.

Are there faster algorithms or better configurations out there? Right now this section accounts for 70% of my running time, so I’d appreciate any tips.

////////////////////////////////////////////////////////////////
// Sum Threads
////////////////////////////////////////////////////////////////
__shared__ double temp[threadsPerBlock];
// (temp[] is filled and __syncthreads() is called before this point)
int i = threadsPerBlock / 2;  // start with half the threads in the block
while (i != 0) {
	if (tid < i) {
		temp[tid] += temp[tid + i];
	}
	__syncthreads();
	i /= 2;
}

If this is taking up 70% of the execution time of your kernel, the rest must be pretty minimal. In that case, have each thread of the kernel perform several of those calculations and sum them trivially in a register before doing the final reduction through shared memory.
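Something along these lines (a minimal sketch of the idea; calculateValue() and ITEMS_PER_THREAD are hypothetical placeholders, not names from your code):

// Launch with 64 threads per block instead of 512: each thread computes
// 8 of the 512 values and sums them in a register first, so the shared-
// memory reduction starts from 64 partial sums instead of 512.
// calculateValue() and ITEMS_PER_THREAD are hypothetical placeholders.
const int ITEMS_PER_THREAD = 8;

double mySum = 0.0;
for (int j = 0; j < ITEMS_PER_THREAD; j++)
	mySum += calculateValue(tid + j * blockDim.x);  // register accumulation

__shared__ double temp[64];
temp[tid] = mySum;
__syncthreads();
// ...then the usual tree reduction over temp[0..63]...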

How did you measure execution time to conclude that this code takes up 70% of it? If you just commented the reduction out, it is quite likely that the compiler also removed the code that calculates the contents of temp, since it is no longer needed. That would make the reduction appear to take all the time, while in reality it is spent calculating temp.
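As a side note, a reliable way to get whole-kernel timings is CUDA events (a minimal sketch; the kernel name and launch configuration are placeholders). Per-section numbers are safer taken from a profiler, since commenting code out invites exactly the dead-code elimination described above.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
KernelCalculateSomething<<<80, 512>>>(/* parameters */);  // placeholder launch
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);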

Hi Tera,

I left the calculation of temp in place but commented out the sum. You are right that the rest of the code is minimal; I have cut the latency step by step by pre-computing all my calculations into lookup tables.

Thanks for the suggestion on the speed-up. I tried removing the last few iterations (i = 2, i = 4, etc.) and there is a noticeable difference, as expected.
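For reference, the usual way to drop those last iterations entirely is to unroll the final warp once the stride fits within 32 threads (a sketch, not my exact code; on pre-Volta GPUs the classic version relies on warps running in lockstep and omits the __syncwarp() calls, which Volta and later require):

int i = blockDim.x / 2;
while (i > 32) {                      // tree reduction down to one warp
	if (tid < i)
		temp[tid] += temp[tid + i];
	__syncthreads();
	i /= 2;
}
if (tid < 32) {                       // final warp, fully unrolled
	volatile double *v = temp;        // volatile keeps values out of registers
	v[tid] += v[tid + 32]; __syncwarp();
	v[tid] += v[tid + 16]; __syncwarp();
	v[tid] += v[tid +  8]; __syncwarp();
	v[tid] += v[tid +  4]; __syncwarp();
	v[tid] += v[tid +  2]; __syncwarp();
	v[tid] += v[tid +  1];
}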

I want to follow up with my results in case anyone is interested.

Here’s more detail about what I’m doing:

My pseudocode looks like:

KernelCalculateSomething(Parameters) {

    Declarations;

    __shared__ double temp1[threadsPerBlock], temp2[threadsPerBlock];

    for (i = 0 to 500) {
        assign values to temp1, temp2 based on tid, block id, and i;
        sum temp1[], temp2[];
        assign sumTemp1, sumTemp2 to global arrays for further processing
            (global array length is # of blocks * 500);
    }
}

Currently I launch 80 blocks with 512 threads in each block.
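To make that concrete, here is a minimal compilable sketch of the skeleton (everything in it is illustrative: computeValue1/computeValue2 stand in for my lookup-table reads, and the output indexing is simplified):

#define threadsPerBlock 512
#define numSteps        500

// Placeholders for the real per-thread calculation (lookup-table reads)
__device__ double computeValue1(int tid, int block, int step) { return 0.0; }
__device__ double computeValue2(int tid, int block, int step) { return 0.0; }

__global__ void KernelCalculateSomething(double *dev_out1, double *dev_out2)
{
	__shared__ double temp1[threadsPerBlock], temp2[threadsPerBlock];
	int tid = threadIdx.x;

	for (int i = 0; i < numSteps; i++) {
		temp1[tid] = computeValue1(tid, blockIdx.x, i);
		temp2[tid] = computeValue2(tid, blockIdx.x, i);
		__syncthreads();

		// tree reduction of temp1 and temp2 into element 0
		for (int s = threadsPerBlock / 2; s > 0; s /= 2) {
			if (tid < s) {
				temp1[tid] += temp1[tid + s];
				temp2[tid] += temp2[tid + s];
			}
			__syncthreads();
		}

		// one output per block per loop iteration
		if (tid == 0) {
			dev_out1[blockIdx.x * numSteps + i] = temp1[0];
			dev_out2[blockIdx.x * numSteps + i] = temp2[0];
		}
		__syncthreads();  // keep the block in step before the next pass
	}
}

// launched as: KernelCalculateSomething<<<80, threadsPerBlock>>>(d_out1, d_out2);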

As I mentioned before, I started from the reduction code in CUDA by Example, shown at the top of the thread.


Much of my latency is in the sum: without the sum the kernel averages 7 ms; with the sum it averages 25 ms.

First I replaced the while loop with explicit iterations to see how much each one added to the time:

//////////////////////////////////////////////////////////////////////////
// One iteration of the code takes: 14 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 256) {
	temp_X_minus[tid] += temp_X_minus[tid + 256];
	temp_X_plus[tid]  += temp_X_plus[tid + 256];
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// Two iterations of the code take: 18 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 128) {
	temp_X_minus[tid] += temp_X_minus[tid + 128];
	temp_X_plus[tid]  += temp_X_plus[tid + 128];
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// Three iterations of the code take: 20 ms
//////////////////////////////////////////////////////////////////////////
if (tid < 64) {
	temp_X_minus[tid] += temp_X_minus[tid + 64];
	temp_X_plus[tid]  += temp_X_plus[tid + 64];
}
__syncthreads();

Just one iteration, summing from 512 values down to 256, brought the time to 14 ms.

Next I tried a different approach:

// Different way of summing:
// 1. Declare two more shared arrays for the partial sums
// 2. Each array has 8 elements
// 3. Sum 64 elements into each array element

The code is:

__shared__ double tempSumXplus [8];
__shared__ double tempSumXminus[8];

if (tid < 8) {
	int offset = tid * 64;
	double sumPlus  = 0.0;
	double sumMinus = 0.0;
	// In my original this was one fully unrolled 64-term expression;
	// the #pragma unroll asks the compiler to generate the same thing.
	#pragma unroll
	for (int j = 0; j < 64; j++) {
		sumPlus  += temp_X_plus [offset + j];
		sumMinus += temp_X_minus[offset + j];
	}
	tempSumXplus [tid] = sumPlus;
	tempSumXminus[tid] = sumMinus;
}
__syncthreads();

//////////////////////////////////////////////////////////////////////////
// One thread sums the 8 partial results
//////////////////////////////////////////////////////////////////////////
if (tid == 0) {
	dev_X_plus [k*sizeAlpha + alphaIndex] = tempSumXplus[0] + tempSumXplus[1]
		+ tempSumXplus[2] + tempSumXplus[3] + tempSumXplus[4]
		+ tempSumXplus[5] + tempSumXplus[6] + tempSumXplus[7];
	dev_X_minus[k*sizeAlpha + alphaIndex] = tempSumXminus[0] + tempSumXminus[1]
		+ tempSumXminus[2] + tempSumXminus[3] + tempSumXminus[4]
		+ tempSumXminus[5] + tempSumXminus[6] + tempSumXminus[7];
}

This code took a total of 14 ms.
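To try different splits without rewriting the unrolled sums each time, the chunked sum can be parameterized (a sketch with a hypothetical helper; chunkedSum<8, 512> reproduces the 8 x 64 version above):

// NSUMS partial sums over an N-element shared array: the first NSUMS
// threads each sum a contiguous chunk of N/NSUMS elements.
template <int NSUMS, int N>
__device__ void chunkedSum(const double *in, double *partial, int tid)
{
	const int chunk = N / NSUMS;
	if (tid < NSUMS) {
		double s = 0.0;
		#pragma unroll
		for (int j = 0; j < chunk; j++)
			s += in[tid * chunk + j];
		partial[tid] = s;
	}
	__syncthreads();
}

// usage inside the kernel, equivalent to the 8 x 64 code above:
//   chunkedSum<8, 512>(temp_X_plus,  tempSumXplus,  tid);
//   chunkedSum<8, 512>(temp_X_minus, tempSumXminus, tid);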

I tried various combinations of array lengths for the shared partial-sum arrays (partial sums x elements per sum). The one shown above, 8 x 64, took 14.2 ms:

4 x 128 took 15.6 ms

8 x 64 took 14.2 ms

16 x 32 took 15.1 ms

32 x 16 took 16.0 ms for 1 thread final sum, 15.5 ms for 2 thread final sum

64 x 8 took 16.7 ms for 2 thread final sum.

Each of these results beats the code from CUDA by Example, with 8 x 64 being the optimum.

I suspect the optimum differs for each code base and even for each input, but this was an interesting enough result that I wanted to share it.