I want to follow up with my results in case anyone is interested.
Here’s more detail about what I’m doing:
My pseudocode looks like:
KernelCalculateSomething(Parameters){
    Declarations;
    __shared__ double temp1[threadsPerBlock], temp2[threadsPerBlock];
    for (i = 0 to 500){
        assign values to temp1, temp2 based on tid, block id, and i;
        sum temp1[], temp2[];
        write sumTemp1, sumTemp2 to global arrays for further processing
            (each global array's length is # of blocks * 500);
    }
}
Currently I launch 80 blocks with 512 threads in each block.
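In case the pseudocode is too abstract, here is a rough, untested skeleton of that structure; the output names dev_out1/dev_out2 and the per-thread computation are placeholders, not my real code:
//////////////////////////////////////////////////////////////////////////
// Rough skeleton of the kernel structure (untested; names are placeholders)
//////////////////////////////////////////////////////////////////////////
#define threadsPerBlock 512

__global__ void KernelCalculateSomething(double *dev_out1, double *dev_out2)
{
    __shared__ double temp_X_plus[threadsPerBlock];
    __shared__ double temp_X_minus[threadsPerBlock];
    int tid = threadIdx.x;

    for (int i = 0; i < 500; i++) {
        // each thread computes its two contributions for this iteration
        temp_X_plus[tid]  = 0.0; // placeholder: value based on tid, blockIdx.x, and i
        temp_X_minus[tid] = 0.0; // placeholder
        __syncthreads();

        // ... block-wide sum of temp_X_plus[] and temp_X_minus[] goes here ...

        if (tid == 0) {
            // one result per block per iteration
            dev_out1[blockIdx.x * 500 + i] = temp_X_plus[0];
            dev_out2[blockIdx.x * 500 + i] = temp_X_minus[0];
        }
        __syncthreads();
    }
}
// launched as: KernelCalculateSomething<<<80, threadsPerBlock>>>(dev_out1, dev_out2);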
As I mentioned before: if I use the reduction code from CUDA by Example:
////////////////////////////////////////////////////////////////
// Sum Threads
////////////////////////////////////////////////////////////////
__shared__ double temp[threadsPerBlock];
int i = threadsPerBlock / 2; // start at half the number of threads per block
while (i != 0) {
    if (tid < i){
        temp[tid] += temp[tid + i];
    }
    __syncthreads();
    i /= 2;
}
Much of my latency is in the sum: without the sum the kernel averages 7 ms; with the sum it averages 25 ms.
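In case anyone wants to reproduce the timings, a simple way to measure kernel time is with CUDA events, roughly like this (the kernel name and arguments are the placeholders from the skeleton above; error checking omitted):
//////////////////////////////////////////////////////////////////////////
// Kernel timing with CUDA events (sketch)
//////////////////////////////////////////////////////////////////////////
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
KernelCalculateSomething<<<80, 512>>>(dev_out1, dev_out2);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait for the kernel to finish

float elapsedMs = 0.0f;
cudaEventElapsedTime(&elapsedMs, start, stop);
printf("kernel time: %.3f ms\n", elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);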
First, I unrolled the while loop to see how much each successive iteration added to the time:
//////////////////////////////////////////////////////////////////////////
// One iteration of code takes: 14 ms
////////////////////////////////////////////////////////////////////////
if (tid < 256){
    temp_X_minus[tid] += temp_X_minus[tid + 256];
    temp_X_plus[tid] += temp_X_plus[tid + 256];
}
__syncthreads();
//////////////////////////////////////////////////////////////////////////
// Two iterations of code take: 18 ms
////////////////////////////////////////////////////////////////////////
if (tid < 128){
    temp_X_minus[tid] += temp_X_minus[tid + 128];
    temp_X_plus[tid] += temp_X_plus[tid + 128];
}
__syncthreads();
//////////////////////////////////////////////////////////////////////////
// Three iterations of code take: 20 ms
////////////////////////////////////////////////////////////////////////
if (tid < 64){
    temp_X_minus[tid] += temp_X_minus[tid + 64];
    temp_X_plus[tid] += temp_X_plus[tid + 64];
}
__syncthreads();
Just the one iteration that sums from 512 values down to 256 brought my time up to 14 ms.
Next I tried a different approach:
// Different way of summing:
// 1. Declare two more shared arrays for sums
// 2. Each array has 8 elements
// 3. Sum 64 elements for each array element
The code is:
__shared__ double tempSumXplus [8];
__shared__ double tempSumXminus [8];
int offset;
if (tid < 8){
offset = tid * 64;
tempSumXplus [tid]=temp_X_plus[offset + 0] + temp_X_plus[offset + 1] + temp_X_plus[offset + 2] + temp_X_plus[offset + 3] + temp_X_plus[offset + 4] + temp_X_plus[offset + 5] + \
temp_X_plus[offset + 6] + temp_X_plus[offset + 7] + temp_X_plus[offset + 8] + temp_X_plus[offset + 9] + temp_X_plus[offset + 10] + temp_X_plus[offset + 11] + \
temp_X_plus[offset + 12] + temp_X_plus[offset + 13] + temp_X_plus[offset + 14] + temp_X_plus[offset + 15] + temp_X_plus[offset + 16] + temp_X_plus[offset + 17] + \
temp_X_plus[offset + 18] + temp_X_plus[offset + 19] + temp_X_plus[offset + 20] + temp_X_plus[offset + 21] + temp_X_plus[offset + 22] + temp_X_plus[offset + 23] + \
temp_X_plus[offset + 24] + temp_X_plus[offset + 25] + temp_X_plus[offset + 26] + temp_X_plus[offset + 27] + temp_X_plus[offset + 28] + temp_X_plus[offset + 29] + \
temp_X_plus[offset + 30] + temp_X_plus[offset + 31] + temp_X_plus[offset + 32] + temp_X_plus[offset + 33] + temp_X_plus[offset + 34] + temp_X_plus[offset + 35] + \
temp_X_plus[offset + 36] + temp_X_plus[offset + 37] + temp_X_plus[offset + 38] + temp_X_plus[offset + 39] + temp_X_plus[offset + 40] + temp_X_plus[offset + 41] + \
temp_X_plus[offset + 42] + temp_X_plus[offset + 43] + temp_X_plus[offset + 44] + temp_X_plus[offset + 45] + temp_X_plus[offset + 46] + temp_X_plus[offset + 47] + \
temp_X_plus[offset + 48] + temp_X_plus[offset + 49] + temp_X_plus[offset + 50] + temp_X_plus[offset + 51] + temp_X_plus[offset + 52] + temp_X_plus[offset + 53] + \
temp_X_plus[offset + 54] + temp_X_plus[offset + 55] + temp_X_plus[offset + 56] + temp_X_plus[offset + 57] + temp_X_plus[offset + 58] + temp_X_plus[offset + 59] + \
temp_X_plus[offset + 60] + temp_X_plus[offset + 61] + temp_X_plus[offset + 62] + temp_X_plus[offset + 63];
tempSumXminus[tid] = temp_X_minus[offset + 0] + temp_X_minus[offset + 1] + temp_X_minus[offset + 2] + temp_X_minus[offset + 3] + temp_X_minus[offset + 4] + temp_X_minus[offset + 5] + \
temp_X_minus[offset + 6] + temp_X_minus[offset + 7] + temp_X_minus[offset + 8] + temp_X_minus[offset + 9] + temp_X_minus[offset + 10] + temp_X_minus[offset + 11] + \
temp_X_minus[offset + 12] + temp_X_minus[offset + 13] + temp_X_minus[offset + 14] + temp_X_minus[offset + 15] + temp_X_minus[offset + 16] + temp_X_minus[offset + 17] + \
temp_X_minus[offset + 18] + temp_X_minus[offset + 19] + temp_X_minus[offset + 20] + temp_X_minus[offset + 21] + temp_X_minus[offset + 22] + temp_X_minus[offset + 23] + \
temp_X_minus[offset + 24] + temp_X_minus[offset + 25] + temp_X_minus[offset + 26] + temp_X_minus[offset + 27] + temp_X_minus[offset + 28] + temp_X_minus[offset + 29] + \
temp_X_minus[offset + 30] + temp_X_minus[offset + 31] + temp_X_minus[offset + 32] + temp_X_minus[offset + 33] + temp_X_minus[offset + 34] + temp_X_minus[offset + 35] + \
temp_X_minus[offset + 36] + temp_X_minus[offset + 37] + temp_X_minus[offset + 38] + temp_X_minus[offset + 39] + temp_X_minus[offset + 40] + temp_X_minus[offset + 41] + \
temp_X_minus[offset + 42] + temp_X_minus[offset + 43] + temp_X_minus[offset + 44] + temp_X_minus[offset + 45] + temp_X_minus[offset + 46] + temp_X_minus[offset + 47] + \
temp_X_minus[offset + 48] + temp_X_minus[offset + 49] + temp_X_minus[offset + 50] + temp_X_minus[offset + 51] + temp_X_minus[offset + 52] + temp_X_minus[offset + 53] + \
temp_X_minus[offset + 54] + temp_X_minus[offset + 55] + temp_X_minus[offset + 56] + temp_X_minus[offset + 57] + temp_X_minus[offset + 58] + temp_X_minus[offset + 59] + \
temp_X_minus[offset + 60] + temp_X_minus[offset + 61] + temp_X_minus[offset + 62] + temp_X_minus[offset + 63];
}
__syncthreads();
//////////////////////////////////////////////////////////////////////////////////////
// One Thread
//////////////////////////////////////////////////////////////////////////////////////
if (tid == 0){
    dev_X_plus[k*sizeAlpha + alphaIndex] = tempSumXplus[0] + tempSumXplus[1] + tempSumXplus[2] + tempSumXplus[3] + tempSumXplus[4] + tempSumXplus[5] + tempSumXplus[6] + tempSumXplus[7];
    dev_X_minus[k*sizeAlpha + alphaIndex] = tempSumXminus[0] + tempSumXminus[1] + tempSumXminus[2] + tempSumXminus[3] + tempSumXminus[4] + tempSumXminus[5] + tempSumXminus[6] + tempSumXminus[7];
}
This code took a total of 14 ms.
I tried various combinations of shared sum-array lengths. The one shown above has an array length of 8, with each element summing 64 values; it took 14.2 ms. Results (array length x values summed per element):
4 x 128: 15.6 ms
8 x 64: 14.2 ms
16 x 32: 15.1 ms
32 x 16: 16.0 ms with a 1-thread final sum, 15.5 ms with a 2-thread final sum
64 x 8: 16.7 ms with a 2-thread final sum
Each of these results is better than the code shown in CUDA by Example, with 8 x 64 being the optimum.
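For anyone who wants to try other splits, the same idea can be written with the split as a parameter. Below is my own untested generalization of the code above (SUM_SLOTS is the shared sum-array length, a name I made up for this sketch); it is shown for the X_plus side only, the X_minus side is identical:
//////////////////////////////////////////////////////////////////////////
// Parameterized split-sum (untested sketch, X_plus side only)
//////////////////////////////////////////////////////////////////////////
#define threadsPerBlock 512
#define SUM_SLOTS 8                     // 8 x 64 was the best split for me

__shared__ double tempSumXplus[SUM_SLOTS];

if (tid < SUM_SLOTS){
    const int chunk = threadsPerBlock / SUM_SLOTS;   // values summed per slot
    int offset = tid * chunk;
    double s = 0.0;
    for (int j = 0; j < chunk; j++){
        s += temp_X_plus[offset + j];
    }
    tempSumXplus[tid] = s;
}
__syncthreads();

if (tid == 0){
    double total = 0.0;
    for (int j = 0; j < SUM_SLOTS; j++){
        total += tempSumXplus[j];
    }
    dev_X_plus[k*sizeAlpha + alphaIndex] = total;
}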
I suspect the optimal split is different for each person's code, and maybe even for different inputs, but this was an interesting enough result that I wanted to share it.