device global memory update questions

Hi all,

I am new to cuda. One of the things that bother me a while is that I am writing a simple cuda kernel to sum the float arrays of 8 element with two threads in single block. Each thread is responsible 4 element summation, then the system sum two threads’ result for final summation. forThe first thread is responsible for summing array[0]~array[3], and the second one is responsible for summing array[4]~array[7].
Once each thread is finished. The summation result is sum to the global memory, i.e. result[0] in the example codes. My question is in my codes,
I always get the result[0] either 6.0 or 22.0. The prospective answer should be 28.0. My understanding is that maybe two thread finished the summation almost the same time, and their result[0] can not be updated for two threads mutually. Due to such reason, I try to let two thread
to wait different times to update to result[0]. But as shown in the codes, the output come out is still 22.0 or 6.0.

Could any kindly explain such phenomenon for me.
Many thanks.

===== Codes =====
global void
ComputeKernels_test1( float input, float result)
{

// Thread index
int tx = threadIdx.x; // tx represent polygon id

float accu=0.0;

for (int i = tx4; i < tx4+4; i++) {
accu += input[i];
}

int wait = (tx+1)*10000000;
while(wait–); // wait for different time for different thread

result[0] += accu;

}

void main()
{
float h_Input[8];
for (int i = 0; i < 8; i++) {
h_Input[i] = (float)i;
printf("%f\n", h_Input[i]);
}

float d_Input;
cutilSafeCall(cudaMalloc((void
*) &d_Input, 8sizeof(float)));
cutilSafeCall(cudaMemcpy(d_Input, h_Input, 8
sizeof(float), cudaMemcpyHostToDevice) );

float d_Result;
cutilSafeCall(cudaMalloc((void
*) &d_Result, 8sizeof(float)));
cutilSafeCall(cudaMemset( d_Result, 0, 8
sizeof(float)));

ComputeKernels_test1<<< 1, 2 >>>(d_Input, d_Result);

float h_Result[8];

cutilSafeCall(cudaMemcpy(h_Result, d_Result, 8*sizeof(float), cudaMemcpyDeviceToHost) );

printf(“total = %f\n”, h_Result[0]);

}

I see several problems with your code:underutilization, submultiplicty of the warp size, race conditions, and inefficient use of clock cycles.

CUDA is not a lightrider as far as the implementation goes. One needs to have a strong grip on threading and memory locality with threading. I cannot go into all the details here, as I do not have the space. My recommendation is to grab a book on threading, and make sure you understand well what happens within different threads on a CPU. You need that knowledge to be able to understand CUDA. Once you have that, the CUDA Programming Guide will take on a new and vivid meaning.

Regards,
Alex

result[0] += accu;

You have a race condition right here.

Despite having the threads carry out different waits in theory, in reality they are both in the same warp, meaning they both execute the same instruction in an SIMD-like fashion. Thread 1 will do the while loop 20000000 times and Thread 0 will do it 10000000 times and then proceed to do NOPs another 10000000 times. They will both be issued the same instruction, only now since Thread 0 should logically do something different, it will be masked out. After this, both threads continue to write results to the same address and it’s a classical race.

You need to learn about race conditions (if you haven’t already) and about CUDA execution model. Reserve yourself some time to read the Programming Guide in whole (even parts you think aren’t useful for starters).

Hi Nuke,

Thanks for your comments and advices. I know my codes are not optimized and inefficient, but I just want to enhance the phenomenon so write such simple codes. Thanks again.

Thanks Big mac.

You really hit the point. It reminds me that two threads only are definitely in the same warp as one warp contains 32 threads.

I will try to learn more about what you suggest. But before that, I am curious that in what tricks or ways that I can do what I want. I mean use some thread idle time to sum some sub-summations. I know the current ugly way is to to let result[0] to store thread0’s result and result[1] to store thread1’s result, and copy out result[0],result[1] from device to host, and sum result[0] and result[1] in cpu. Is that possible do it in the single cuda kernel run? The real application may involve very large 1-D array summations, to download large array of result to cpu for further summation is

not so economical.

Thanks.

You might want to check out the parallel reduction algorithm. It allows to perform binary operations, such as addition, on n elements in log n steps on parallel hardware.

An example is provided in the SDK.

Another way to do that is to perform a linear addition in the kernel - it works for small datasets like your two threads. Something like

__global__ void kernel( float *input, float* result)

{

// Thread index

int tx = threadIdx.x; // tx represent polygon id

float accu=0.0;

for (int i = tx*4; i < tx*4+4; i++) {

accu += input[i];

}

int wait = (tx+1)*10000000;

while(wait--); // wait for different time for different thread

result[tx] += accu; //use separate memory locations

if(tx == 0) //only thread 0 does summation, no race

  result[0]+=result[1];

}

Note: this code is horrendously suboptimal in many ways but it’s better than copying back to CPU. If you want to do parallel summation, use the reduction algorithm or at the very least, if you’re doing linear summation, use shared memory.

As for having some threads do computations in idle time - threads within a single warp always do the same single instruction (or nothing at all if they’re masked), there’s no way to have them do something different. Threads in different warps and/or blocks can of course diverge at will.

nirvanalangiou_wang,

__global__ void kernel( float *input, float* result)

{

// Thread index

int tx = threadIdx.x; // tx represent polygon id

float accu=0.0f; //<- Note the suffix 'f'; GPU no longer does double to float conversion

#pragma unroll  //<- Small loop; we can safely unroll it

for (int i = tx*4; i < tx*4+4; i++) {

accu += input[i];

}

// <- Removed wait section

__shared__ temp[BLOCK_DIM]; // BLOCK_DIM should be a multiple of the warp size

temp[tx] = accu; //use separate memory locations 

if(!tx) //only thread 0 does summation, no race

  *result = temp[0] + temp[1];

}

(credits: Big_Mac)

I haven’t tested this, so don’t rely on its correctness. Note that since we’re assigning a value to result, we no longer need to call CudaMemset in the host code. A summation such as this is not quite the best example for massive parallelism, and this specific example will always run faster on the CPU due to the associated GPU overhead.

In order to fully exemplify the power of the GPU we need calculations that are completely independent of each other (don’t depend on each other’s results, read and write to different independent memory locations, etc). The Nbody example is a great demo for just that. Be sure to check it out

Great thanks to Alex and Big_Mac’s help on explaining the case and the examples to clarify my concept.
Thanks again.