Hi all,
I am new to CUDA. Something has been bothering me for a while: I am writing a simple CUDA kernel that sums a float array of 8 elements using two threads in a single block. Each thread sums 4 elements, and the two partial results are then added together for the final sum. The first thread sums array[0]~array[3], and the second sums array[4]~array[7].
Once a thread has finished, it adds its partial sum to global memory, i.e. result[0] in the example code. My problem is that result[0] always comes out as either 6.0 or 22.0, while the expected answer is 28.0. My understanding is that the two threads may finish their summation at almost the same time, so that one thread's update to result[0] is lost. For that reason, I tried making the two threads wait for different lengths of time before updating result[0]. But as shown in the code, the output is still 22.0 or 6.0.
Could anyone kindly explain this behaviour to me?
Many thanks.
===== Code =====
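#include <stdio.h>
#include <cutil_inline.h> // CUDA SDK helper header that provides cutilSafeCall

__global__ void
ComputeKernels_test1(float* input, float* result)
{
    // Thread index
    int tx = threadIdx.x; // tx represents polygon id
    float accu = 0.0f;
    // Each thread sums its own 4 consecutive elements
    for (int i = tx * 4; i < tx * 4 + 4; i++) {
        accu += input[i];
    }
    int wait = (tx + 1) * 10000000;
    while (wait--); // wait for a different time in each thread
    result[0] += accu; // both threads update the same global location
}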
int main()
{
    float h_Input[8];
    for (int i = 0; i < 8; i++) {
        h_Input[i] = (float)i;
        printf("%f\n", h_Input[i]);
    }
    float* d_Input;
    cutilSafeCall(cudaMalloc((void**) &d_Input, 8 * sizeof(float)));
    cutilSafeCall(cudaMemcpy(d_Input, h_Input, 8 * sizeof(float), cudaMemcpyHostToDevice));
    float* d_Result;
    cutilSafeCall(cudaMalloc((void**) &d_Result, 8 * sizeof(float)));
    cutilSafeCall(cudaMemset(d_Result, 0, 8 * sizeof(float)));
    // One block, two threads
    ComputeKernels_test1<<< 1, 2 >>>(d_Input, d_Result);
    float h_Result[8];
    cutilSafeCall(cudaMemcpy(h_Result, d_Result, 8 * sizeof(float), cudaMemcpyDeviceToHost));
    printf("total = %f\n", h_Result[0]);
    return 0;
}