Hi.
I just started working with CUDA and ran into a problem while doing some tests.
I’m trying to sum all the elements of a small array (5 elements).
What am I doing wrong? The output of the program below is always 0. I tried some other approaches, but the answer was either 1 (0 + the last element) or 0.
PS: Sorry for any grammar errors. I’m from Brazil.
#include <stdio.h>
#define SIZE 10
__device__ volatile float sum = 0;
__global__ void ArraySum(float *array)
{
    int index = threadIdx.x;
    sum = sum + array[index];
    __syncthreads();
}
int main(void)
{
    float array[SIZE], *param, h_sum;
    size_t size = sizeof(float) * SIZE;
    cudaMalloc(&param, size);
    for (int i = 0; i < SIZE; i++)
    {
        array[i] = 1;
    }
    cudaMemcpy(param, array, size, cudaMemcpyHostToDevice);
    ArraySum<<<1, SIZE>>>(param);
    cudaMemcpyFromSymbol(&h_sum, &sum, sizeof(float));
    cudaFree(param);
    printf("%.2f\n", h_sum);
}
You are passing the wrong parameter to cudaMemcpyFromSymbol: it expects the symbol itself (sum), not its address (&sum). Please check page 34 of the CUDA programming guide (RC4.0).
You should also check the error code returned by each API call; that will show you which call is failing.
cudaError_t status = cudaMemcpyFromSymbol(&h_sum, sum, sizeof(float));
if (cudaSuccess != status) {
    printf("Error: %s\n", cudaGetErrorString(status));
    exit(1);
}
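To make "check every API call" less tedious, the check is often wrapped in a macro. The following is only a sketch: CUDA_CHECK, d_value and read_device_value are hypothetical names invented for this illustration and are not part of the CUDA API or of the original program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): prints the failing
// call and the runtime's error string, then exits, if the call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t status_ = (call);                                 \
        if (status_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,       \
                    __LINE__, #call, cudaGetErrorString(status_));    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative device symbol, assumed only for this example.
__device__ float d_value = 42.0f;

// Copies the device symbol back to the host. Note that the symbol itself
// is passed, not its address.
float read_device_value(void)
{
    float h_value = 0.0f;
    CUDA_CHECK(cudaMemcpyFromSymbol(&h_value, d_value, sizeof(float)));
    return h_value;
}

int main(void)
{
    printf("%.2f\n", read_device_value());
    return 0;
}
```

The same macro can wrap cudaMalloc, cudaMemcpy and cudaFree. A kernel launch returns no status, so it is usually followed by a check of cudaGetLastError().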
Finally, you have a race condition. All 10 threads belong to the same warp, and a warp shares a common program counter,
so every thread reads the same old value of sum and then writes to it at the same time; only one thread's write takes effect.
__global__ void ArraySum(float *array)
{
    int index = threadIdx.x;
    sum = sum + array[index];
    __syncthreads();
}
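One simple way to remove the race in the kernel above is to make the update atomic. This is a minimal sketch, not the approach the posters prescribe: it assumes a device of compute capability 2.0 or higher (required for atomicAdd on float), drops the volatile qualifier so that &sum is a plain float pointer, and uses the hypothetical names ArraySumAtomic and sum_with_atomics. Error checking is omitted to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 10

__device__ float sum = 0.0f;

// Each thread adds its element to the global accumulator atomically, so
// concurrent read-modify-write sequences can no longer overwrite each other.
__global__ void ArraySumAtomic(const float *array, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        atomicAdd(&sum, array[index]);
    }
}

// Hypothetical host wrapper: copies the input to the device, runs the
// kernel and returns the accumulated sum.
float sum_with_atomics(const float *h_array, int n)
{
    float *d_array = NULL;
    float zero = 0.0f;
    float h_sum = 0.0f;
    size_t bytes = sizeof(float) * n;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaMalloc(&d_array, bytes);
    cudaMemcpy(d_array, h_array, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(sum, &zero, sizeof(float));  // reset the accumulator
    ArraySumAtomic<<<blocks, threads>>>(d_array, n);
    cudaMemcpyFromSymbol(&h_sum, sum, sizeof(float));  // symbol, not &sum
    cudaFree(d_array);
    return h_sum;
}

int main(void)
{
    float array[SIZE];
    for (int i = 0; i < SIZE; i++) {
        array[i] = 1.0f;
    }
    printf("%.2f\n", sum_with_atomics(array, SIZE));
    return 0;
}
```

Atomics serialize the updates to a single address, so this is fine for a small array but tends to scale poorly for large ones.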
You are attempting a “Data Parallel Operation” (a parallel reduction). Mark Harris and others have already done a lot of research on this problem; look up Mark Harris's work and the CUDPP library.
You can also do it with minimal CUDA code using the Thrust library.
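As a rough sketch of the Thrust route, the whole sum can be written with thrust::device_vector and thrust::reduce. The wrapper name sum_with_thrust is hypothetical, and the code assumes a CUDA toolkit that ships Thrust.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

#define SIZE 10

// Hypothetical wrapper: copies the host array into a device_vector and
// lets Thrust sum it on the GPU, starting from an initial value of 0.
float sum_with_thrust(const float *h_array, int n)
{
    thrust::device_vector<float> d_array(h_array, h_array + n);
    return thrust::reduce(d_array.begin(), d_array.end(), 0.0f);
}

int main(void)
{
    float array[SIZE];
    for (int i = 0; i < SIZE; i++) {
        array[i] = 1.0f;
    }
    printf("%.2f\n", sum_with_thrust(array, SIZE));
    return 0;
}
```

No hand-written kernel or explicit cudaMalloc/cudaMemcpy is needed here; the device_vector handles allocation, the host-to-device copy and cleanup.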
raghu
April 6, 2011, 6:53am
You need to perform a reduction operation to calculate the sum of all array elements.
As LSChien rightly pointed out, your current algorithm has a race condition.
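For illustration, a basic single-block reduction in shared memory might look like the sketch below. It is a simplified version of the general technique, not the optimized kernels from the reduction literature. The names BLOCK, BlockSum and sum_with_reduction are hypothetical, the kernel assumes it is launched with exactly one block of BLOCK threads, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define SIZE 10
#define BLOCK 256  // threads per block; must be a power of two

// Single-block tree reduction. Each thread first accumulates a strided
// slice of the input, then the per-thread partial sums are combined
// pairwise in shared memory, halving the active threads at every step.
__global__ void BlockSum(const float *in, float *out, int n)
{
    __shared__ float partial[BLOCK];
    int tid = threadIdx.x;

    float acc = 0.0f;
    for (int i = tid; i < n; i += BLOCK) {
        acc += in[i];
    }
    partial[tid] = acc;
    __syncthreads();

    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // all threads reach this barrier on every step
    }

    if (tid == 0) {
        *out = partial[0];  // only one thread writes the final result
    }
}

// Hypothetical host wrapper around the kernel.
float sum_with_reduction(const float *h_in, int n)
{
    float *d_in = NULL;
    float *d_out = NULL;
    float h_out = 0.0f;

    cudaMalloc(&d_in, sizeof(float) * n);
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(float) * n, cudaMemcpyHostToDevice);
    BlockSum<<<1, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main(void)
{
    float array[SIZE];
    for (int i = 0; i < SIZE; i++) {
        array[i] = 1.0f;
    }
    printf("%.2f\n", sum_with_reduction(array, SIZE));
    return 0;
}
```

Because every partial sum is written by exactly one thread between barriers, no two threads ever update the same location at the same time, which is what removes the race.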