Hi.
I just started working with CUDA and I had a problem doing some tests.
I’m trying to sum all the elements of a small array (5 elements).
What am I doing wrong? The output of the program below is always 0. I tried some other approaches, but the answer was either 1 (0 + the last element) or 0.
PS: Sorry for any grammar errors. I’m from Brazil.
#include <stdio.h>
#define SIZE 10
__device__ volatile float sum = 0;
__global__ void ArraySum(float *array)
{
    int index = threadIdx.x;
    sum = sum + array[index];
    __syncthreads();
}
int main(void)
{
    float array[SIZE], *param, h_sum;
    size_t size = sizeof(float) * SIZE;
    cudaMalloc(&param, size);
    for (int i = 0; i < SIZE; i++)
    {
        array[i] = 1;
    }
    cudaMemcpy(param, array, size, cudaMemcpyHostToDevice);
    ArraySum<<<1, SIZE>>>(param);
    cudaMemcpyFromSymbol(&h_sum, &sum, sizeof(float));
    cudaFree(param);
    printf("%.2f\n", h_sum);
}
            
You are passing the wrong parameter to cudaMemcpyFromSymbol: pass the symbol itself (sum), not its address (&sum). Please check page 34 of the CUDA Programming Guide (RC4.0).
You should also check the error code returned by each API call; that will show you which call is failing.
cudaError_t status = cudaMemcpyFromSymbol(&h_sum, sum, sizeof(float));
if (cudaSuccess != status){
    printf("Error: %s\n", cudaGetErrorString(status));
    exit(1);
}
Finally, you have a race condition. All 10 threads belong to the same warp, and a warp shares a common program counter, so every thread reads the same old value of sum and only one thread's write to sum takes effect.
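One way to remove the race, as a minimal sketch rather than the only fix: let each thread add its element with atomicAdd, so no update is lost. This assumes a GPU of compute capability 2.0 or later (needed for atomicAdd on float). The names atomicSum, ArraySumAtomic, and sumWithAtomics are made up for this illustration and are not from the original post.

```cuda
#include <cuda_runtime.h>

// Global accumulator (hypothetical name), reset by the host before each launch.
__device__ float atomicSum = 0.0f;

// Each thread adds its own element atomically, so concurrent updates are not lost.
__global__ void ArraySumAtomic(const float *array, int n)
{
    int index = threadIdx.x;
    if (index < n) {
        atomicAdd(&atomicSum, array[index]);
    }
}

// Hypothetical host helper: sums n floats using a single block.
// Returns -1.0f if n is out of range or any CUDA call fails.
float sumWithAtomics(const float *h_array, int n)
{
    if (n <= 0 || n > 1024) return -1.0f;   // one block only in this sketch

    float *d_array = NULL;
    float h_sum = 0.0f;
    size_t bytes = sizeof(float) * n;

    if (cudaMalloc(&d_array, bytes) != cudaSuccess) return -1.0f;

    // Reset the accumulator, then copy the input to the device.
    cudaError_t status = cudaMemcpyToSymbol(atomicSum, &h_sum, sizeof(float));
    if (status == cudaSuccess)
        status = cudaMemcpy(d_array, h_array, bytes, cudaMemcpyHostToDevice);

    if (status == cudaSuccess) {
        ArraySumAtomic<<<1, n>>>(d_array, n);
        // Pass the symbol itself, not its address.
        status = cudaMemcpyFromSymbol(&h_sum, atomicSum, sizeof(float));
    }

    cudaFree(d_array);
    return (status == cudaSuccess) ? h_sum : -1.0f;
}
```

Atomics serialize the updates, so this is fine for 10 elements but is not how you would sum a large array.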
            
__global__ void ArraySum(float *array)
{
    int index = threadIdx.x;
    sum = sum + array[index];
    __syncthreads();
}
You are attempting a “Data Parallel Operation.” Mark Harris and others have already done a lot of research on this problem; look up Mark Harris and the CUDPP library.
            
You can also do it with minimal CUDA code using the Thrust library.
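For illustration, a minimal sketch of the Thrust approach. It assumes Thrust is available (it ships with the CUDA toolkit from 4.0 on); the helper name sumWithThrust is made up for this example.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Hypothetical helper: copies n floats to the GPU and sums them there.
float sumWithThrust(const float *h_array, int n)
{
    // The iterator-range constructor allocates device memory and copies the host data.
    thrust::device_vector<float> d_array(h_array, h_array + n);

    // thrust::reduce adds the elements by default; 0.0f is the initial value.
    return thrust::reduce(d_array.begin(), d_array.end(), 0.0f);
}
```

No kernel, cudaMalloc, or cudaMemcpy is written by hand here; the device_vector frees its memory when it goes out of scope.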
              
raghu, April 6, 2011, 6:53am
You need to perform a reduction operation to calculate the sum of all array elements.
As LSChien rightly pointed out, your current algorithm has a race condition.
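A reduction for an array this small could look roughly like the sketch below. It is a single-block, shared-memory tree reduction and assumes the array fits in one block of BLOCK threads; BLOCK, ArraySumReduce, and sumWithReduction are names invented for this example.

```cuda
#include <cuda_runtime.h>

#define BLOCK 16   // threads per block; must be a power of two and >= n in this sketch

// Single-block tree reduction: each step adds the upper half of the
// active range onto the lower half until one value remains.
__global__ void ArraySumReduce(const float *array, int n, float *result)
{
    __shared__ float partial[BLOCK];
    int tid = threadIdx.x;

    // One element per thread; threads past the end contribute 0.
    partial[tid] = (tid < n) ? array[tid] : 0.0f;
    __syncthreads();

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();   // every thread reaches this barrier on each step
    }

    if (tid == 0) {
        *result = partial[0];
    }
}

// Hypothetical host helper: returns the sum, or -1.0f if n is out of
// range or a CUDA call fails.
float sumWithReduction(const float *h_array, int n)
{
    if (n <= 0 || n > BLOCK) return -1.0f;

    float *d_array = NULL, *d_result = NULL;
    float h_sum = -1.0f;
    size_t bytes = sizeof(float) * n;

    if (cudaMalloc(&d_array, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_result, sizeof(float)) != cudaSuccess) {
        cudaFree(d_array);
        return -1.0f;
    }

    cudaError_t status = cudaMemcpy(d_array, h_array, bytes, cudaMemcpyHostToDevice);
    if (status == cudaSuccess) {
        ArraySumReduce<<<1, BLOCK>>>(d_array, n, d_result);
        status = cudaMemcpy(&h_sum, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_array);
    cudaFree(d_result);
    return (status == cudaSuccess) ? h_sum : -1.0f;
}
```

Each thread writes only to its own slot of partial within a step, and __syncthreads() separates the steps, so there is no race. Larger arrays need multiple blocks and a second pass over the per-block results.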