Error while adding in CUDA/C++. Can you help, please?


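// *****************************

__global__ void cuda_sum(int *dev_A, int *dev_sum)
{
    int i = blockIdx.x;
    dev_sum[0] += dev_A[i];   // this line is missing from the post; presumably the plain, unguarded accumulation
}

// ******************************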

Why does dev_sum[0] not give the sum of the elements of dev_A?

Thank you.

// You can find a test here:

#include <cstdio>

int main()
{
    int A[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};

    int sum[1] = {0};
    int *dev_A, *dev_sum;

    cudaMalloc((void**)&dev_A, 10 * sizeof(int));
    cudaMemcpy(dev_A, A, 10 * sizeof(int), cudaMemcpyHostToDevice);

    cudaMalloc((void**)&dev_sum, sizeof(int));
    cudaMemcpy(dev_sum, sum, 1 * sizeof(int), cudaMemcpyHostToDevice);

    cuda_sum<<<10,1>>>(dev_A, dev_sum);   // expected: dev_sum[0] = dev_A[0]+dev_A[1]+ ... +dev_A[9]

    cudaMemcpy(sum, dev_sum, 1 * sizeof(int), cudaMemcpyDeviceToHost);

    printf("   55  ...  %5d \n", sum[0]);

    cudaFree(dev_A);
    cudaFree(dev_sum);
    return 0;
}


Your code has a race condition, as multiple threads attempt to modify the same object. You would want to look into using atomic adds. There should be a simple demo app for that among the samples distributed with CUDA. Also, check the CUDA Programming Guide.

Thank you.
I would actually like to know whether this can be done with synchronization or something similar, because I need it for other things…

As mentioned before, you need to use atomicAdd. You have to replace the line in the kernel that adds dev_A[i] to dev_sum[0] with a call of the form atomicAdd(&dev_sum[0], dev_A[i]).

(This call works as written only if dev_sum[0] is an int or a float; see the CUDA Programming Guide for double.)
This makes sure that the read-modify-write of dev_sum[0] by one thread is finished before any other thread can access it.
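
To make the atomicAdd suggestion concrete, here is a minimal sketch based on the kernel and launch from the question. The names cuda_sum_atomic and sum_on_device, and the extra length parameter n, are my own assumptions for illustration and are not from the original post; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Same layout as the question's launch <<<10,1>>>: one block per element.
// The parameter n is an added assumption so the kernel can bounds-check.
__global__ void cuda_sum_atomic(const int *dev_A, int *dev_sum, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // atomicAdd does the read-modify-write of dev_sum[0] as one
        // indivisible operation, so concurrent updates are not lost.
        atomicAdd(&dev_sum[0], dev_A[i]);
    }
}

// Hypothetical host helper: copies A to the device, runs the kernel,
// and returns the accumulated sum. Assumes n > 0.
int sum_on_device(const int *A, int n)
{
    int *dev_A = nullptr, *dev_sum = nullptr;
    int sum = 0;

    cudaMalloc((void**)&dev_A, n * sizeof(int));
    cudaMalloc((void**)&dev_sum, sizeof(int));
    cudaMemcpy(dev_A, A, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_sum, &sum, sizeof(int), cudaMemcpyHostToDevice);

    cuda_sum_atomic<<<n, 1>>>(dev_A, dev_sum, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_A);
    cudaFree(dev_sum);
    return sum;
}
```

Called with the array from the question, sum_on_device(A, 10) should return 55. All ten updates are serialized on one address, so this is correct but not fast for large inputs.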

For a reduction problem the best approach is presented in this document:

This document shows speed-up results for older cards which do not have an L1 cache, so the relative speed-ups might be different on Fermi or newer cards, but it is a good start for learning how to use shared memory and how to optimize code in general.
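
As a rough illustration of the shared-memory approach (and of synchronization inside a block with __syncthreads), here is a simplified sketch. It is not the optimized code from the referenced document: it is a plain tree reduction per block, followed by one atomicAdd per block. The names block_reduce_sum, reduce_on_device and TPB are assumptions made for this example, and the kernel assumes it is launched with exactly TPB threads per block.

```cuda
#include <cuda_runtime.h>

#define TPB 256  // threads per block; assumed to be a power of two

// Each block sums its slice of dev_A in shared memory, then thread 0
// adds the block's partial sum to dev_sum with a single atomicAdd.
__global__ void block_reduce_sum(const int *dev_A, int *dev_sum, int n)
{
    __shared__ int partial[TPB];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 so the tree below stays uniform.
    partial[tid] = (i < n) ? dev_A[i] : 0;
    __syncthreads();  // all loads must be visible before reducing

    // Halve the number of active threads in every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();  // finish this step before starting the next
    }

    if (tid == 0) {
        atomicAdd(dev_sum, partial[0]);
    }
}

// Hypothetical host helper; assumes n > 0 and omits error checking.
int reduce_on_device(const int *A, int n)
{
    int *dev_A = nullptr, *dev_sum = nullptr;
    int sum = 0;

    cudaMalloc((void**)&dev_A, n * sizeof(int));
    cudaMalloc((void**)&dev_sum, sizeof(int));
    cudaMemcpy(dev_A, A, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_sum, &sum, sizeof(int), cudaMemcpyHostToDevice);

    int blocks = (n + TPB - 1) / TPB;
    block_reduce_sum<<<blocks, TPB>>>(dev_A, dev_sum, n);

    cudaMemcpy(&sum, dev_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_A);
    cudaFree(dev_sum);
    return sum;
}
```

Note that __syncthreads only synchronizes the threads of one block, which is why the partial sums of different blocks are still combined with atomicAdd here.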

Hello, and thank you very much.
Actually, I need to use it for the following function… I hope that you can help me.

Let:

#include <algorithm>   // std::fill

//**** the constants
const int N = 1024;
const int NQ = N * N;

// ******** arrays
float f[N], J[N];
int Q[2*NQ]; // the elements of Q are in {0,1,…,(N-1)}

void Host_funtion(int *Q, float *f, float *J)
{
    std::fill(J, J + N, 0.0f);   // J has N elements

    for (int i = 0; i < NQ; i++)
    {
        int q = 2 * i;
        float Jq = f[ Q[q+0] ] + f[ Q[q+1] ];
        J[ Q[q+0] ] -= Jq;
        J[ Q[q+1] ] += Jq;
    }
}



It appears to me that the problem is simpler than I thought. If I understood correctly, only 2 elements of Q are used per iteration. So I would have thread i load the elements into registers, do the operation, and then write the result back to global memory. For example, here is some code (untested):

__global__ void my_kernel(float *dev_J, float *dev_f, int2 *dev_Q, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
    {
        int2 locq = dev_Q[idx];
        // the next 2 lines need to be improved by using textures or shared memory, depending on the access pattern given by the dev_Q array.
        // (dev_J is updated in place by atomicAdd below, so it is not loaded into registers here)
        float feve = dev_f[locq.x];
        float fodd = dev_f[locq.y];

        float locJq = feve + fodd;
        // I assume race conditions here, so I would use atomic add (otherwise it is not needed)
        atomicAdd(&dev_J[locq.x], -locJq);
        atomicAdd(&dev_J[locq.y], locJq);
    }
}

// in the main function you need to allocate memory for dev_J, dev_f and dev_Q (and set dev_J to zero)
// use the following line for running the kernel
my_kernel<<<(NQ + tpb - 1) / tpb, tpb>>>(dev_J, dev_f, (int2 *)dev_Q, NQ); // I cast the dev_Q vector to int2 for faster access.

tpb (threads per block) can be adjusted between 32 and 1024 for maximum performance (it is a tuning parameter).
This code should give the right results, and unless more is known about the Q array I cannot do more to optimize it. Textures might help, or, if you are lucky, shared memory, provided the Q indices are not totally random.

I did not test the code, but it should work.
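
For the host side, here is a possible sketch of the allocation, the copies, and the launch, together with a small self-contained version of the kernel so that the example compiles on its own. The names scatter_pairs and scatter_on_device are assumptions made for this example, error checking is omitted, and reading pairs of ints as int2 assumes that Q is stored as consecutive (first, second) index pairs, as in the host function above.

```cuda
#include <cuda_runtime.h>

// For each index pair (x, y) in dev_Q: Jq = f[x] + f[y]; J[x] -= Jq; J[y] += Jq.
// atomicAdd is used because several pairs may refer to the same element of J.
__global__ void scatter_pairs(float *dev_J, const float *dev_f,
                              const int2 *dev_Q, int nq)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < nq) {
        int2 q = dev_Q[idx];
        float Jq = dev_f[q.x] + dev_f[q.y];
        atomicAdd(&dev_J[q.x], -Jq);
        atomicAdd(&dev_J[q.y], Jq);
    }
}

// Hypothetical host helper.
// Q: 2*nq indices in [0, n), f: n inputs, J: n outputs (overwritten).
// Assumes n > 0, nq > 0 and 1 <= tpb <= 1024.
void scatter_on_device(const int *Q, const float *f, float *J,
                       int n, int nq, int tpb)
{
    float *dev_J = nullptr, *dev_f = nullptr;
    int2 *dev_Q = nullptr;

    cudaMalloc((void**)&dev_J, n * sizeof(float));
    cudaMalloc((void**)&dev_f, n * sizeof(float));
    cudaMalloc((void**)&dev_Q, nq * sizeof(int2));

    cudaMemset(dev_J, 0, n * sizeof(float));  // same role as std::fill on the host
    cudaMemcpy(dev_f, f, n * sizeof(float), cudaMemcpyHostToDevice);
    // Two consecutive ints on the host form one int2 on the device.
    cudaMemcpy(dev_Q, Q, 2 * nq * sizeof(int), cudaMemcpyHostToDevice);

    scatter_pairs<<<(nq + tpb - 1) / tpb, tpb>>>(dev_J, dev_f, dev_Q, nq);

    cudaMemcpy(J, dev_J, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_J);
    cudaFree(dev_f);
    cudaFree(dev_Q);
}
```

Because floating-point addition is not associative and the order of the atomic updates is not fixed, the result can differ from the CPU loop in the last bits, so a comparison against the host function should use a small tolerance.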