writing hazard


I am trying to do a summation. Here is the example code, translated directly from C to CUDA:

__global__ static void Summation_kernel(
    float *d_A,
    float *d_B,
    int   *d_indx)
{
    const int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    const int max_tid = 1000;
    const int step    = blockDim.x * gridDim.x;

    for (int i = tid; i < max_tid; i += step) {
        int k = d_indx[i];
        d_A[k] += d_B[i];
    }
}


This kernel doesn’t work properly: I only get the correct answer when I use one thread per block and one block per grid. Is there any method to solve this write hazard problem in CUDA? Thanks for any suggestions.


Hi Xuejun,

Can you send the complete code, with main()? Although I am also new to CUDA, maybe I can research it and provide you some pointers. Basically, I need the complete code that you will be compiling.


In other words, it does compute something, just not what you wanted? In that case, one would need to know what you are trying to compute in order to help.

Without any more info, I can see one potential problem:

k = d_indx[i];

This k must be unique for each thread; since you’re using +=, I’m assuming it isn’t. When two threads pick up the same k, their read-modify-write of d_A[k] can interleave, and one of the updates is lost.

Unless you wait for each thread to write to

d_A[k] += d_B[i];

before another thread reads that same element, you will encounter problems. (Serializing the threads here would result in VERY poor performance.)
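One alternative, assuming your GPU supports it, is to make the colliding update itself atomic: atomicAdd on float is available on devices of compute capability 2.0 and later, and it serializes only the updates that actually hit the same element. A sketch of the kernel rewritten that way (kernel name is mine, for illustration):

```cuda
// Sketch: same indexed summation, but with the read-modify-write
// on d_A[k] made atomic. Requires compute capability >= 2.0 for
// atomicAdd on float.
__global__ static void Summation_kernel_atomic(
    float *d_A,
    float *d_B,
    int   *d_indx)
{
    const int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    const int max_tid = 1000;
    const int step    = blockDim.x * gridDim.x;

    for (int i = tid; i < max_tid; i += step) {
        int k = d_indx[i];
        // atomicAdd serializes only the threads that collide on the
        // same d_A[k]; threads writing different elements still run
        // fully in parallel.
        atomicAdd(&d_A[k], d_B[i]);
    }
}
```

Note that floating-point atomics make the order of the additions nondeterministic, so the result may differ from a serial sum in the last bits from run to run.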

Look at “reduction” in the SDK examples, it should help you solve your problem - safely!
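For reference, the core idea of the SDK reduction sample is a tree sum in shared memory, where no two threads ever write the same location in the same step. A minimal sketch of that pattern (my own simplified version, assuming a power-of-two block size; each block produces one partial sum, which you would then sum per index bin or in a second pass):

```cuda
// Minimal block-level reduction sketch: each block sums a chunk of
// d_in into shared memory, halving the number of active threads
// each step, then writes one partial sum per block.
// Launch with sharedMemBytes = blockDim.x * sizeof(float).
__global__ void reduce_kernel(const float *d_in, float *d_out, int n)
{
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const int i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread (zero-pad past the end).
    sdata[tid] = (i < n) ? d_in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: at each step, thread tid adds sdata[tid + s]
    // into sdata[tid], so every shared-memory slot has exactly one
    // writer per step -- no write hazard.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        d_out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```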