I am trying to do an indexed summation. Here is the example code, translated directly from C to CUDA:
__global__ static void Summation_kernel(
    float *d_A,
    float *d_B,
    int *d_indx
){
    const int tid = blockIdx.x*blockDim.x + threadIdx.x;
    const int max_tid = 1000;
    const int step = blockDim.x * gridDim.x;
    int k;
    for(int i = tid; i < max_tid; i += step){
        k = d_indx[i];
        d_A[k] += d_B[i];
    }
}
This kernel doesn't work properly. I only get the correct answer when I use one thread per block and one block per grid. Is there any way to solve this write hazard in CUDA? Thanks for any suggestions.
Can you send the complete code, including main()? I am also new to CUDA, but I can look into it and may be able to give you some pointers. Basically, I need the complete code that you are compiling.
In other words, it does compute something, just not what you wanted? In that case, one would need to know what you are trying to compute in order to help.
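If the intent is a scatter-add, where several values of i can map to the same k = d_indx[i], then d_A[k] += d_B[i] is an unsynchronized read-modify-write and concurrent threads can lose updates. One common way around that is atomicAdd, which supports float on devices of compute capability 2.0 and later. The following is only a minimal sketch under that assumption. The names Summation_kernel_atomic and scatter_add_demo, the extra parameter n, the bin count of 10, and the test data are all made up for illustration and are not from the original post.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Scatter-add with a grid-stride loop. Several i may share the same
// k = d_indx[i], so the update of d_A[k] is done atomically.
__global__ void Summation_kernel_atomic(float *d_A, const float *d_B,
                                        const int *d_indx, int n)
{
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int step = blockDim.x * gridDim.x;
    for (int i = tid; i < n; i += step) {
        atomicAdd(&d_A[d_indx[i]], d_B[i]);
    }
}

// Hypothetical host-side check: runs the kernel on made-up data and returns
// the largest absolute difference from a serial CPU sum (-1 on a CUDA error).
float scatter_add_demo()
{
    const int n = 1000;   // same element count as max_tid in the question
    const int bins = 10;  // assumed length of d_A
    std::vector<float> h_B(n), h_A(bins, 0.0f), h_ref(bins, 0.0f);
    std::vector<int> h_indx(n);
    for (int i = 0; i < n; ++i) {
        h_indx[i] = i % bins;
        h_B[i] = static_cast<float>(i % 7);  // small integers: sums are exact in float
        h_ref[h_indx[i]] += h_B[i];          // serial reference result
    }

    float *d_A = nullptr, *d_B = nullptr;
    int *d_indx = nullptr;
    cudaMalloc(&d_A, bins * sizeof(float));
    cudaMalloc(&d_B, n * sizeof(float));
    cudaMalloc(&d_indx, n * sizeof(int));
    cudaMemset(d_A, 0, bins * sizeof(float));
    cudaMemcpy(d_B, h_B.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_indx, h_indx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    Summation_kernel_atomic<<<4, 128>>>(d_A, d_B, d_indx, n);
    cudaError_t launch_err = cudaGetLastError();

    // Blocking copy; it also waits for the kernel to finish.
    cudaError_t copy_err = cudaMemcpy(h_A.data(), d_A, bins * sizeof(float),
                                      cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_indx);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1.0f;

    float max_err = 0.0f;
    for (int k = 0; k < bins; ++k) {
        max_err = std::fmax(max_err, std::fabs(h_A[k] - h_ref[k]));
    }
    return max_err;
}
```

Calling scatter_add_demo() from main() and printing the result is enough to try it. Note that atomics fix lost updates but not ordering: with general float data the additions happen in a nondeterministic order, so the result can differ from the serial sum in the last bits. If many indices collide, atomics can also serialize heavily, and a sort-then-reduce approach may be worth considering instead.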