I wrote a kernel implementing the forward propagation of a neural network, which calls the atomicAdd function several times. I initialize the network to zero, set the input layer values to 2.0, and set all weights between neurons to 1.0. I'm sure I've done the cudaMalloc and cudaMemcpy work. However, the output layer was always all zeros, unless I inserted a printf() to see what the kernel calculated – then nonzero values were printed, but parts of the final output layer still equaled zero.
__global__ void goForwardGPU(double* single_piece, double* xs, double* vals, bool* availables)
{   // GPU forward pass
    // input layer -> hidden layer 1
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < layerInput; i += gridDim.x * blockDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < layer2; j += gridDim.y * blockDim.y) {
            atomicAdd(&xs[layerInput + j], availables[j] * single_piece[i] * vals[i * layer2 + j]);
            // printf("%lf\n", xs[layerInput + j]);
        }
    }
    // hidden layer1 -> hidden layer2 ...
    // ...
}
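For reference, this is roughly how I launch the kernel and copy the result back. It's a simplified sketch, not my exact code: the grid/block sizes, the device buffer names (d_single_piece, d_xs, etc.), and the CHECK macro are placeholders I wrote for this post. Note that atomicAdd on double requires compute capability 6.0 or higher, so I compile with -arch=sm_60.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative error-checking macro (not in my real code)
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(1);                                                   \
        }                                                              \
    } while (0)

// ... device allocations and host-to-device copies elsewhere ...

dim3 block(16, 16);   // placeholder launch configuration
dim3 grid(8, 8);
goForwardGPU<<<grid, block>>>(d_single_piece, d_xs, d_vals, d_availables);
CHECK(cudaGetLastError());       // catches launch/configuration errors
CHECK(cudaDeviceSynchronize());  // catches errors raised during execution
CHECK(cudaMemcpy(h_xs, d_xs, bytes, cudaMemcpyDeviceToHost));
```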
What's wrong with my use of atomicAdd, or is there some other problem?