Why does a kernel that uses atomic functions return an incorrect result unless I insert a printf() to check it?

I wrote a kernel that implements forward propagation of a neural network and calls atomicAdd several times. I initialize every neuron value to zero, set the input layer to 2.0, and set all the weights between neurons to 1.0. I'm sure the cudaMalloc and cudaMemcpy calls are done correctly. However, the output layer always came back as all zeros unless I inserted a printf() to see what the kernel calculated. With the printf() in place, nonzero values were printed, but parts of the final output layer were still zero.

__global__ void goForwardGPU(double* single_piece, double* xs, double* vals, bool* availables)
{ // GPU go forward
    // input layer -> hidden layer1
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < layerInput; i += gridDim.x * blockDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < layer2; j += gridDim.y * blockDim.y) {
            atomicAdd(&xs[layerInput + j], availables[j] * single_piece[i] * vals[i * layer2 + j]);
            // printf("%lf\n", xs[layerInput + j]);
        }
    }
    // hidden layer1 -> hidden layer2...
    // ...
}
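
For reference, here is a minimal, self-contained sketch of the first layer only, with the host side written out so that launch errors and synchronization can be ruled out. It is not my full program: the layer sizes (4 and 3), the CUDA_CHECK macro, the runForward() helper, and the launch configuration are made up for illustration, and it assumes a GPU of compute capability 6.0 or newer, since atomicAdd on double is only available there (build with something like nvcc -arch=sm_60).

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical layer sizes; the real ones are not shown in the post.
constexpr int layerInput = 4;
constexpr int layer2 = 3;

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Same first-layer loop as in the question, without the printf.
__global__ void goForwardGPU(double* single_piece, double* xs, double* vals, bool* availables)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < layerInput; i += gridDim.x * blockDim.x) {
        for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < layer2; j += gridDim.y * blockDim.y) {
            atomicAdd(&xs[layerInput + j], availables[j] * single_piece[i] * vals[i * layer2 + j]);
        }
    }
}

// Runs the input -> hidden layer 1 step and returns the hidden-layer values.
std::vector<double> runForward()
{
    std::vector<double> input(layerInput, 2.0);             // input layer = 2.0
    std::vector<double> weights(layerInput * layer2, 1.0);  // all weights = 1.0
    bool enabled[layer2];
    for (int j = 0; j < layer2; ++j) enabled[j] = true;     // assumption: every neuron enabled

    const size_t xsCount = layerInput + layer2;
    double *d_input = nullptr, *d_xs = nullptr, *d_vals = nullptr;
    bool* d_avail = nullptr;
    CUDA_CHECK(cudaMalloc(&d_input, layerInput * sizeof(double)));
    CUDA_CHECK(cudaMalloc(&d_xs, xsCount * sizeof(double)));
    CUDA_CHECK(cudaMalloc(&d_vals, weights.size() * sizeof(double)));
    CUDA_CHECK(cudaMalloc(&d_avail, layer2 * sizeof(bool)));

    // All-zero bytes represent 0.0 for IEEE doubles, so cudaMemset clears xs.
    CUDA_CHECK(cudaMemset(d_xs, 0, xsCount * sizeof(double)));
    CUDA_CHECK(cudaMemcpy(d_input, input.data(), layerInput * sizeof(double), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_vals, weights.data(), weights.size() * sizeof(double), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_avail, enabled, layer2 * sizeof(bool), cudaMemcpyHostToDevice));

    dim3 block(16, 16);  // x indexes input neurons, y indexes hidden neurons
    dim3 grid(1, 1);
    goForwardGPU<<<grid, block>>>(d_input, d_xs, d_vals, d_avail);
    CUDA_CHECK(cudaGetLastError());       // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    std::vector<double> hidden(layer2);
    CUDA_CHECK(cudaMemcpy(hidden.data(), d_xs + layerInput, layer2 * sizeof(double), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaFree(d_input));
    CUDA_CHECK(cudaFree(d_xs));
    CUDA_CHECK(cudaFree(d_vals));
    CUDA_CHECK(cudaFree(d_avail));
    return hidden;
}

int main()
{
    for (double v : runForward()) printf("%f\n", v);
    return 0;
}
```

With these made-up sizes, each hidden neuron should end up as layerInput * 2.0 * 1.0, since every input contributes 2.0 through a weight of 1.0. Note that a printf() placed inside the inner loop reads xs while other threads may still be adding to it, so the values it prints can be partial sums rather than final results.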

Is something wrong with how I use atomicAdd, or is there some other problem?