It looks like you have two problems. The first is that all threads in a block will try to execute this statement at the same time, which is a race condition:
c_n[i] = c_n[i] + 1.0;
If the threads performed the increment one at a time (which might happen in emulation mode), c_n[i] would end up incremented by the number of threads in the block. But if all threads in a warp load the same value of c_n[i], increment it, and then store the new value, the element is incremented only once per warp. It is also possible for one warp to load the old value between another warp's load and store, in which case one of the two updates is lost and the total increment is smaller still. Which of these happens depends entirely on how the threads happen to be scheduled.
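The lost updates described above can be observed with a minimal sketch like the following. The kernel name racy_increment, the host helper run_racy_increment, and the single-element setup are assumptions made for illustration; they are not taken from your code. The result is deliberately not predictable, since the kernel contains the same unsynchronized read-modify-write.

```cuda
#include <cuda_runtime.h>

// Every thread performs an unsynchronized read-modify-write on the same
// element, mirroring c_n[i] = c_n[i] + 1.0 from the question.
__global__ void racy_increment(float *c_n, int i)
{
    c_n[i] = c_n[i] + 1.0f;
}

// Hypothetical host helper: launches ONE block of `threads` threads and
// returns the final value of the element, or -1.0f on a CUDA error.
float run_racy_increment(int threads)
{
    float *d_c = nullptr;
    float h_c = -1.0f;

    if (cudaMalloc(&d_c, sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(d_c, 0, sizeof(float));  // all-zero bytes represent 0.0f

    racy_increment<<<1, threads>>>(d_c, 0);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&h_c, d_c, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_c);
    return (err == cudaSuccess) ? h_c : -1.0f;
}
```

Calling run_racy_increment(256) from host code will usually return far less than 256, and the exact value can change from run to run or from one GPU to another.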
The other potential problem is essentially the same one at the block level. Two or more blocks could load the values, update them separately, and then write their results back, so some of the updates are lost. It’s also possible for one block to read the values while another block is partway through writing them, so the reader sees a mix of old and new values that are not even consistent with each other.
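One common way to avoid losing increments at both the thread and the block level is an atomic read-modify-write. The sketch below assumes that c_n is a float array in global memory and that the device supports atomicAdd on float (compute capability 2.0 or later); the names atomic_increment and run_atomic_increment are hypothetical. It only addresses the lost updates. It does not by itself help a block that reads several elements while another block is still updating them.

```cuda
#include <cuda_runtime.h>

// Each thread adds 1.0f to the same element. atomicAdd performs the load,
// add, and store as one indivisible operation, so no increment is lost,
// whether the competing threads are in the same block or different blocks.
__global__ void atomic_increment(float *c_n, int i)
{
    atomicAdd(&c_n[i], 1.0f);
}

// Hypothetical host helper: launches `blocks` x `threads` threads and
// returns the final value of the element, or -1.0f on a CUDA error.
float run_atomic_increment(int blocks, int threads)
{
    float *d_c = nullptr;
    float h_c = -1.0f;

    if (cudaMalloc(&d_c, sizeof(float)) != cudaSuccess) {
        return -1.0f;
    }
    cudaMemset(d_c, 0, sizeof(float));  // all-zero bytes represent 0.0f

    atomic_increment<<<blocks, threads>>>(d_c, 0);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(&h_c, d_c, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_c);
    return (err == cudaSuccess) ? h_c : -1.0f;
}
```

With 64 blocks of 256 threads, run_atomic_increment(64, 256) should return 16384, one increment per thread. The trade-off is that atomic updates to a single address are serialized, so heavy contention on one element can be slow.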