Writing/reading to shared global address


What’s wrong with this code:

__global__ void
count_threads_k(uint32_t* count)
{
	*count = *count + 1;
}

void
count_threads(uint32_t* count, int g, int t)
{
	count_threads_k<<<g, t>>>(count);
}

In an emu build the resulting value in count is g * t (e.g. with g = 2 and t = 10, count = 20). In the release build I don’t get the value I expect. Do I have to lock reads and writes to the memory pointed to by count within the kernel?


The problem is that your computation consists of multiple operations: read count, add 1, write count.
These operations are executed in parallel on the GPU, so threads can read count before other threads have written their updated value back.

For example, with g = 1, t = 16, count = 0:

  1. 16 threads read count (0)
  2. 16 threads add 1 -> each thread now holds 1
  3. 16 threads write count (1) (the writes are serialized, but every thread writes the same value)

In the end count is 1 and not 16.
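
To make the three steps explicit, here is a sketch of what the one-line statement in the kernel roughly amounts to. The kernel name count_threads_expanded_k and the variable local are made up for illustration, and the instruction sequence the compiler actually emits may differ:

```cuda
#include <cstdint>

// Illustration only: same behaviour as "*count = *count + 1;",
// written out as separate read, add and write steps.
// count_threads_expanded_k and local are hypothetical names.
__global__ void
count_threads_expanded_k(uint32_t* count)
{
	uint32_t local = *count;  // 1. read: many threads may see the same old value
	local = local + 1;        // 2. add: each thread increments its own private copy
	*count = local;           // 3. write: a later write overwrites an earlier one
}
```

Nothing stops another thread from doing its read between one thread's read and write, so increments can be lost. Only with a single thread (<<<1, 1>>>) is the result predictable.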

Yep, you’re getting multithreaded read and write collisions.

Look at the “reduction” example in the SDK samples to see how you can do sums in parallel.
Alternatively (easier, but less efficient), you can use global atomic functions such as atomicAdd to make sure each thread’s contribution is counted.
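
A minimal sketch of both suggestions, not taken from the SDK sample. The first kernel does one global atomicAdd per thread. The second sums within each block in shared memory and then does a single atomicAdd per block, which is the general idea behind a reduction. All kernel names and the host helper run_count are made up for this example, the counter is declared as unsigned int to match the atomicAdd overload, and error checking is left out to keep it short:

```cuda
#include <cuda_runtime.h>

// Option 1: every thread adds its contribution atomically.
// The read-add-write happens as one indivisible operation.
__global__ void
count_threads_atomic_k(unsigned int* count)
{
	atomicAdd(count, 1u);
}

// Option 2: sum inside the block first, then one atomic per block.
// Expects blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory.
__global__ void
count_threads_reduce_k(unsigned int* count)
{
	extern __shared__ unsigned int partial[];
	unsigned int tid = threadIdx.x;

	partial[tid] = 1u;  // each thread contributes 1
	__syncthreads();

	// Fold the upper part of the array onto the lower part until one
	// element is left. Works for any block size, not only powers of two.
	for (unsigned int n = blockDim.x; n > 1; ) {
		unsigned int half = (n + 1) / 2;
		if (tid < n / 2)
			partial[tid] += partial[tid + half];
		__syncthreads();
		n = half;
	}

	if (tid == 0)
		atomicAdd(count, partial[0]);  // one global atomic per block
}

// Hypothetical host helper: runs one of the kernels on a zeroed
// device counter and returns the final count.
unsigned int
run_count(bool use_reduction, int g, int t)
{
	unsigned int* d_count = nullptr;
	unsigned int h_count = 0;

	cudaMalloc(&d_count, sizeof(unsigned int));
	cudaMemset(d_count, 0, sizeof(unsigned int));

	if (use_reduction)
		count_threads_reduce_k<<<g, t, t * sizeof(unsigned int)>>>(d_count);
	else
		count_threads_atomic_k<<<g, t>>>(d_count);

	// cudaMemcpy waits for the kernel to finish before copying.
	cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
	cudaFree(d_count);
	return h_count;
}
```

With g = 2 and t = 10 as in the question, both run_count(false, 2, 10) and run_count(true, 2, 10) should return 20. The second version issues far fewer global atomics, which matters more as the number of threads grows.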