In an emu build the resulting value in count is g * t (e.g. if I let g = 2 and t = 10, then count = 20). In the release build I don’t get the value I expect. Do I have to lock reads and writes to the space pointed to by count within the kernel?
The problem is that your computation consists of multiple operations, for example: read count, add 1, write count.
These operations are executed in parallel on the GPU, so two threads can read the same old value of count and one of the increments is lost.
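To make the lost update concrete, here is a minimal sketch of the kind of kernel that shows this behaviour. The original kernel was not posted, so `count_racy` and `run_count_racy` are hypothetical names and the body is only a guess at what it does. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical reconstruction: every thread increments one shared counter.
__global__ void count_racy(unsigned int *count)
{
    // This single statement is a load, an add, and a store.
    // Nothing orders those steps between threads, so increments can be lost.
    *count += 1;
}

// Hypothetical host helper: launches g blocks of t threads and
// returns whatever value the counter ends up with.
unsigned int run_count_racy(int g, int t)
{
    unsigned int *d_count = nullptr;
    unsigned int h_count = 0;

    cudaMalloc((void **)&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    count_racy<<<g, t>>>(d_count);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```

On real hardware `run_count_racy(2, 10)` can return anything from 1 to 20, and the value may change from run to run.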
Yep, you’re getting multithreaded read and write collisions.
Look at the “reduction” example in the SDK samples to see how you can do sums in parallel.
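For a rough idea of what a parallel sum looks like, here is a simplified sketch. It is not the code from the SDK sample, which is considerably more optimized. The names `block_sum`, `reduce_sum` and `BLOCK_SIZE` are assumptions made for this example, the block size is assumed to be a power of two, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two

// Each block sums its slice of the input in shared memory and
// writes one partial sum per block.
__global__ void block_sum(const unsigned int *in, unsigned int *partial, int n)
{
    __shared__ unsigned int s[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Threads past the end of the input contribute zero.
    s[tid] = (i < n) ? in[i] : 0u;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums n values (n > 0) and adds up the
// per-block partial sums on the CPU.
unsigned int reduce_sum(const unsigned int *h_in, int n)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    unsigned int *d_in = nullptr;
    unsigned int *d_partial = nullptr;

    cudaMalloc((void **)&d_in, n * sizeof(unsigned int));
    cudaMalloc((void **)&d_partial, blocks * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in, n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);

    std::vector<unsigned int> h_partial(blocks);
    cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    unsigned int total = 0;
    for (unsigned int p : h_partial)
        total += p;
    return total;
}
```

No two threads ever write to the same location at the same step, so there is nothing to lock. For the counting case, each thread's contribution would be the 1 it wanted to add.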
Alternatively (easier, but less efficient), you can use global atomic functions such as atomicAdd to make sure each thread’s contribution is counted.
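A minimal sketch of the atomic version, assuming the kernel only needs to add 1 per thread to a single counter in global memory. `count_atomic` and `run_count_atomic` are hypothetical names, and error checking is left out. The device needs to support global atomics (compute capability 1.1 or later).

```cuda
#include <cuda_runtime.h>

// Every thread adds 1 to the counter. atomicAdd performs the
// read-modify-write as one indivisible operation, so no update is lost.
__global__ void count_atomic(unsigned int *count)
{
    atomicAdd(count, 1u);
}

// Hypothetical host helper: launches g blocks of t threads and
// returns the final counter value.
unsigned int run_count_atomic(int g, int t)
{
    unsigned int *d_count = nullptr;
    unsigned int h_count = 0;

    cudaMalloc((void **)&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    count_atomic<<<g, t>>>(d_count);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```

With g = 2 and t = 10, `run_count_atomic(2, 10)` should return 20. The cost is that all threads are serialized on one address, which is why this is the less efficient option.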