I’m new to CUDA, and just wanted to write a simple program that allocates memory for an integer on the device, then runs a kernel with a grid of 256 blocks and 256 threads per block.
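For reference, here’s a minimal sketch of the kind of program I mean (the kernel and variable names are just illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread reads *value, increments its private copy, and writes it
// back. These three steps are not atomic, so threads overwrite each
// other's updates and the final result is far less than 256 * 256.
__global__ void increment(int *value)
{
    *value = *value + 1;  // racy read-modify-write
}

int main()
{
    int *d_value;
    int h_value = 0;

    cudaMalloc(&d_value, sizeof(int));
    cudaMemcpy(d_value, &h_value, sizeof(int), cudaMemcpyHostToDevice);

    increment<<<256, 256>>>(d_value);  // 65,536 threads racing on one int
    cudaDeviceSynchronize();

    cudaMemcpy(&h_value, d_value, sizeof(int), cudaMemcpyDeviceToHost);
    printf("value = %d\n", h_value);   // small, run-dependent number

    cudaFree(d_value);
    return 0;
}
```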
There’s no guarantee that all threads will read/write to value sequentially. In fact, it’s almost guaranteed that they won’t, as you’ve experienced in your example. Furthermore, it’s only possible to synchronize threads within the same block. For inter-block synchronization, you need to relaunch kernels.
Cool, thanks. That makes perfect sense. Would the correct way to do it be to use a mutex to lock the variable, then increment, then unlock? Or does CUDA have a different way of doing this?
You have to imagine that all threads are running simultaneously. If they all read value at the same time, increment their own copy, and then write the temporary back, *value should read 1 :)
It is 22 in your case because not all threads actually run simultaneously, so some will run after others and increment the value there. If you run your app multiple times, you should randomly get different values because threads aren’t scheduled the same way every time.
There are no mutexes or such in CUDA. With ~20,000 threads running concurrently on the device, anything that serializes the execution would absolutely destroy the performance anyway. CUDA is best for data-parallel algorithms where each thread operates independently on its own piece of data (i.e. one array element).
For tasks like summing an array, you can perform a reduction (see the SDK example and whitepaper).
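To sketch the idea (heavily simplified from the SDK example; a real version loops over grid-sized chunks and then reduces the per-block partial sums):

```cuda
#define BLOCK_SIZE 256

// Each block sums BLOCK_SIZE elements of `in` into one partial sum.
__global__ void block_sum(const int *in, int *out, int n)
{
    __shared__ int sdata[BLOCK_SIZE];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

    // Load one element per thread into shared memory (0 past the end).
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

Note that __syncthreads() only synchronizes within a block, which is exactly why the partial sums have to be combined afterwards (either on the host or with another kernel launch).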
If you absolutely must have multiple threads access the same variable (knowing the performance implications if too many threads access the same variable at once…) you can use atomic operations. See the programming guide for all the various atomic operations. Using them requires compiling for the sm_11 or later architecture and a compute capability 1.1 or newer card (anything but the 8800 GTX/Tesla 800 series).
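For example, the racy increment from the original post becomes correct (though serialized at that one address) with atomicAdd:

```cuda
// atomicAdd makes the read-modify-write indivisible, so no updates are
// lost: after increment_atomic<<<256, 256>>>(d_value) with *d_value
// initially 0, it ends up exactly 256 * 256 = 65536.
// Requires compiling with -arch=sm_11 or later.
__global__ void increment_atomic(int *value)
{
    atomicAdd(value, 1);
}
```

Just keep in mind that all 65,536 updates hit the same address, so they execute one after another — fine for an occasional counter, terrible as the core of an algorithm.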