Parallel processing of same memory address(es)

Hello there,

I’m using CUDA 5 in C++ with a GTX 950M on Windows 10.

Just as a test, I allocated an array of doubles on the device with cudaMalloc, one element per thread (1024), and then I simply increment each element in a kernel with “variable[threadIdx.x] += 1”. The results are strangely incorrect: the sum over the whole array should be the number of blocks times the number of threads per block, but it is not. I guess that, maybe for optimization purposes, the same thread index in two different blocks can interfere?

More generally, my problem is that I need so much memory (a big 6-dimensional integer array of approximately 300 MB) that I cannot even allocate it once per thread. So can I somehow have all blocks/threads work on the same memory addresses? Maybe by wisely using a device_vector? (For example, with one thread that only manages some kind of memory transfer between the device_vector and the single big array, which would make it possible to progressively remove elements from the vector?)

In the short term, I would prefer a solution for CUDA 5, but if a solution only exists in later versions of CUDA (5.5 or later), I’m also interested.

Thanks a lot for any answers, and sorry if I look stupid; I’m new to this beautiful world of GPU processing.


Up, just in case. (I’m surprised that a question like this remains unanswered. I’m pretty sure this kind of problem must occur often, or am I totally wrong?)

variable[threadIdx.x] += 1

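This does not work if multiple blocks run concurrently. Threads with the same threadIdx.x in different blocks all read, modify, and write the same element without any synchronization, so updates get lost. It is a race condition. You need atomic operations.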
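
A minimal sketch of what that could look like, not a definitive implementation. It assumes the array holds doubles as in the question. Since a native atomicAdd on double only exists on devices of compute capability 6.0 and higher, the sketch builds one from atomicCAS, following the well-known compare-and-swap loop. The names atomicAddDouble, incrementAll and runIncrementDemo are made up for this example. Note that the single cudaMalloc buffer lives in global memory and is visible to every block, so it does not have to be allocated once per thread.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA built-in): atomic add on a double,
// emulated with a 64-bit compare-and-swap loop for older devices.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull;
    unsigned long long int assumed;
    do {
        assumed = old;
        // Write the new sum only if nobody changed the value in the meantime.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // another thread won the race: retry
    return __longlong_as_double(old);
}

// Every block updates the same elements; the atomic makes each update safe.
__global__ void incrementAll(double* variable)
{
    atomicAddDouble(&variable[threadIdx.x], 1.0);
}

// Launches the kernel and returns the sum of the array, or -1.0 on a CUDA error.
double runIncrementDemo(int numBlocks, int threadsPerBlock)
{
    double* d_variable = NULL;
    size_t bytes = (size_t)threadsPerBlock * sizeof(double);
    if (cudaMalloc((void**)&d_variable, bytes) != cudaSuccess) return -1.0;
    cudaMemset(d_variable, 0, bytes);  // all-zero bytes represent 0.0

    incrementAll<<<numBlocks, threadsPerBlock>>>(d_variable);

    std::vector<double> h_variable(threadsPerBlock);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_variable.data(), d_variable, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_variable);
    if (err != cudaSuccess) return -1.0;

    double sum = 0.0;
    for (size_t i = 0; i < h_variable.size(); ++i) sum += h_variable[i];
    return sum;
}
```

Calling runIncrementDemo(64, 1024) from main should then give a sum equal to blocks times threads. If the counters can be int or float instead of double, the built-in atomicAdd can be used directly and the helper is not needed.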


Thanks for your answer.