I’m using CUDA 5 in C++ with a GTX 950M on Windows 10.
Just for testing purposes, I allocated as many `double`s on the device (with `cudaMalloc`) as there are threads (1024), and then I simply increment them in a kernel with `variable[threadIdx.x] += 1`. But the results are strangely incorrect: the sum of all the array elements should equal the total number of blocks * threads, and it doesn't. I guess that, maybe for optimization purposes, threads with the same index in two different blocks can interfere?
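Here is a minimal sketch of what I'm doing (the names `incrementKernel` and `d_counts` and the block count are just illustrative, not my real code):

```cuda
// Each of the 1024 threads per block increments "its" slot, but a thread
// with the same threadIdx.x in a DIFFERENT block writes to the same address.
__global__ void incrementKernel(double *counts)
{
    counts[threadIdx.x] += 1;  // plain read-modify-write, not atomic
}

int main()
{
    const int threads = 1024;
    const int blocks  = 64;  // illustrative value
    double *d_counts;
    cudaMalloc(&d_counts, threads * sizeof(double));
    cudaMemset(d_counts, 0, threads * sizeof(double));

    incrementKernel<<<blocks, threads>>>(d_counts);
    cudaDeviceSynchronize();

    // Copying d_counts back and summing it here gives LESS than
    // blocks * threads, which is what surprises me.
    cudaFree(d_counts);
    return 0;
}
```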
More generally, my problem is that the memory I need (a big 6-dimensional integer array of roughly 300 MB) is far too large to allocate once per thread. So can I somehow share the same memory addresses across all blocks/threads? Maybe by using a `device_vector` wisely? (For example, with one thread that only manages some kind of memory transfer between the `device_vector` and the single big array, which would allow progressively removing elements from the vector?)
In the short term, I'd prefer a solution that works on CUDA 5, but if one only exists in later versions of CUDA (5.5 or later), I'm interested in that too.
Thanks a lot for any answer, and sorry if I sound stupid; I'm new to this beautiful world of GPU computing.