Hello,
I am writing a CUDA application that makes extensive use of shared memory and frequently transfers data between CPU memory and global GPU memory, and also between global GPU memory and shared memory. The results I get are correct as long as the number of threads per block is below 64. Past that, I get "strange" results that suggest to me something went wrong with the memory. To give an example: I have a value that should decrease by 2 on each execution, but after increasing the number of threads past 64 it decreases by 3 or 6 instead. It is still somewhat regular behavior, but wrong behavior nonetheless.
I am running CUDA in Visual Studio 2008 on 64-bit Windows 7, using the standard Debug x64 configuration from the template project. My GPU is a GeForce 8800 GT.
The threads run independent code, but they all read from a common shared memory area. Even with 256 threads I don't exceed the 16 KB of shared memory per block.
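To make the access pattern concrete, here is a stripped-down sketch (not my real kernel; the names and sizes are made up) of roughly how the threads stage data into shared memory and then read each other's elements:

```cuda
__global__ void stageAndRead(const int *in, int *out)
{
    __shared__ int buf[256];          // well under the 16 KB per-block limit
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Each thread writes one element of the common shared area...
    buf[tid] = in[gid];

    __syncthreads();                  // barrier between the write and the read

    // ...and then reads an element written by a DIFFERENT thread.
    out[gid] = buf[(tid + 1) % blockDim.x];
}
```

If a barrier like the `__syncthreads()` above is required between the write phase and the read phase, could missing it explain why things only break past 64 threads, i.e. once the block spans more than one or two warps?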
In emulated debug mode, everything runs fine.
Does anyone have an idea of what I could be doing wrong?
Thanks
Bogdan