Problem when using more than 64 threads per block


I am writing a CUDA application that makes extensive use of shared memory and frequently transfers data between global GPU memory and CPU memory, as well as between global GPU memory and shared GPU memory. The results I get are correct as long as the number of threads per block is below 64. Past that, I get “strange” results that suggest something is going wrong with memory (to give an example, I have a value that should decrease by 2 on each execution; after increasing the thread count past 64, that value decreases by 3 or 6 instead; it is still somewhat regular behavior, but wrong behavior nonetheless).

I am running CUDA in Visual Studio 2008 on 64-bit Windows 7, using the standard Debug x64 configuration from the template project. My GPU is a GeForce 8800 GT.
The threads run independent code, but they do read some common shared-memory areas.
Even with 256 threads I don’t exceed the 16 KB of shared memory.

In emulated debug mode, everything runs fine.

Does anyone have an idea of what I could be doing wrong?



Quite likely you are missing a __syncthreads() call somewhere to synchronize the threads.

Indeed, I do not use __syncthreads(), but why should I, given that I do not write to the same memory area concurrently, only read from it?


So, in theory I wasn’t writing to the same memory area concurrently, but in practice I was. I added __syncthreads() almost everywhere and the problem became obvious. Thank you.
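For anyone who runs into the same symptom: the classic pattern is one thread writing a shared-memory slot that another thread later reads. A minimal sketch (hypothetical kernel, names, and sizes are mine, not from the original poster's code):

```cuda
__global__ void kernel(const float *in, float *out)
{
    // One slot per thread; assumes blockDim.x <= 256.
    __shared__ float tile[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    // Each thread writes its own slot of the shared array...
    tile[tid] = in[gid];

    // ...but below, each thread reads a slot written by a DIFFERENT thread.
    // Without this barrier, a thread may read a slot before the owning
    // thread (possibly in another warp) has written it.
    __syncthreads();

    // Read a neighbouring element produced by another thread.
    out[gid] = tile[(tid + 1) % blockDim.x];
}
```

A plausible explanation for the 64-thread threshold: very small blocks consist of only one or two warps, and warps execute in lock-step on this hardware, which can mask the missing barrier; once the block spans more warps, their relative scheduling order is unpredictable and the race shows up. Emulation mode hides it too, since threads there run sequentially on the CPU.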