shared mem issue

I’ve encountered some weird behaviour in my kernel, and after a day I managed to fix the problem, but I wanted
to confirm and share the knowledge :)
I used the scanning code from the SDK in my kernel. In debug mode everything worked fine and the results
matched the CPU. When moving to release I got garbage. It turned out I needed to allocate twice
the array size in shared memory for the scan to work. Once I did this, release worked great :)
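For anyone hitting the same thing: the “twice the array size” requirement comes from the SDK’s naive scan, which double-buffers in shared memory — it ping-pongs between two halves of the array, so the dynamic shared allocation must be 2*n elements, not n. A minimal sketch along those lines (names and launch parameters are illustrative, not the exact SDK code):

```cuda
// Naive inclusive-style scan with double buffering in shared memory.
// IMPORTANT: launch with 2*n*sizeof(float) dynamic shared memory,
// e.g.  scan_naive<<<1, n, 2 * n * sizeof(float)>>>(d_out, d_in, n);
__global__ void scan_naive(float *g_odata, const float *g_idata, int n)
{
    // Holds TWO buffers of n floats each: [0..n) and [n..2n).
    extern __shared__ float temp[];

    int thid = threadIdx.x;
    int pout = 0, pin = 1;

    // Shift right by one and insert identity (exclusive scan).
    temp[pout * n + thid] = (thid > 0) ? g_idata[thid - 1] : 0.0f;

    for (int offset = 1; offset < n; offset *= 2)
    {
        // Swap which half is "in" and which is "out" each pass.
        pout = 1 - pout;
        pin  = 1 - pout;
        __syncthreads();

        temp[pout * n + thid] = temp[pin * n + thid];
        if (thid >= offset)
            temp[pout * n + thid] += temp[pin * n + thid - offset];
    }
    __syncthreads();

    g_odata[thid] = temp[pout * n + thid];
}
```

If you only allocate n floats at launch, the writes to the second buffer (`temp[n + thid]`) run past the end of the allocation — exactly the kind of out-of-bounds shared-memory access that happens to “work” under emulation but corrupts results on the device.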

I guess it worked fine in debug since it’s in emulation mode and things are actually synced. In release, however,
I guess one block wrote over data in another block’s shared memory, hence the garbage :)
Assuming this is indeed the case, at first I thought this was a bug in the GPU or CUDA, but on second thought
I guess CUDA can’t really check these things at runtime, and it is the job of each block to stay within its smem boundaries.

It’s also probably not CUDA’s fault for having a dumb programmer like myself ;)


I am not totally sure that “Debug” builds (I am being forced to use VS2005 for now) result in device emulation. I have built some of the example projects in debug mode and found that the runs spit out performance metrics specific to my device, which is a Tesla C1060. However, it is true that in some cases building a release version produces incorrect results.
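That matches my understanding: in the CUDA toolkits of that era, device emulation was a separate nvcc option, not something a Visual Studio “Debug” configuration turns on by itself, so a Debug build can still run on the real GPU. A rough sketch of the distinction (file names are made up; check your project’s custom build rule for the actual flags):

```shell
# Emulation build: kernels run as host threads on the CPU.
nvcc -deviceemu -g -o scan_emu scan.cu

# Ordinary debug build: host-side debug info, kernels still run on the GPU.
nvcc -g -o scan_dbg scan.cu
```

So if your Debug runs report Tesla C1060 metrics, they are almost certainly executing on the device, and the debug/release difference is more likely timing or optimization exposing the out-of-bounds shared-memory access.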