I have this problem too, and it's really confusing me.
In my code there is no declaration of any shared-memory variable or function, but the profiler reported 40 bytes of static shared memory.
The program runs fast nonetheless, even on a mobile G210M (from 2800 ms on the CPU down to 8-10 ms with CUDA), so I was very excited to compare it on my GTX 560 Ti at home.
Surprisingly, the code runs slower there, at about 20 ms. On that card the profiler reported zero for both dynamic and static shared memory. This difference between the two cards really puzzles me.
FYI, on the G210M the profiler and toolkit were version 3.1, and on the GTX 560 Ti they are the latest version, 4.0. The code computes a DFT (Discrete Fourier Transform) using a two-level nested loop, each level with 2048 iterations.