Large block count leads to errors: block count and shared memory

Hi, when I use <= 192 blocks I always get the correct result; with >= 256 blocks it is always wrong a little.
My guess is that shared memory runs short when multiple blocks are active on one multiprocessor (to hide latency).
The safest approach is to use only 256k/4k = 64 blocks, so that even if ALL blocks consume shared memory at the same time it still fits; but the best performance is at 256 blocks, which tends to be wrong a little.
Is there any explicit strategy for how shared memory is used by multiple blocks?


Your question is unclear. It’s not clear what you mean by “always wrong a little”.


There may be something off in your indexing. I can confirm that grids of 512x512 blocks have returned correct results.
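
On the shared memory worry: as far as I know, a block only becomes resident on a multiprocessor when its shared memory and registers fit, so extra blocks wait in the queue rather than overrunning each other's shared memory. A minimal sketch for checking how many blocks can be resident at once, assuming a current CUDA toolkit; `myKernel` and its 4 KB tile are hypothetical stand-ins for your kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel using 4 KB of static shared memory per block.
__global__ void myKernel(float *out, int n)
{
    __shared__ float tile[1024];                 // 1024 * 4 bytes = 4 KB
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? 1.0f : 0.0f;
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x];
}

int main()
{
    const int threadsPerBlock = 256;             // must not exceed the tile size
    int residentBlocks = 0;

    // Ask the runtime how many blocks of myKernel fit on one multiprocessor,
    // given its register and shared memory usage (0 bytes of dynamic shared memory).
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &residentBlocks, myKernel, threadsPerBlock, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("shared memory per multiprocessor: %zu bytes\n",
           prop.sharedMemPerMultiprocessor);
    printf("resident blocks per multiprocessor: %d\n", residentBlocks);
    printf("resident blocks on the whole device: %d\n",
           residentBlocks * prop.multiProcessorCount);
    return 0;
}
```

If the grid has more blocks than the last number printed, the remaining blocks start only as earlier ones finish, so the grid size by itself should not corrupt shared memory.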

Also, do you run a kernel with 256 blocks immediately after the one with 256 blocks? It's possible that the second one doesn't really execute (due to register pressure or some other issue), and just by coincidence you copy from device to host the result that was computed by the previous kernel invocation. That way most of the result is correct because it comes from the previous computation. I've had this happen. One way to check is to look in the Profiler log and see how much GPU time was spent (if it's on the order of a few microseconds, your kernel didn't run).
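
Besides the Profiler, the runtime can report a failed launch directly, and pre-filling the output buffer with a sentinel makes stale data visible. A minimal sketch, assuming a hypothetical `fillKernel` in place of your kernel; the `CUDA_CHECK` macro is my own helper, not part of the CUDA API:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Helper macro (not part of CUDA): abort with a message on any runtime error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: writes its own index into every element.
__global__ void fillKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const int blocks = 256, threads = 256;
    const int n = blocks * threads;

    int *dOut = NULL;
    CUDA_CHECK(cudaMalloc(&dOut, n * sizeof(int)));

    // Fill the buffer with 0xFF bytes (each int reads back as -1), so results
    // left over from an earlier launch cannot be mistaken for fresh output.
    CUDA_CHECK(cudaMemset(dOut, 0xFF, n * sizeof(int)));

    fillKernel<<<blocks, threads>>>(dOut, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch failures
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    int *hOut = (int *)malloc(n * sizeof(int));
    if (hOut == NULL) return 1;
    CUDA_CHECK(cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost));

    int stale = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] == -1) ++stale;       // sentinel survived: never written
    printf("elements never written by the kernel: %d\n", stale);

    free(hOut);
    CUDA_CHECK(cudaFree(dOut));
    return 0;
}
```

If the launch fails, `cudaGetLastError()` reports it right away instead of letting the copy silently return old data.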


I am counting data items (see my earlier post): the count should be 1000, but I get 998 or so. Thanks!

I suspect a bug in your code. I have run apps that use 2^16-1 thread blocks successfully. The hardware should support 2^32-1 thread blocks per kernel invocation.
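
I can't see your code, so this is only a guess: a count that comes out slightly low is often caused by several threads incrementing the same counter without atomics, which loses updates more often as more blocks run concurrently. A minimal sketch of a count that does not depend on the number of blocks; `countMatches`, the test data, and the target value are all hypothetical, and shared memory atomics need hardware that supports them:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: counts the elements of data[] equal to target.
__global__ void countMatches(const int *data, int n, int target,
                             unsigned int *count)
{
    __shared__ unsigned int blockCount;          // one partial count per block
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();                             // counter is zero before use

    // Grid-stride loop: correct for any number of blocks.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        if (data[i] == target) atomicAdd(&blockCount, 1u);

    __syncthreads();                             // all increments are done
    if (threadIdx.x == 0) atomicAdd(count, blockCount);
}

int main()
{
    const int n = 100000, target = 7;
    int *h = (int *)malloc(n * sizeof(int));
    if (h == NULL) return 1;
    for (int i = 0; i < n; ++i)
        h[i] = (i % 100 == 0) ? target : 0;      // exactly 1000 matches

    int *dData = NULL;
    unsigned int *dCount = NULL;
    if (cudaMalloc(&dData, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&dCount, sizeof(unsigned int)) != cudaSuccess) return 1;
    cudaMemcpy(dData, h, n * sizeof(int), cudaMemcpyHostToDevice);

    const int blockCounts[2] = {64, 256};
    for (int b = 0; b < 2; ++b) {
        unsigned int result = 0;
        cudaMemset(dCount, 0, sizeof(unsigned int));
        countMatches<<<blockCounts[b], 256>>>(dData, n, target, dCount);
        // This blocking copy also waits for the kernel and returns its errors.
        cudaError_t err = cudaMemcpy(&result, dCount, sizeof(unsigned int),
                                     cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("%d blocks: count = %u (expected 1000)\n",
               blockCounts[b], result);
    }

    cudaFree(dData);
    cudaFree(dCount);
    free(h);
    return 0;
}
```

If your version gives different counts for 64 and 256 blocks, look for a missing `__syncthreads()` or an increment that is not atomic.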