We know that too few blocks/threads can’t hide latency, and that too many threads use up the registers. Too many blocks, however, won’t cause any trouble at compile time.
But in my tests, launches with more than 1024 blocks are consistently slower than launches with fewer than 512 blocks. Why?
Is there a “block instantiation” or “block destruction” cost? Thanks!
Maybe the bottleneck is those blocks all accessing global memory.
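One way to check this is to keep the total work fixed and change only the number of blocks. Below is a rough sketch, not the original poster's code: `scale`, `time_launch` and `iterations_per_thread` are made-up names, the kernel is just a stand-in for something memory-bound, and the array size and block counts are arbitrary. It uses a grid-stride loop so every launch touches the same data regardless of block count, and times each launch with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical memory-bound kernel. The grid-stride loop lets any
// block count cover all n elements, so total work stays constant.
__global__ void scale(float *data, int n, float factor) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Host helper: how many loop iterations each thread performs (rounded up).
static int iterations_per_thread(int n, int blocks, int threads) {
    long long total = (long long)blocks * threads;
    return (int)((n + total - 1) / total);
}

// Time a single launch with the given block count; returns milliseconds.
static float time_launch(float *d_data, int n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_data, n, 1.0001f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;    // arbitrary size for this sketch (64 MB of floats)
    const int threads = 256;  // arbitrary threads per block

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Warm-up launch so one-time startup cost is not counted.
    time_launch(d_data, n, 256, threads);

    const int counts[] = {128, 256, 512, 1024, 2048, 4096};
    for (int blocks : counts) {
        float ms = time_launch(d_data, n, blocks, threads);
        printf("%5d blocks, %3d iterations/thread: %.3f ms\n",
               blocks, iterations_per_thread(n, blocks, threads), ms);
    }

    cudaFree(d_data);
    return 0;
}
```

If the times stay roughly flat once there are enough blocks to keep the GPU busy, that would suggest global memory bandwidth is the limit rather than any per-block cost. If they climb steadily with the block count even though the total work is the same, per-block overhead becomes the more likely suspect. A single launch per configuration is noisy, so averaging several runs would give more trustworthy numbers.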