Too many blocks slow down performance (block/thread config)

We know that too few blocks or threads can’t hide latency, and that too many threads per block can use up the available registers. Too many blocks, however, won’t cause any trouble at compile time.
But in my tests, launches with more than 1024 blocks are consistently slower than launches with fewer than 512 blocks. Why?
Is there a per-block “instantiation” or “destruction” cost? Thanks!

Maybe the bottleneck is all of those blocks accessing global memory.
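
One way to check this would be to time the same memory-bound kernel with different grid sizes and compare the effective bandwidth. Below is a minimal sketch, assuming a simple copy-and-scale kernel written as a grid-stride loop so that it stays correct for any number of blocks. The names `scale`, `bandwidth_gbs`, and `blocks_to_cover` are made up for illustration and are not taken from the original tests.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed for one thread per element.
int blocks_to_cover(int n, int threads)
{
    assert(threads > 0);
    return (n + threads - 1) / threads;
}

// Grid-stride copy-and-scale: each thread handles elements i, i + stride, ...
// so the kernel gives the same result with few blocks or many.
__global__ void scale(const float* in, float* out, int n, float k)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = k * in[i];
}

// Times one launch with CUDA events and returns the effective global-memory
// bandwidth in GB/s (one read plus one write per element), or -1 on error.
float bandwidth_gbs(int blocks, int threads, int n)
{
    float *in = nullptr, *out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&out, bytes) != cudaSuccess) { cudaFree(in); return -1.0f; }
    cudaMemset(in, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<blocks, threads>>>(in, out, n, 2.0f);   // warm-up launch, not timed
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(in, out, n, 2.0f);   // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaError_t err = cudaGetLastError();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;
    // bytes / (ms * 1e-3 s) / 1e9 = bytes / (ms * 1e6)
    return static_cast<float>(2.0 * bytes / (ms * 1.0e6));
}
```

For example, comparing `bandwidth_gbs(256, 256, 1 << 22)` with `bandwidth_gbs(blocks_to_cover(1 << 22, 256), 256, 1 << 22)` shows whether the larger grid really moves data more slowly. If both numbers sit close to the card's peak memory bandwidth, the kernel is probably memory-bound and the block count is not the main factor.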