I’m rather new to CUDA development and I’ve stumbled onto some odd performance issues when using striped vs. coalesced memory access. The kernel I’m playing with is as follows:
I’m calling this kernel with 8192 blocks of 128 threads (2^20, or about 1 million threads in total).
As is, the kernel runs in 0.0256 ms, but if I comment out line 18 it takes 9.9717 ms (about 390 times longer).
By removing line 18 we’re going from coalesced to striped access, but I was under the impression that the memory manager would just read the whole block it’s trying to access regardless. Since we’re striping over the same range, shouldn’t we expect the same number of memory accesses and similar performance?
Hardware: GTX 980 Ti