I’m trying to understand the performance bottleneck on the M1000M in the following situation. The algorithm is nearly embarrassingly parallel: every thread is identical and independent of every other thread.
The Visual Profiler indicates stalls due to memory dependency. The Global Memory Alignment and Access Pattern analysis shows 32 transactions per access for loads, versus the ideal of 4 or 8. The disassembly view flags numerous places with “L2 transactions = 100% of max L2 transactions”. I’m not sure what this last part means.
Experimentally, I find that doubling the number of registers, doubling the number of threads, or running more than one kernel per SM yields little improvement. However, if I run four identical kernels, one on each available SM, I get an almost linear speedup (the run-time for four is about the same as for one). Since the L2 is shared among the SMs, this tells me that neither global memory bandwidth nor L2 transactions is the limitation.
If I comment out a few load statements, I initially see only a small improvement; however, once I comment out enough of them, all of a sudden the run-time decreases by 10x. Finally, I attempted to reorder most memory accesses so that each thread’s four- or eight-byte accesses would be contiguous across threads. Although I never quite got the code working, I also didn’t see a performance improvement, so I abandoned that optimization.
Based on this, I believe either there aren’t enough load units, or the problem has something to do with the L2, but I don’t understand the issue. If this is correct (or even if it isn’t), is there anything I can do besides the obvious step of decreasing the number of memory accesses?