Hello,
I have a kernel that makes intensive use of global memory in a “gather” pattern: each thread performs one or more reads from global memory, but the addresses are not necessarily coalesced across the threads of a warp. The addresses read by neighbouring threads usually have some locality (i.e. they are not very sparse).
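To make the access pattern concrete, here is a minimal CPU-side sketch of the gather, not the real kernel. Loop iteration `t` stands in for GPU thread `t`, and the names `gather`, `src` and `idx` are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference of the gather: iteration t plays the role of GPU thread t.
// `src` and `idx` are hypothetical names, not those of the real kernel.
std::vector<float> gather(const std::vector<float>& src,
                          const std::vector<std::size_t>& idx) {
    std::vector<float> dst(idx.size());
    for (std::size_t t = 0; t < idx.size(); ++t) {
        assert(idx[t] < src.size());  // every index must be in range
        // Data-dependent read: neighbouring values of t need not
        // read neighbouring addresses of src.
        dst[t] = src[idx[t]];
    }
    return dst;
}
```

In the real kernel each thread may do several such reads, and the values in `idx` tend to be close to each other within a warp without being consecutive.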
When profiling with ncu, I get exactly the same SOL (Speed of Light) throughput percentage for compute and for memory.
When I look at the throughput breakdown, I see that memory is limited by “L1: Lsuin Requests” and compute is limited by “SM: Inst Executed Pipe Lsu”.
My understanding is that when an SM executes a global memory instruction for a warp, it sends the L1 a single request containing the addresses of all participating threads of the warp. The L1 then handles that request in multiple pipelined processing stages.
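As a rough illustration of that mental model, the sketch below counts how many distinct memory chunks one warp-level request would touch. It assumes 32-byte sectors, 128-byte cache lines, and that no single access straddles a sector boundary. `WarpFootprint` and `footprint` are hypothetical helper names, and this is only my model of the work carried by one request, not a description of how ncu computes its metrics.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper: footprint of one warp-level load.
struct WarpFootprint {
    std::size_t sectors;  // distinct 32-byte sectors touched (assumed size)
    std::size_t lines;    // distinct 128-byte cache lines touched (assumed size)
};

// byte_addr holds one byte address per active thread of the warp (at most 32).
WarpFootprint footprint(const std::vector<std::uint64_t>& byte_addr) {
    assert(byte_addr.size() <= 32);
    std::set<std::uint64_t> sectors;
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : byte_addr) {
        sectors.insert(a / 32);   // sector index of this thread's address
        lines.insert(a / 128);    // cache-line index of this thread's address
    }
    return {sectors.size(), lines.size()};
}
```

Under these assumptions, 32 threads reading consecutive 4-byte elements touch 4 sectors in 1 line, while 32 threads reading addresses 128 bytes apart touch 32 sectors in 32 lines. In both cases the warp still issues a single load instruction, which is why I suspect the request count and the LSU instruction count move together.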
I assume that the profiler counts the throughput of these requests under both Compute and Memory. Is that assumption correct? If so, would this kernel be considered memory bound or compute bound?
To improve the performance of the kernel, should I try to reduce the number of requests sent to the memory subsystem?
Thank you