I’m working on some very large kernels whose performance is dominated by local memory access. I’ve noticed something odd in my profiles: the counter for local_store is always larger than the counter for local_load, for every kernel, and often by a large margin — typically 3-4 times as many local stores as local loads. That suggests to me that the GPU is wasting a lot of time storing values that are never read back. Is that really the case, or am I misinterpreting these counters? And if I’m reading them correctly, is there anything I can do to make the compiler produce more efficient code for local memory access?
The load and store counters have an asymmetry that trips people up: the load counter counts requests, while the store counter counts data transferred, in units of 16 bytes. Check the profiler documentation for the exact definitions. So the raw local_store/local_load ratio doesn’t mean you are storing values that are never read — your code is most likely fine.
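To see why the raw ratio can be misleading, here is a back-of-envelope sketch (Python, purely illustrative) that converts both counters into approximate bytes moved so they can be compared on equal terms. The warp size, per-thread access width, and counter values below are assumptions made for the example, not numbers taken from any profiler.

```python
# Normalize the two counters, assuming the semantics described above:
# local_load counts per-warp requests, local_store counts 16-byte units.
# WARP_SIZE and BYTES_PER_ACCESS are illustrative assumptions.

WARP_SIZE = 32          # threads per warp (assumption)
BYTES_PER_ACCESS = 4    # bytes per thread per access, e.g. one float (assumption)
STORE_UNIT_BYTES = 16   # the store counter's unit, per the explanation above

def normalize(local_load_count, local_store_count):
    """Convert both counters to approximate bytes moved so they are comparable."""
    load_bytes = local_load_count * WARP_SIZE * BYTES_PER_ACCESS
    store_bytes = local_store_count * STORE_UNIT_BYTES
    return load_bytes, store_bytes

# Example: raw counters where stores look 4x larger than loads.
load_bytes, store_bytes = normalize(local_load_count=1_000, local_store_count=4_000)
print(f"approx. local load traffic:  {load_bytes} bytes")   # 128,000 bytes
print(f"approx. local store traffic: {store_bytes} bytes")  # 64,000 bytes
```

Under these assumptions, 1,000 load requests correspond to about 128,000 bytes while 4,000 store units correspond to only 64,000 bytes, so a raw 4:1 store/load ratio does not imply that four times as much data is being stored as loaded.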