Local loads and stores

I’m working on some very large kernels whose performance is dominated by local memory access. I’ve noticed something very odd in my profiles: the counter for local_store is always larger than the counter for local_load. That’s true for all my kernels, and often by a large margin. The number of local stores is typically 3-4 times the number of local loads. That implies to me that the GPU is wasting a lot of time by storing values that will never be read. Is that really the case, or am I misinterpreting these numbers? And if I’m interpreting them correctly, is there anything I can do to make the compiler produce more efficient code for memory access?

Peter

The load and store counters have a strange asymmetry: the load counter counts requests, while the store counter counts data transferred in units of 16 bytes. Check the profiler documentation for details. So most likely everything is fine with your code.
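To compare the two counters on equal footing, you can convert the store counter back into an approximate request count. A minimal sketch, assuming fully active warps of 32 threads each storing 4 bytes (so one store request moves 128 bytes, i.e. 8 of the profiler's 16-byte units) — these access sizes are assumptions, not something the profiler reports:

```python
def estimated_store_requests(local_store_count,
                             bytes_per_thread=4,
                             threads_per_warp=32):
    """Convert a local_store count (in 16-byte units) into an
    approximate number of store requests.

    Assumes every warp is fully active and each thread stores
    bytes_per_thread bytes per request (hypothetical defaults).
    """
    bytes_per_request = bytes_per_thread * threads_per_warp  # 128 bytes
    units_per_request = bytes_per_request // 16              # 8 units
    return local_store_count / units_per_request


# Example with made-up counter values:
# local_load = 1000 requests, local_store = 3500 (16-byte units)
# 3500 / 8 = 437.5 estimated store requests, i.e. fewer store
# requests than load requests despite the larger raw counter.
print(estimated_store_requests(3500))
```

Under those assumptions, a store counter 3-4 times the load counter actually corresponds to fewer store requests than load requests, which matches what you would expect from typical code.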

Thanks! It looks like the documentation included with the toolkit needs to be updated. Here’s the description from Compute_Profiler_3.1.txt:

[codebox]local_load : Number of executed local load instructions per warp in a SM

local store : Number of executed local store instructions per warp in a SM[/codebox]

Peter