cudaprof local load/store counters

In cudaprof, we have gst_32/64/128b and gld_32/64/128b counters.

Global store counter

Global load counter

Will this also be true for the local load/store counters? For example, if a local store is coalesced into a 128-byte transaction, is the counter incremented by 8? (The difference from global stores is that the local counter doesn't separate 32-, 64- and 128-byte transactions.)

Some testing showed that yes, this also holds for the local memory counters. The simple kernel

// global thread index (32 threads per block)
int ithread = blockIdx.x * 32 + threadIdx.x;

// per-thread array
double my_array[256];
double temp = 0.0;

for (int i = 0; i < 256; i++) {
    my_array[i] = 0.32;
}

result[ithread] = temp;

was run with 32 threads per block and 54 blocks. cudaprof reports 512 local loads per block, that is, 16 loads per thread. Each thread's array is 256 * sizeof(double) = 2048 bytes, and 2048 / 16 = 128 bytes per load, so the accesses are coalesced into 128-byte transactions. BUT it reports 8 times as many local stores (4096 stores per block). Therefore, coalesced local memory accesses are counted the same way as global memory accesses: a store coalesced into a 128-byte transaction increments the counter by 8.