What information does "gld_request" provide? (cudaProf Counter)

Hi All,

I am trying to figure out the difference between gld_32/64/128 and gld_request (performance counters in cudaProf).

To do so, I wrote a very simple kernel:

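__global__ void kernel(float *d_arr)
{
    // Global thread index across the whole grid
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    // Guard against out-of-range indices (no stray semicolon after the condition)
    if (i < SIZE)
        d_arr[i]++;
}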

I am just incrementing each element of a float array of, say, 32768 elements.

My kernel launch configuration is 60 blocks of 512 threads each. Register usage per thread is 2, so occupancy is 1.

The output from cudaProf is:
gld_64=192 and gld_request=32.

The value of gld_64 makes sense, since it is the sum over 3 SMs, so each SM reports 64 global memory load transactions (all accesses being coalesced). Each SM can run 1024 threads, which means 64 half-warps, and with 1 memory transaction per half-warp that gives gld_64 = 64 per SM. Perfect.
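
To make the arithmetic behind my gld_64 reasoning explicit, here is a small host-side sketch. It only reproduces the calculation above and is not profiler code. The names (LaunchConfig, halfWarpsPerSM, expectedGld64) are my own, and it assumes that the GTX 280 has 30 SMs, that the 60 blocks are spread evenly over them, and that the counter is summed over 3 SMs as described above.

```cpp
#include <cassert>

// Hypothetical helper type: launch and hardware parameters for the estimate.
struct LaunchConfig {
    int blocks;           // grid size (60 in my run)
    int threadsPerBlock;  // block size (512 in my run)
    int totalSMs;         // assumed 30 SMs on a GTX 280
    int smsCounted;       // SMs the counter is summed over (3, per my reading)
};

// Half-warps executed on one SM, assuming blocks are spread evenly over all SMs.
constexpr int halfWarpsPerSM(const LaunchConfig& c) {
    return (c.blocks / c.totalSMs) * c.threadsPerBlock / 16;
}

// Expected gld_64, assuming fully coalesced float loads:
// 16 threads * 4 bytes = one 64-byte transaction per half-warp.
constexpr int expectedGld64(const LaunchConfig& c) {
    return halfWarpsPerSM(c) * c.smsCounted;
}

// Sanity check against the value cudaProf reported for my run.
inline void checkMyRun() {
    constexpr LaunchConfig myRun{60, 512, 30, 3};
    assert(halfWarpsPerSM(myRun) == 64);   // 2 blocks * 512 threads / 16
    assert(expectedGld64(myRun) == 192);   // 64 half-warps * 3 SMs
}
```

With my launch of 60 blocks of 512 threads this gives 64 half-warps per SM and 192 transactions over 3 SMs, which matches the reported gld_64. The same reasoning does not obviously produce 32 for gld_request, which is why I am asking.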

But what does gld_request mean? And is there any way to find out the number of coalesced and uncoalesced memory accesses on a GPU of compute capability 1.3, e.g., a GTX 280?

I am stuck. I want to come up with an analytical model for performance characterization on GPUs, so understanding these counters is very important to me.

Please help.