Hi All,
I am trying to figure out the difference between gld_32/64/128 and gld_request (performance counters in cudaProf).
To do so, I wrote a very simple kernel:
__global__ void kernel(float *d_arr)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < SIZE)  // guard against out-of-range indices
        d_arr[i]++;
}
I am just incrementing each value of a float array of size, say, 32768.
My kernel launch configuration is 60 blocks of 512 threads each. Register usage per thread is 2, so occupancy is 1.
The output from cudaProf is:
gld_64=192 and gld_request=32.
The value of gld_64 makes sense, since it is the sum over 3 SMs: each SM has 64 global memory load transactions (all accesses being coalesced). Each SM can run 1024 threads, which means 64 half-warps, and there is 1 memory transaction per half-warp, so gld_64 = 64 per SM. Perfect.
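To make that arithmetic easy to re-check for other launch configurations, here is a minimal host-side sketch (plain C++, which should also compile under nvcc). The function names and parameters are hypothetical, made up just for illustration. It assumes blocks are spread evenly over the SMs, that the GTX 280 has 30 SMs, that the profiler reports the sum over 3 SMs as described above, and that every half-warp access is fully coalesced into one transaction.

```cpp
#include <cstddef>

// Hypothetical helper: bytes touched by one half-warp (16 threads) when
// each thread reads one float. 16 * 4 = 64, i.e. the 64-byte counter.
std::size_t bytes_per_half_warp() {
    const std::size_t kHalfWarp = 16;  // threads per half-warp
    return kHalfWarp * sizeof(float);
}

// Hypothetical helper: expected gld_64 count, assuming an even block
// distribution and exactly one coalesced transaction per half-warp.
int expected_gld64(int blocks, int threads_per_block,
                   int num_sms, int sms_counted) {
    const int kHalfWarp = 16;
    int blocks_per_sm = blocks / num_sms;                    // assumes even split
    int threads_per_sm = blocks_per_sm * threads_per_block;  // resident threads per SM
    int half_warps_per_sm = threads_per_sm / kHalfWarp;      // one load each
    return half_warps_per_sm * sms_counted;                  // summed over counted SMs
}
```

With the values from this post, expected_gld64(60, 512, 30, 3) works out to 2 blocks per SM, 1024 threads per SM, 64 half-warps per SM, and 3 * 64 = 192 in total, which matches the gld_64 value I see.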
But what does gld_request mean? And is there any way to find out the number of coalesced and uncoalesced memory accesses on a GPU of compute capability 1.3, e.g., a GTX 280?
I am in a fix. I want to come up with an analytical model for performance characterization on GPUs, so understanding these counters is very important to me.
Please help.