Hello,
I wrote a simple load/store kernel:
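__global__ void loadstore(const double *A, double *C, int numElements)
{
    // One thread per element; the guard handles a partial last block.
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements)
    {
        C[i] = A[i];  // C is written to, so it cannot be const
    }
}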
and the profiler gives:
Invocations  Event Name                    Min       Max       Avg
Kernel: vectorAdd(double const *, double const *, double*, int)
1            gld_request                   1000      1000      1000
1            l1_global_load_hit            0         0         0
1            l1_global_load_miss           1984      1984      1984
...
1            gld_transactions              2048      2048      2048
1            gld_transactions_per_request  2.112000  2.112000  2.112000
Am I correct to expect gld_transactions_per_request to be approximately 2, because each warp load instruction reads 2 cache lines? A warp of 32 threads loading 8-byte doubles requests 256 bytes, which is two cache lines if a line is 128 bytes. This assumes perfectly coalesced reads of doubles, which is the case here.
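To make the expectation concrete, here is a minimal host-side sketch of the arithmetic. The warp size of 32 and the 128-byte cache line are my assumptions (typical for L1-cached global loads, but not confirmed for this device), and `transactionsPerWarpLoad` is a hypothetical helper name, not a CUDA API.

```cpp
#include <cstddef>

// Assumed hardware parameters (not taken from the profiler output):
constexpr std::size_t kWarpSize = 32;         // threads per warp
constexpr std::size_t kCacheLineBytes = 128;  // bytes per L1 cache line

// Cache lines touched by one fully coalesced, line-aligned warp load
// in which every thread reads one element of `elementBytes` bytes.
constexpr std::size_t transactionsPerWarpLoad(std::size_t elementBytes)
{
    // Round up: a partially used line still costs a whole transaction.
    return (kWarpSize * elementBytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// 32 threads * 8 bytes = 256 bytes = 2 lines for doubles.
static_assert(transactionsPerWarpLoad(8) == 2, "doubles: two lines per warp load");
// 32 threads * 4 bytes = 128 bytes = 1 line for floats.
static_assert(transactionsPerWarpLoad(4) == 1, "floats: one line per warp load");
```

Under these assumptions the ideal value is exactly 2. A load that is not aligned to a line boundary would touch an extra line, which this sketch does not model.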
Additionally, why is gld_transactions_per_request (2.112) not exactly equal to (l1_global_load_miss + l1_global_load_hit) / gld_request (1.984), which in turn is not exactly equal to gld_transactions / gld_request (2.048)?
And why is gld_transactions (2048) not exactly equal to l1_global_load_miss + l1_global_load_hit (1984)? It makes no sense for all of these to differ if they are measuring the same quantity!
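For reference, the ratios I am comparing can be reproduced with the short host-side sketch below. The constants are copied from the profiler output above. The function names are hypothetical helpers of mine, and the code only redoes the arithmetic; it does not explain the discrepancy.

```cpp
#include <cstdio>

// Counter values copied from the profiler output in this post.
constexpr double kGldRequest = 1000.0;
constexpr double kL1GlobalLoadHit = 0.0;
constexpr double kL1GlobalLoadMiss = 1984.0;
constexpr double kGldTransactions = 2048.0;
constexpr double kReportedPerRequest = 2.112;  // as printed by the profiler

// gld_transactions / gld_request
constexpr double transactionsPerRequest()
{
    return kGldTransactions / kGldRequest;
}

// (l1_global_load_hit + l1_global_load_miss) / gld_request
constexpr double l1AccessesPerRequest()
{
    return (kL1GlobalLoadHit + kL1GlobalLoadMiss) / kGldRequest;
}

// Print the three figures side by side for comparison.
void printRatios()
{
    std::printf("reported gld_transactions_per_request: %.3f\n", kReportedPerRequest);
    std::printf("gld_transactions / gld_request:        %.3f\n", transactionsPerRequest());
    std::printf("(l1 hit + l1 miss) / gld_request:      %.3f\n", l1AccessesPerRequest());
}
```

Calling `printRatios()` from `main` shows the three values I would have expected to agree: 2.112, 2.048 and 1.984.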