coalesced reads


I wrote a simple load, store kernel:

__global__ void loadstore(const double *A, const double *C, int numElements)
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
        C[i] = A[i];

and the profiler gives,

Invocations                      Event Name         Min         Max         Avg
	Kernel: vectorAdd(double const *, double const *, double*, int)
          1                     gld_request        1000        1000        1000
          1              l1_global_load_hit           0           0           0
          1             l1_global_load_miss        1984        1984        1984

          1                gld_transactions        2048        2048        2048
          1    gld_transactions_per_request      2.112000    2.112000    2.112000

Am I correct to expect gld_transactions_per_request to be approximately equal to 2 because there are 2 lines of cache read for every warp load instruction ? assuming perfectly coalesced reads of floating point doubles, which is the case.

Additionally, why is gld_transactions_per_request not exactly equal to ( l1_global_load_miss + l1_global_load_hit)/gld_request which is not exactly equal to gld_transactions / gld_request ??

and why is gld_transactions not exactly equal to ( l1_global_load_miss + l1_global_load_hit) ? It doesn’t make any sense really for them to all be different, if they’re measuring the same quantities !


My expanation is that you are reading doubles, which is 1 request but requires 2x more fetches to serve the data for the warp. Thus you have 1000 gld_requests, but approx 2x more transactions. Also, have you made sure you have alignment correct? There might be 1 extra transaction per warp to handle incorrect alignment.

Yes, the alignment is correct. This is not a misaligned read, in the kernel:

A[id] = B[id]

I was more wondering if gld_transactions_per_request was the correct metric to measure if my reads are uncoalesced. I was wondering this because it does not correspond exactly to (l1_global_load_miss + l1_global_load_hit)/gld_request which is not exactly equal to gld_transactions / gld_request. Maybe it is measuring something different?

regarding the correct ratio, it is indeed 1.0 for fp32 and 2.0 for fp64

page 48.