How do I find the amount of data transferred?

I am new to CUDA. I have a question about the amount of data transferred by the gld_coherent loads that the CUDA profiler reports.

The CUDA profiler stats for one of my kernels are as follows:
method=[ _Z18gpu_forward_kernelPfS_S_PiS_S_iiiiii ] gputime=[ 320.864 ] cputime=[ 344.000 ] occupancy=[ 0.333 ] gld_coherent=[ 3000 ] gld_incoherent=[ 0 ] gst_coherent=[ 16 ] gst_incoherent=[ 0 ]

For the 3000 gld_coherent loads, how can I find out how much data is actually being read from device DRAM? Any help is appreciated. Thanks.

You can use the following counters (available only on GPUs with compute capability 1.2 or higher):

gld_32/64/128b : Number of 32-byte, 64-byte, and 128-byte global memory load transactions

gst_32/64/128b : Number of 32-byte, 64-byte, and 128-byte global memory store transactions

Refer to the section “Interpreting profiler counters” in cudaprof.html from CUDA 2.3.
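
Once you have the per-size transaction counts, the byte total is just a weighted sum: each transaction counted under the 32-byte, 64-byte, or 128-byte counter moves that many bytes. Here is a minimal host-side sketch of that arithmetic. The struct and function names are hypothetical (they are not part of the profiler or the CUDA API), and it assumes you have already copied the counter values out of the profiler log by hand. The result reflects only what the profiler counted; as far as I know the counters may be sampled from part of the GPU rather than the whole kernel, so check the profiler documentation before treating it as an absolute total.

```cpp
#include <cstdint>

// Hypothetical container for one direction of traffic (loads or stores).
// Fill it with the values the profiler reports for the 32-, 64-, and
// 128-byte transaction counters.
struct TransactionCounts {
    std::uint64_t n32;   // number of 32-byte transactions
    std::uint64_t n64;   // number of 64-byte transactions
    std::uint64_t n128;  // number of 128-byte transactions
};

// Bytes moved by the counted transactions: each transaction contributes
// its full size, whether or not every byte in it was used by the kernel.
inline std::uint64_t transaction_bytes(const TransactionCounts& c) {
    return 32u * c.n32 + 64u * c.n64 + 128u * c.n128;
}
```

For example, with made-up counts of 10, 20, and 30 transactions of 32, 64, and 128 bytes respectively, transaction_bytes returns 5440. Run it once on the load counters and once on the store counters to get both directions.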