Dear all,
I'm using CUDA on a Tesla C1060 GPU.
I wrote a kernel that I expected to read 3^14 * 2 floats from global memory.
The accesses are almost perfectly coalesced,
so I expected the kernel to generate (3^14 * 2 * 4)/64 = 597871 64-byte transactions.
But what I found in the Visual Profiler is a gld_64 count of 59552,
which is only about 1/10 of my expected value.
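For reference, here is a small sketch of the arithmetic behind my estimate. It assumes every load is a 4-byte float and that the accesses coalesce perfectly into 64-byte transactions, which is an idealization. The names below (expected_transactions, observed, and so on) are just for illustration and are not taken from my kernel.

```python
# Sketch of the estimate, not the kernel itself.
# Assumptions: 4-byte floats, perfect coalescing into 64-byte transactions.
FLOAT_BYTES = 4
TRANSACTION_BYTES = 64


def expected_transactions(num_floats: int) -> int:
    """Number of full 64-byte transactions needed to read num_floats floats."""
    return (num_floats * FLOAT_BYTES) // TRANSACTION_BYTES


num_floats = 3**14 * 2                     # floats read by the kernel
expected = expected_transactions(num_floats)
observed = 59552                           # gld_64 value shown by the profiler
ratio = observed / expected                # fraction of expected transactions seen

print("expected:", expected)
print("observed / expected:", ratio)
```

The ratio comes out just under 0.1, which is why I suspect the counter is not an absolute count for the whole GPU.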
Does anyone have a suggestion?
Does the gld_64 counter really reflect the absolute number of memory transactions, or is it just a relative value?
Thank you very much.