I’m trying to measure how many clock cycles my GPU takes to read data from its global memory. I wrote a simple piece of code to time it, but I’m not sure it’s accurate because the numbers aren’t even close to what is stated in the programming guide.
I’ve read in the CUDA Programming Guide that the latency of a global memory access is around 400–800 clock cycles per instruction. I’m wondering: is that per thread or per warp? Is there any difference?
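For reference, here is a minimal sketch of the kind of timing kernel I mean, using the SM's `clock64()` cycle counter around a single dependent load (the kernel and variable names here are just illustrative, not my real code). Note that if the line being read is already resident in L1/L2 cache, the measured number can be far below the 400–800 cycles the guide quotes, and the `clock64()` calls themselves add some overhead:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch only: time one global load with the per-SM cycle counter.
__global__ void timeGlobalLoad(const int *data, int *sink, long long *cycles)
{
    long long start = clock64();
    int v = data[0];        // the global memory read being timed
    *sink = v;              // consume v so the load is not optimized away
    long long stop = clock64();
    *cycles = stop - start; // elapsed SM cycles (includes clock64 overhead)
}

int main()
{
    int *d_data, *d_sink;
    long long *d_cycles, h_cycles;
    cudaMalloc(&d_data, sizeof(int));
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_cycles, sizeof(long long));
    cudaMemset(d_data, 0, sizeof(int)); // note: this may leave the line in L2

    timeGlobalLoad<<<1, 1>>>(d_data, d_sink, d_cycles);
    cudaDeviceSynchronize();
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    printf("global load took ~%lld cycles\n", h_cycles);

    cudaFree(d_data); cudaFree(d_sink); cudaFree(d_cycles);
    return 0;
}
```

The store to `*sink` is there so the compiler cannot drop or hoist the load; without it, the SASS may reorder things and the measurement becomes meaningless.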