I am currently playing with the Compute Visual Profiler on a GTX480, and am specifically interested in calculating global memory throughput.
I noticed that the old perf counters (e.g., gst_32b, gst_64b, gst_128b) are missing.
The only counters that I see that appear relevant are: “gld request”, “gst request”, “l1 global load hit”, and “l1 global load miss”.
My guess is that the old counters are no longer necessary because any access, coalesced or uncoalesced, now generates a large block
transfer from main memory into the L1 global memory cache.
The question, then, is: how large is the block size of the L1 cache?
I ran one of the existing apps (Black-Scholes) and ended up with this repeating sample in the profiler:
GPU Time: 2970us
L1 global load miss: 50172
The Black-Scholes benchmark itself reports about 5.1435 GOptions/sec, which corresponds to about 51.4GB/sec of global memory bandwidth.
However, if I assume a 128B transfer per L1 miss, the actual data rate would be: (50172 x 128B x 1e-9) / (0.00297sec) = 2.16GB/sec,
which is way off the mark.
If I assume 4096B per transfer instead, the rate would be about 69.2GB/sec, which is closer but still off the mark.
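For anyone who wants to redo the arithmetic, here is a minimal Python sketch. It assumes that every "l1 global load miss" moves one fixed-size block and that the counter covers the whole kernel, neither of which I have confirmed. The function names are my own, and the numbers are copied from the profiler sample and the benchmark output above.

```python
# Figures copied from the profiler sample and benchmark output above.
GPU_TIME_S = 2970e-6       # "GPU Time: 2970us"
L1_LOAD_MISSES = 50172     # "L1 global load miss: 50172"
REPORTED_GBPS = 51.4       # bandwidth implied by the benchmark output


def throughput_gbps(misses, bytes_per_miss, seconds):
    """Data rate in GB/s (1 GB = 1e9 bytes), assuming each miss
    transfers exactly bytes_per_miss bytes (an unverified assumption)."""
    return misses * bytes_per_miss * 1e-9 / seconds


def implied_bytes_per_miss(gbps, misses, seconds):
    """Block size each miss would need for the counter to account
    for the given bandwidth on its own."""
    return gbps * 1e9 * seconds / misses


if __name__ == "__main__":
    # The two candidate block sizes tried above.
    for size in (128, 4096):
        rate = throughput_gbps(L1_LOAD_MISSES, size, GPU_TIME_S)
        print(f"{size:5d} B per miss -> {rate:.2f} GB/s")

    needed = implied_bytes_per_miss(REPORTED_GBPS, L1_LOAD_MISSES, GPU_TIME_S)
    print(f"Block size needed to match {REPORTED_GBPS} GB/s: {needed:.0f} B")
```

Working backwards this way, each miss would have to account for roughly 3043 bytes to match the reported bandwidth. That is not a power of two, so a fixed block size alone probably does not explain the gap.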
Has anyone else tried to work out global memory bandwidth from these counters, and have you seen something like this?