Memory throughput on GTX480 (cudaprof question): how to calculate memory throughput from GST/GLD

Hi folks,

I am currently playing with the Compute Visual Profiler on a GTX480, and am specifically interested in calculating global memory throughput.

I noticed that the old perf counters (e.g., gst_32b, gst_64b, gst_128b, etc.) are missing.

The only counters that I see that appear relevant are: “gld request”, “gst request”, “l1 global load hit”, and “l1 global load miss”.

If I had to guess, the old counters were no longer necessary since any type of access (either uncoalesced or coalesced) will generate a large block
transfer from main memory into the L1 global memory cache.

However, the question is, how large is the block size of the L1 cache?

I ran one of the existing apps (Black-Scholes) and ended up with this repeating sample in the profiler:
GPU Time: 2970us
L1 global load miss: 50172

The Black-Scholes benchmark itself reports about 5.1435 GOptions/sec, which corresponds to about 51.4GB/sec of global memory bandwidth.

However, if I assume something like a 128B transfer, the actual data rate would be: (50172 x 128B x 1e-9) / (.00297sec) = 2.16GB/sec
which is way off the mark.

If I assume something like 4096B per transfer, the rate would be about 69.2GB/sec, which seems closer, although still off the mark.
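To make the two guesses above easy to re-check, here is a minimal Python sketch of the same arithmetic. The miss count and GPU time are the profiler values quoted above; the function names are made up for illustration, and the 51.4GB/sec target is the figure derived from the benchmark's own report. The second helper simply inverts the formula to show how many bytes each counted miss would have to represent to reach a given rate.

```python
# Values from the profiler sample quoted in this post.
L1_GLOBAL_LOAD_MISSES = 50172
GPU_TIME_S = 2970e-6  # 2970 us

def implied_rate_gb_s(bytes_per_miss,
                      misses=L1_GLOBAL_LOAD_MISSES,
                      seconds=GPU_TIME_S):
    """Data rate implied if every counted miss moves bytes_per_miss bytes."""
    return misses * bytes_per_miss * 1e-9 / seconds

def bytes_per_miss_for(target_gb_s,
                       misses=L1_GLOBAL_LOAD_MISSES,
                       seconds=GPU_TIME_S):
    """Bytes each counted miss would need to represent to hit target_gb_s."""
    return target_gb_s * 1e9 * seconds / misses

if __name__ == "__main__":
    for size in (128, 4096):
        print(f"{size:5d} B/miss -> {implied_rate_gb_s(size):6.2f} GB/s")
    # Transfer size that would reconcile the raw counter with 51.4 GB/s.
    print(f"needed: {bytes_per_miss_for(51.4):.0f} B/miss")
```

The required size comes out at roughly 3KB per counted miss, which is not a plausible cache line size, so the raw counter is presumably not a count of every transfer on the whole chip.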

Has anyone else tried to figure out global memory bandwidth, and if so, have you seen something like this?

Thanks,
–Eric

I think the reasonable explanation would be what tmurray said: the SDK samples are not optimized for Fermi.

Therefore the performance is not optimal and you probably can’t count on it in order to calculate other things (like the L1 params etc…)

my 1 cent

eyal

Are you saying that the performance counter data is inaccurate?

This particular workload (tuned or otherwise) is not what I am trying to measure, I am just trying to figure out how to infer the global memory bandwidth.

I may have misunderstood you. I thought you were making assumptions about how the L1 or other things work based on the SDK sample's performance. If that is the case, then you cannot make those assumptions. That is what I meant.

eyal

The programming guide does state that L1 cache lines are 128B. The profiler docs state clearly that counters are only counted once per warp and only counted on multiprocessor 0.

Following this logic, a reasonable estimate of the L2->L1 bandwidth obtained would seem to be:

15 MPs * 128B/miss * 50172 warp misses/MP * 32 threads/warp * 1e-9 GB/B / (.00297sec) = 1037.9 thread-GB/s ??? That seems even less reasonable.

It makes sense, as not every thread uses all of that 128B read. We need an additional conversion factor to take out the thread unit and account for what fraction of the 128B read is actually used per thread. What is that for the benchmark you are running?

4B - 1/(32 threads) => 32.43 GB/s

8B - 1/(16 threads) => 64.87 GB/s
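For anyone who wants to plug in their own numbers, the estimate above can be sketched in Python as follows. It assumes, as the reasoning above does, that every multiprocessor sees the same miss count as multiprocessor 0 and that the counter increments once per warp; the constant and function names are only illustrative.

```python
NUM_SMS = 15             # multiprocessors on a GTX480
CACHE_LINE_BYTES = 128   # L1 line size per the programming guide
WARP_SIZE = 32
MISSES_ON_SM0 = 50172    # "l1 global load miss" from the profiler sample
GPU_TIME_S = 2970e-6     # 2970 us

def thread_gb_s():
    """Upper-bound figure in 'thread-GB/s', before correcting for usage."""
    total = NUM_SMS * CACHE_LINE_BYTES * MISSES_ON_SM0 * WARP_SIZE
    return total * 1e-9 / GPU_TIME_S

def useful_gb_s(bytes_used_per_thread):
    """Scale by the fraction of each 128B line a single thread consumes."""
    fraction = bytes_used_per_thread / CACHE_LINE_BYTES
    return thread_gb_s() * fraction

if __name__ == "__main__":
    print(f"raw: {thread_gb_s():.1f} thread-GB/s")
    for nbytes in (4, 8):
        print(f"{nbytes} B/thread -> {useful_gb_s(nbytes):.2f} GB/s")
```

The 4B and 8B cases correspond to the 1/32 and 1/16 factors listed above.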

Neither of those matches up with your projected 51 GB/s. I guess I’ll just have to quote the profiler docs here :)

People will just have to count memory accesses themselves to determine bandwidth. I never found the bandwidth counting on G200 useful anyway: it only counted “real” bandwidth, not “useful” bandwidth, and it ignored texture reads.
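As a rough sketch of what counting accesses yourself looks like: add up the bytes the kernel is known to read and write per element, multiply by the element count, and divide by the measured kernel time. The function name and the example numbers below are hypothetical and are not taken from the Black-Scholes sample.

```python
def effective_bandwidth_gb_s(num_elements,
                             bytes_read_per_element,
                             bytes_written_per_element,
                             elapsed_s):
    """'Useful' bandwidth: bytes the algorithm needs, over elapsed time."""
    total_bytes = num_elements * (bytes_read_per_element +
                                  bytes_written_per_element)
    return total_bytes * 1e-9 / elapsed_s

if __name__ == "__main__":
    # Hypothetical kernel: 1,000,000 elements, each reading 3 floats and
    # writing 2 floats, with a measured kernel time of 1 ms.
    bw = effective_bandwidth_gb_s(1_000_000, 3 * 4, 2 * 4, 1e-3)
    print(f"{bw:.1f} GB/s")
```

This gives the "useful" figure by construction, since it counts only the bytes the algorithm actually needs rather than whatever the hardware transferred.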