Memory throughput on GTX480 (cudaprof question): how to calculate memory throughput from GST/GLD

anesotericeric · May 27, 2010, 2:10pm

Hi folks,

I am currently playing with the Compute Visual Profiler on a GTX480, and am specifically interested in calculating global memory throughput.

I noticed that the old perf counters (e.g., gst_32b, gst_64b, gst_128b, etc.) are missing.

The only counters that I see that appear relevant are: “gld request”, “gst request”, “l1 global load hit”, and “l1 global load miss”.

If I had to guess, the old counters are no longer necessary, since any type of access (coalesced or uncoalesced) generates a large block transfer from main memory into the L1 global memory cache.

However, the question is, how large is the block size of the L1 cache?

I ran one of the existing apps (Black-Scholes) and ended up with this repeating sample in the profiler:
GPU Time: 2970us
L1 global load miss: 50172

The Black-Scholes benchmark itself reports about 5.1435 GOptions/sec, which corresponds to about 51.4GB/sec of global memory bandwidth.

However, if I assume something like a 128B transfer, the actual data rate would be (50172 x 128B x 1e-9) / (.00297sec) = 2.16GB/sec, which is way off the mark.

If I assume something like 4096B per transfer, the rate would be about 69.2GB/sec, which seems closer, although still off the mark.
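For anyone who wants to replay the arithmetic above with other transfer sizes, here is a minimal Python sketch. The function names are made up for illustration, and the 51.4GB/sec target is simply the figure quoted from the benchmark; nothing here says what the hardware actually transfers per counted miss.

```python
# Sketch of the arithmetic in this post. Function names are hypothetical;
# the inputs are the profiler sample and benchmark figure quoted above.

GPU_TIME_S = 2970e-6   # GPU Time: 2970us
L1_MISSES = 50172      # L1 global load miss


def throughput_gb_per_s(miss_count, bytes_per_miss, gpu_time_s):
    """Data rate in GB/s if every counted miss moved bytes_per_miss bytes."""
    return miss_count * bytes_per_miss * 1e-9 / gpu_time_s


def implied_bytes_per_miss(target_gb_per_s, miss_count, gpu_time_s):
    """Bytes each counted miss would have to stand for to reach the target rate."""
    return target_gb_per_s * 1e9 * gpu_time_s / miss_count


if __name__ == "__main__":
    # The two transfer sizes guessed at in the post.
    for size in (128, 4096):
        rate = throughput_gb_per_s(L1_MISSES, size, GPU_TIME_S)
        print(f"{size:5d} B per miss -> {rate:.2f} GB/s")

    # Working backwards from the benchmark's reported bandwidth.
    needed = implied_bytes_per_miss(51.4, L1_MISSES, GPU_TIME_S)
    print(f"51.4 GB/s would need roughly {needed:.0f} B per counted miss")
```

Working backwards like this at least shows how far the raw counter value is from something that can be multiplied by a single cache-line size.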

Has anyone else tried to work out global memory bandwidth this way, and have you seen something like this?

Thanks,
–Eric

eyalhir74 · May 27, 2010, 2:29pm

I think the reasonable explanation would be what tmurray said: the SDK samples are not optimized for Fermi.

Therefore the performance is not optimal, and you probably can’t rely on it to calculate other things (like the L1 parameters, etc.).

my 1 cent

eyal

anesotericeric · May 27, 2010, 3:41pm

Are you saying that the performance counter data is inaccurate?

This particular workload (tuned or otherwise) is not what I am trying to measure, I am just trying to figure out how to infer the global memory bandwidth.

eyalhir74 · May 27, 2010, 6:35pm

I may have misunderstood you. I thought you were making assumptions about how the L1 or other things work based on the SDK sample's performance. If that is the case, then you cannot make those assumptions. That is what I meant.

eyal

MisterAnderson42 · May 28, 2010, 12:02pm

The programming guide does state that L1 cache lines are 128B. The profiler docs state clearly that counters are only counted once per warp and only counted on multiprocessor 0.

Following this logic, a reasonable estimate of the L2->L1 bandwidth would seem to be:

15 MPs * 128B/miss * 50172 warp misses/MP * 32 threads/warp * 1e-9 GB/B / (.00297sec) = 1037.9 thread-GB/s ??? That seems even less reasonable.

It makes sense, as not every thread uses all of that 128B read. We need an additional conversion factor to take out the thread unit and account for what fraction of the 128B read is actually used per thread. What is that for the benchmark you are running?

4B used per thread: x 1/(32 threads) => 32.43 GB/s

8B used per thread: x 1/(16 threads) => 64.87 GB/s

…
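The scaling above can be replayed with a short Python sketch. The function names and default arguments are illustrative only, and the assumptions are exactly the ones made in this post: 15 multiprocessors, 128B lines, counters from a single multiprocessor, and 32 threads per warp.

```python
# Sketch of the estimate in this post. Names are hypothetical; the scaling
# factors are assumptions taken from the post, not verified hardware facts.

GPU_TIME_S = 2970e-6   # profiler GPU time for the kernel
L1_MISSES = 50172      # "l1 global load miss" as reported


def scaled_estimate_gb_per_s(miss_count, gpu_time_s, num_mps=15,
                             line_bytes=128, threads_per_warp=32):
    """Scale the single-multiprocessor, per-warp counter to 'thread-GB/s'."""
    total_bytes = num_mps * line_bytes * miss_count * threads_per_warp
    return total_bytes * 1e-9 / gpu_time_s


def used_fraction_gb_per_s(estimate_gb_per_s, bytes_used_per_thread,
                           line_bytes=128):
    """Keep only the share of each 128B line that a thread actually uses."""
    return estimate_gb_per_s * bytes_used_per_thread / line_bytes


if __name__ == "__main__":
    raw = scaled_estimate_gb_per_s(L1_MISSES, GPU_TIME_S)
    print(f"raw estimate: {raw:.1f} thread-GB/s")
    for used in (4, 8):
        rate = used_fraction_gb_per_s(raw, used)
        print(f"{used} B used per thread: {rate:.2f} GB/s")
```

Using 4B or 8B per thread corresponds to the 1/32 and 1/16 factors, since 4/128 = 1/32 and 8/128 = 1/16.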

None of those matches up with your projected 51 GB/s. I guess I’ll just have to quote the profiler docs here :)

The performance counter values do not correspond to individual thread activity. Instead, these values represent events within a thread warp. For example, a divergent branch within a thread warp will increment the divergent_branch counter by one. So the final counter value stores information for all divergent branches in all warps. In addition, the profiler can only target one of the multiprocessors in the GPU, so the counter values will not correspond to the total number of warps launched for a particular kernel. For this reason, when using the performance counter options in the profiler the user should always launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work. In practice for consistent results, it is best to launch at least 2 times as many blocks as there are multiprocessors in the device on which you are profiling. For the reasons listed above, users should not expect the counter values to match the numbers one would get by inspecting kernel code. The values are best used to identify relative performance differences between un-optimized and optimized code.

People will just have to count memory accesses themselves to determine bandwidth. I never found the bandwidth counting on G200 useful anyway: it only counted “real” bandwidth, not “useful” bandwidth, and it ignored texture reads.
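Counting by hand can be as simple as the Python sketch below. The function name is made up, and the 10 bytes per option is not taken from the benchmark source: it is only what the two figures quoted earlier in the thread (5.1435 GOptions/sec and about 51.4 GB/s) imply when divided.

```python
# Sketch of "useful" bandwidth from hand-counted traffic. The name is
# hypothetical, and bytes_per_item must come from reading your own kernel.


def effective_bandwidth_gb_per_s(bytes_per_item, items_per_second):
    """Useful bandwidth: bytes the kernel must read and write per item,
    times the measured item rate, expressed in GB/s."""
    return bytes_per_item * items_per_second * 1e-9


if __name__ == "__main__":
    # Assumption: about 10 B of global traffic per counted option, inferred
    # from the thread's numbers rather than from the Black-Scholes code.
    rate = effective_bandwidth_gb_per_s(10, 5.1435e9)
    print(f"effective bandwidth: {rate:.1f} GB/s")
```

This measures the bytes the algorithm needs rather than the bytes the memory system moved, so it says nothing about cache-line overfetch.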
