Hi,
thank you very much for your replies.
The L1/Shared Utilization metric refers to the clock set at runtime, which makes sense anyway. So it always shows, e.g., "High (8)" across 562 MHz, 823 MHz and 875 MHz. The same holds for alu_fu_utilization and ldst_fu_utilization. dram_utilization probably does not take ECC into account; see below.
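For reference, the application clocks can be set with nvidia-smi and the metrics collected roughly like this (./app stands for the respective benchmark; the K80 offers a 2505 MHz memory clock and graphics clocks such as 562, 823 and 875 MHz):

nvidia-smi -ac 2505,823
nvprof -m l1_shared_utilization,alu_fu_utilization,ldst_fu_utilization,dram_utilization ./app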
Of course, L1/Shared Utilization also includes L1 transactions, so for a pure shared-memory metric one has to compute it from the shared memory throughput values, as you suggested.
And as njuffa said, there are no ECC penalties on shared memory (might this also be the case for the L1 and L2 caches?).
I also learned from you that there are DRAM-specific overheads such as read/write turn-arounds, limited re-ordering and address/data multiplexing in device memory. I guess this remains true for HBM2 on the Pascal architecture, since it is still just 3D-stacked DRAM?
For a K80 the DRAM peak bandwidth would be 5010 MHz * (384/8) byte = 240.48 GB/s.
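For reference, a minimal sketch of that computation from the device properties (memoryClockRate is reported in kHz, i.e. 2505000 on the K80, so the factor 2 gives the 5010 MHz data rate used above):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // 2 * 2505 MHz * (384/8) byte = 240.48 GB/s on a K80
    double peakGBs = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;
    printf("Theoretical peak DRAM bandwidth: %.2f GB/s\n", peakGBs);
    return 0;
}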
I played with the kernel statement in the benchmark from "How to Access Global Memory Efficiently in CUDA C/C++ Kernels" (without offsets and strides); a condensed sketch of the timing setup follows right after the two measurements:
a[i] = a[i] + 1;   // 140 GB/s (ECC enabled), measured with cudaEvent
a[i] = 1;          // 170 GB/s (ECC enabled), measured with cudaEvent
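This is only a condensed sketch, not the original blog code; the array size of 64M floats, the block size of 256, and the byte accounting (2 * N * sizeof(float) for read+write versus N * sizeof(float) for write-only) are my choices here, and error checking is omitted:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void readWrite(float *a)            // a[i] = a[i] + 1;
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = a[i] + 1.0f;
}

__global__ void writeOnly(float *a)            // a[i] = 1;
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = 1.0f;
}

int main()
{
    const int n = 1 << 26;                     // 64M floats = 256 MB
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    readWrite<<<n / 256, 256>>>(d_a);          // warm-up

    cudaEventRecord(start);
    readWrite<<<n / 256, 256>>>(d_a);          // 1 read + 1 write per element
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("read+write : %.1f GB/s\n", 2.0 * n * sizeof(float) / ms / 1.0e6);

    cudaEventRecord(start);
    writeOnly<<<n / 256, 256>>>(d_a);          // 1 write per element
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("write only : %.1f GB/s\n", 1.0 * n * sizeof(float) / ms / 1.0e6);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}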
The ECC penalty should be around 20%, so the effective peak bandwidth would then be about 0.8 * 240.48 GB/s ≈ 192 GB/s. However, these values are based on CUDA event records, which might not be very reliable, so nvprof to the rescue:
a[i] = a[i] + 1;   // 165 GB/s (ECC enabled), nvprof
a[i] = 1;          // 190 GB/s (ECC enabled), nvprof
Based on the 192 GB/s effective peak bandwidth, the efficiencies here range from about 86% (165/192, read+write) to 99% (190/192, write-only).
In detail:
nvprof -m dram_write_throughput,dram_read_throughput,dram_utilization,global_replay_overhead,global_cache_replay_overhead ./coalescing
Metric Description                      Min          Max          Avg
Device Memory Write Throughput          82.103GB/s   83.095GB/s   82.493GB/s
Device Memory Read Throughput           80.365GB/s   81.524GB/s   80.884GB/s
Device Memory Utilization               High (8)     High (8)     High (8)
Global Memory Replay Overhead           0.000000     0.000000     0.000000
Global Memory Cache Replay Overhead     0.000000     0.000000     0.000000
nvprof -m dram_write_throughput,dram_read_throughput,dram_utilization,global_replay_overhead,global_cache_replay_overhead ./coalescing_writes_only
Device Memory Write Throughput          186.69GB/s   189.80GB/s   188.67GB/s
Device Memory Read Throughput           19.751MB/s   40.277MB/s   28.610MB/s
Device Memory Utilization               High (9)     High (9)     High (9)
Global Memory Replay Overhead           0.000000     0.000000     0.000000
Global Memory Cache Replay Overhead     0.000000     0.000000     0.000000
(K80, 2505 MHz effective DDR memory clock, 823 MHz GPU clock, ECC enabled, 34 runs)
Best Regards.
PS, on my shared memory example, for the sake of completeness: although I cannot provide the code at the moment, here are some profiler results (running at 0.823 GHz). I know this is not sufficient for any conclusions.
When I find some time, I'll extract the code, where I used shared memory (SoA pattern) to circumvent local memory (AoS pattern), and open a separate thread for discussion; a generic sketch of that pattern follows the numbers below.
Shared Load Transactions                        1080688640
Shared Store Transactions                       1074085888
Shared Memory Load Transactions Per Request     1.006147
Shared Memory Store Transactions Per Request    1.000000
Shared Memory Load Throughput                   930.40GB/s
Shared Memory Store Throughput                  924.71GB/s
L1/Shared Memory Utilization                    High (8)
L2 Hit Rate (L1 Reads)                          50.00%
L2 Throughput (L1 Reads)                        3.4229MB/s
L2 Read Transactions (L1 read requests)         1872
L2 Write Transactions (L1 write requests)       936
L2 Throughput (L1 Writes)                       1.7115MB/s
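Since I cannot post the real kernel yet, here is only a generic, hypothetical sketch of the pattern I mean (names and sizes are made up, this is not my actual code): a per-thread array that would otherwise spill to local memory (AoS across threads) is held in shared memory in a transposed, SoA-style layout instead.

#define BLOCK 256
#define K 8                                     // per-thread array length (made up)

// A per-thread array with runtime indexing would normally end up in local
// memory (off-chip). Keeping it in shared memory, transposed so that
// threadIdx.x is the fastest-varying index (SoA), keeps the accesses on-chip
// and bank-conflict free.
__global__ void perThreadArrayInShared(const float *in, float *out, int idx)
{
    __shared__ float buf[K][BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * BLOCK + tid;

    float v = in[gid];                          // coalesced global load
    for (int k = 0; k < K; ++k)
        buf[k][tid] = v * k;                    // fill the per-thread "array"

    out[gid] = buf[idx % K][tid];               // runtime index served from shared memory
}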
I also used gpumembench from GitHub to get peak values for the shared memory bandwidth, and it showed a perfect match with the theoretical peak, at least on our K80.