About DRAM stats

I have tested a simple cuBLAS matrix multiplication by calling

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

on two square matrices, profiling with both nvprof and Nsight Compute.
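
For context, a minimal self-contained version of the kind of test I am running looks roughly like this; the matrix size and fill values are just placeholders and error checking is omitted:

// Rough sketch of the test: C = alpha*A*B + beta*C for two N x N matrices.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int N = 2000;                      // 2000 or 4000 in the measurements below
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dC, 0, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Column-major, no transpose; m = n = k = N and lda = ldb = ldc = N.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}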

Regarding the DRAM read traffic, I see large differences for some input sizes but only small differences for others.

For example, for two 2000x2000 matrices, I see
Nsight

volta_sgemm_128x32_nn, 2020-Feb-18 09:20:10, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                      5,947,657
    dram__sectors_write.sum                                                         sector                      2,007,388
    ---------------------------------------------------------------------- --------------- ------------------------------

Nvprof

Kernel: volta_sgemm_128x32_nn
          1                    dram_read_transactions               Device Memory Read Transactions     7109725     7109725     7109725
          1                   dram_write_transactions              Device Memory Write Transactions     2219638     2219638     2219638

So reads are about 5.9M sectors in Nsight Compute versus about 7.1M transactions in nvprof.

However, for two 4000x4000 matrices I see

volta_sgemm_128x32_nn, 2020-Feb-18 09:20:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                     36,811,361
    dram__sectors_write.sum                                                         sector                      2,056,482
    ---------------------------------------------------------------------- --------------- ------------------------------

and

Kernel: volta_sgemm_128x32_nn
          1                    dram_read_transactions               Device Memory Read Transactions    76945237    76945237    76945237
          1                   dram_write_transactions              Device Memory Write Transactions     2013126     2013126     2013126

So DRAM reads are about 36.8M in Nsight Compute, while in nvprof they are about 76.9M.
How can these differences be explained? It seems that nvprof is counting something twice. Maybe Nsight Compute flushes some statistics at some point while nvprof does not; I don't have more details about either tool.
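
As a rough sanity check, assuming the usual 32-byte granularity for both Nsight Compute's DRAM sectors and nvprof's DRAM transactions:

2000x2000:  Nsight  5,947,657 x 32 B ≈ 190 MB    nvprof  7,109,725 x 32 B ≈ 228 MB
4000x4000:  Nsight 36,811,361 x 32 B ≈ 1.18 GB   nvprof 76,945,237 x 32 B ≈ 2.46 GB

Just reading A and B once is 2 x N^2 x 4 B, i.e. 32 MB and 128 MB respectively, so some re-reading is expected either way, but for 4000x4000 the nvprof figure is roughly double the Nsight Compute one.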

Excuse me, but I still see a large difference when comparing these two tools, and that confuses me because I don't know which one is correct.

Please see:

nsight

volta_sgemm_32x32_sliced1x4_nn, 2020-Feb-20 17:34:33, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                          1,529
    dram__sectors_write.sum                                                         sector                          2,384
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                         12,288
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                        262,144
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                         16,384
    ---------------------------------------------------------------------- --------------- ------------------------------

nvprof

Kernel: volta_sgemm_32x32_sliced1x4_nn
          1                    dram_read_transactions               Device Memory Read Transactions           3           3           3
          1                   dram_write_transactions              Device Memory Write Transactions          23          23          23
          1                             flop_count_sp   Floating Point Operations(Single Precision)      552960      552960      552960

As you can see, the number of FP operations is the same in both tools (each FFMA counts as two operations):
552,960 = 12,288 + 2 × 262,144 + 16,384

How can this large difference in DRAM read/write transactions be explained?

In order to make hardware performance counter values more deterministic, Nsight Compute by default flushes all GPU caches before each replay pass; nvprof does not do this. This can be one reason for the differences in metric values between Nsight Compute and nvprof. The --cache-control none option can be used to disable flushing of any GPU caches by Nsight Compute.

Also, for these specific DRAM metrics, the underlying hardware counters used by Nsight Compute and nvprof are different; the Nsight Compute metrics are more accurate.

Which Nsight Compute version do you mean? I don't have that option.

$ ~/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --cache-control none
unrecognised option '--cache-control'

usage: nv-nsight-cu-cli [options] [program] [program-arguments]

Cache control was introduced in version 2019.5.
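
For reference, with 2019.5 or newer an invocation would look roughly like this (the application name and the metric list are just placeholders):

nv-nsight-cu-cli --cache-control none --metrics dram__sectors_read.sum,dram__sectors_write.sum ./my_app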

Right. With "--cache-control none" I see

volta_sgemm_32x32_sliced1x4_nn, 2020-Feb-21 08:30:22, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                            484
    dram__sectors_write.sum                                                         sector                            664
    ---------------------------------------------------------------------- --------------- ------------------------------

May I know what the logic is behind flushing versus not flushing the caches? I mean, how should the results be interpreted? Is it better to flush before each replay pass or not?

Flushing the cache between measurement replay passes is done to improve metric accuracy and determinism.

For example, the values of cache- or DRAM-related metrics are often directly impacted by the content of the caches before the measurement. This means that measuring the same metric for the same kernel at different points in your application will result in different values.

Furthermore, and more importantly, non-deterministic cache contents will create varying results across replay passes for the same kernel instance. Without flushing, it would make a difference whether a cache-hit metric is collected in the first or the second replay pass, since by the second pass the cache contents will likely have changed. Certain metrics are composed of several sub-counters which need to be collected across passes, and collecting those with varying input data would result in inaccurate values.

There is no mechanism available to restore the caches to a specific prior state, which is why flushing them completely provides the highest level of determinism. In cases where measuring with an application-primed cache is important, you can use the new cache-control option to disable this feature.
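
As a toy illustration of the effect (just a sketch with arbitrary sizes, not the tool's actual behavior): if the kernel below is launched twice back-to-back and each launch is profiled without cache flushing, the second launch can find much of its input already resident in L2, so its DRAM read metrics would come out lower than the first launch's even though the kernel and its input are identical. With the default flushing, both launches report comparable values.

#include <cuda_runtime.h>

// Streams through n floats, so every element must be fetched from L2 or DRAM
// at least once per launch.
__global__ void sum_kernel(const float* data, float* out, int n) {
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        acc += data[i];
    atomicAdd(out, acc);
}

int main() {
    const int n = 1 << 20;                    // 4 MB of floats, small enough for a typical L2
    float *data, *out;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    // Profile each launch separately: without flushing, the second one may show
    // far fewer DRAM read sectors because the first launch primed the L2 cache.
    sum_kernel<<<256, 256>>>(data, out, n);
    sum_kernel<<<256, 256>>>(data, out, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(out);
    return 0;
}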