About DRAM stats

I have tested a simple cuBLAS matrix multiplication by calling

cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);

on two square matrices, profiling with both nvprof and Nsight Compute.
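
For context, a minimal self-contained version of the kind of test I am running looks roughly like this; the matrix size and fill values are just placeholders and error checking is omitted:

// Rough sketch of the test: C = alpha*A*B + beta*C for two N x N matrices.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int N = 2000;                      // 2000 or 4000 in the measurements below
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dC, 0, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Column-major, no transpose; m = n = k = N and lda = ldb = ldc = N.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}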

Regarding the DRAM read traffic, I see large differences for some input sizes but only small differences for others.

For example, for two 2000x2000 matrices, I see
Nsight

volta_sgemm_128x32_nn, 2020-Feb-18 09:20:10, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                      5,947,657
    dram__sectors_write.sum                                                         sector                      2,007,388
    ---------------------------------------------------------------------- --------------- ------------------------------

Nvprof

Kernel: volta_sgemm_128x32_nn
          1                    dram_read_transactions               Device Memory Read Transactions     7109725     7109725     7109725
          1                   dram_write_transactions              Device Memory Write Transactions     2219638     2219638     2219638

So reads are about 5.9M sectors in Nsight Compute versus about 7.1M transactions in nvprof.

However, for two 4000x4000 matrices I see

volta_sgemm_128x32_nn, 2020-Feb-18 09:20:43, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                     36,811,361
    dram__sectors_write.sum                                                         sector                      2,056,482
    ---------------------------------------------------------------------- --------------- ------------------------------

and

Kernel: volta_sgemm_128x32_nn
          1                    dram_read_transactions               Device Memory Read Transactions    76945237    76945237    76945237
          1                   dram_write_transactions              Device Memory Write Transactions     2013126     2013126     2013126

So DRAM reads are about 36.8M in Nsight Compute, while in nvprof they are about 76.9M.
How can these differences be explained? It seems that nvprof is counting something twice. Maybe Nsight Compute flushes some statistics at some point while nvprof does not; I don't have more details about either tool.
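
As a rough sanity check, assuming the usual 32-byte granularity for both Nsight Compute's DRAM sectors and nvprof's DRAM transactions:

2000x2000:  Nsight  5,947,657 x 32 B ≈ 190 MB    nvprof  7,109,725 x 32 B ≈ 228 MB
4000x4000:  Nsight 36,811,361 x 32 B ≈ 1.18 GB   nvprof 76,945,237 x 32 B ≈ 2.46 GB

Just reading A and B once is 2 x N^2 x 4 B, i.e. 32 MB and 128 MB respectively, so some re-reading is expected either way, but for 4000x4000 the nvprof figure is roughly double the Nsight Compute one.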

Excuse me, but I still see a large difference when comparing these two tools, and that confuses me because I don't know which one is correct.

Please see:

nsight

volta_sgemm_32x32_sliced1x4_nn, 2020-Feb-20 17:34:33, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                          1,529
    dram__sectors_write.sum                                                         sector                          2,384
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum                               inst                         12,288
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum                               inst                        262,144
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum                               inst                         16,384
    ---------------------------------------------------------------------- --------------- ------------------------------

nvprof

Kernel: volta_sgemm_32x32_sliced1x4_nn
          1                    dram_read_transactions               Device Memory Read Transactions           3           3           3
          1                   dram_write_transactions              Device Memory Write Transactions          23          23          23
          1                             flop_count_sp   Floating Point Operations(Single Precision)      552960      552960      552960

As you can see, the number of FP operations is the same in both tools (each FFMA counts as two operations):
552,960 = 12,288 + 2 × 262,144 + 16,384

How can this large difference in DRAM read/write transactions be explained?

In order to make hardware performance counter values more deterministic, Nsight Compute by default flushes all GPU caches before each replay pass; nvprof does not do this. This can be one reason for the differences in metric values between Nsight Compute and nvprof. The --cache-control none option can be used to disable flushing of any GPU caches by Nsight Compute.

Also, for these specific DRAM metrics, the underlying hardware counters used by Nsight Compute and nvprof are different; the Nsight Compute metrics are more accurate.

Which Nsight Compute version do you mean? I don't have that option.

$ ~/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --cache-control none
unrecognised option '--cache-control'

usage: nv-nsight-cu-cli [options] [program] [program-arguments]

Cache control was introduced in version 2019.5.
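
For reference, with 2019.5 or newer an invocation would look roughly like this (the application name and the metric list are just placeholders):

nv-nsight-cu-cli --cache-control none --metrics dram__sectors_read.sum,dram__sectors_write.sum ./my_app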

Right. With "--cache-control none" I see

volta_sgemm_32x32_sliced1x4_nn, 2020-Feb-21 08:30:22, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    dram__sectors_read.sum                                                          sector                            484
    dram__sectors_write.sum                                                         sector                            664
    ---------------------------------------------------------------------- --------------- ------------------------------

May I know what the logic is behind flushing versus not flushing the caches? I mean, how should the results be interpreted? Is it better to flush before each replay pass or not?

Flushing the cache between measurement replay passes is done to improve metric accuracy and determinism.

For example, the values of cache- or DRAM-related metrics are often directly impacted by the content of the caches before the measurement. This means that measuring the same metric for the same kernel at different points in your application will result in different values.

Furthermore, and more importantly, non-deterministic cache contents will create varying results across replay passes for the same kernel instance. Without flushing, it would make a difference whether a cache-hit metric is collected in the first or the second replay pass, since by the second pass the cache contents will likely have changed. Certain metrics are composed of several sub-counters which need to be collected across passes, and collecting those with varying input data would result in inaccurate values.

There is no mechanism available to restore the caches to a specific prior state, which is why flushing them completely provides the highest level of determinism. In cases where measuring with an application-primed cache is important, you can use the new cache-control option to disable this feature.
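
As a toy illustration of the effect (just a sketch with arbitrary sizes, not the tool's actual behavior): if the kernel below is launched twice back-to-back and each launch is profiled without cache flushing, the second launch can find much of its input already resident in L2, so its DRAM read metrics would come out lower than the first launch's even though the kernel and its input are identical. With the default flushing, both launches report comparable values.

#include <cuda_runtime.h>

// Streams through n floats, so every element must be fetched from L2 or DRAM
// at least once per launch.
__global__ void sum_kernel(const float* data, float* out, int n) {
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        acc += data[i];
    atomicAdd(out, acc);
}

int main() {
    const int n = 1 << 20;                    // 4 MB of floats, small enough for a typical L2
    float *data, *out;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    // Profile each launch separately: without flushing, the second one may show
    // far fewer DRAM read sectors because the first launch primed the L2 cache.
    sum_kernel<<<256, 256>>>(data, out, n);
    sum_kernel<<<256, 256>>>(data, out, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaFree(out);
    return 0;
}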