I have tested a simple cuBLAS matrix multiplication by calling
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
on two square matrices, profiling the same run with both nvprof and Nsight Compute.
Looking at the DRAM read counters, I see large differences between the two tools for some input sizes, but only small differences for others.
For example, for two 2000x2000 matrices, I see
Nsight
volta_sgemm_128x32_nn, 2020-Feb-18 09:20:10, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__sectors_read.sum sector 5,947,657
dram__sectors_write.sum sector 2,007,388
---------------------------------------------------------------------- --------------- ------------------------------
Nvprof
Kernel: volta_sgemm_128x32_nn
1 dram_read_transactions Device Memory Read Transactions 7109725 7109725 7109725
1 dram_write_transactions Device Memory Write Transactions 2219638 2219638 2219638
So reads are roughly 5.9M sectors in Nsight versus 7.1M transactions in nvprof, about a 20% difference.
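To put those counts in perspective, here is a quick sanity check converting them to bytes. My assumption (please correct me if wrong) is that both dram__sectors_read.sum and dram_read_transactions count 32-byte units on Volta:

```python
# Convert the profiler counts for the 2000x2000 case into bytes.
# ASSUMPTION: both Nsight sectors and nvprof transactions are 32-byte units.
SECTOR_BYTES = 32
n = 2000
matrix_bytes = n * n * 4  # float32 elements

nsight_read = 5_947_657 * SECTOR_BYTES
nvprof_read = 7_109_725 * SECTOR_BYTES
min_read = 2 * matrix_bytes  # A and B must each be read at least once

print(f"Nsight reads : {nsight_read / 2**20:.1f} MiB")
print(f"nvprof reads : {nvprof_read / 2**20:.1f} MiB")
print(f"lower bound  : {min_read / 2**20:.1f} MiB")
```

Both tools report far more than the 2 x 16 MB theoretical minimum, which is expected for a tiled SGEMM, but they still disagree with each other.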
However, for two 4000x4000 matrices I see
volta_sgemm_128x32_nn, 2020-Feb-18 09:20:43, Context 1, Stream 7
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
dram__sectors_read.sum sector 36,811,361
dram__sectors_write.sum sector 2,056,482
---------------------------------------------------------------------- --------------- ------------------------------
and
Kernel: volta_sgemm_128x32_nn
1 dram_read_transactions Device Memory Read Transactions 76945237 76945237 76945237
1 dram_write_transactions Device Memory Write Transactions 2013126 2013126 2013126
So here DRAM reads are about 36.8M in Nsight but 76.9M in nvprof, more than a 2x difference.
How can these differences be explained? It seems that nvprof may be counting some things twice. Or maybe Nsight flushes some statistics at the end of the kernel while nvprof doesn't. I don't have any more details about the internals of these two tools.
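To quantify the gap, here is a quick ratio check using the numbers above. The size-dependent ratio (close to 1.2x at 2000x2000 but about 2.1x at 4000x4000) is what made me suspect double counting rather than a fixed overhead:

```python
# Ratio of nvprof to Nsight DRAM read counts at the two problem sizes.
ratio_2000 = 7_109_725 / 5_947_657
ratio_4000 = 76_945_237 / 36_811_361

print(f"2000x2000: nvprof/Nsight = {ratio_2000:.2f}")
print(f"4000x4000: nvprof/Nsight = {ratio_4000:.2f}")
```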