Nsight and nvprof results have large differences

I see that Nsight Compute results are very different from nvprof's, and to be honest, the results don't look reliable, IMO.

I did three tests with the matrixMul example. In all tests, I used the following command to multiply fairly large matrices; the devices are a 2080Ti and a TitanV.

./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048

For Nsight Compute 2019.4, I ran the following commands to measure the IPC.

2080Ti:

$ CUDA_VISIBLE_DEVICES=1 /mnt/local/mnaderan/tools/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --quiet --metrics smsp__inst_executed.avg.per_cycle_active -f -o 2080ti.ipc ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce RTX 2080 Ti" with compute capability 7.5

MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 30.05 GFlop/s, Time= 142.942 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

TitanV:

$ CUDA_VISIBLE_DEVICES=0 /mnt/local/mnaderan/tools/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --quiet --metrics smsp__inst_executed.avg.per_cycle_active -f -o titanv.ipc ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "TITAN V" with compute capability 7.0

MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 9.47 GFlop/s, Time= 453.459 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Since the 2080Ti is not supported by nvprof, I ran it only on the TitanV, with this command:

$ CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --metrics ipc -f -o titanv.ipc.nvvp ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
==2939== NVPROF is profiling process 2939, command: ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
GPU Device 0: "TITAN V" with compute capability 7.0

MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 36.08 GFlop/s, Time= 119.056 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==2939== Generated result file: /home/mnaderan/sdk/0_Simple/matrixMul/titanv.ipc.nvvp

Results are shown in the pictures below:

nsight -> 2080Ti -> IPC=0.18 https://pasteboard.co/IFUqOgq.png
nsight -> TitanV -> IPC=0.39 https://pasteboard.co/IFUr9qQ.png
nvprof -> TitanV -> IPC=1.5 https://pasteboard.co/IFUrow1.png

Three files have been uploaded at https://gofile.io/?c=22r91Q

I see low IPC values for the 2080Ti in other programs too. That is weird.
Any comment?

The difference you see is due to the fact that the metrics you measured don’t match exactly. CUPTI IPC is measured for the whole SM, while smsp__inst_executed is only per SM sub partition (“smsp”). Instead, you would want to use

sm__inst_executed.avg.per_cycle_active

You can also find more details on the same topic in this reply: https://devtalk.nvidia.com/default/topic/1042813/b/t/post/5289662/

We will fix the same in the comparison table https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison

Thanks for the reply. Considering smsp vs. sm,

nvprof -> TitanV -> IPC=1.5
is 4x of
nsight -> TitanV -> IPC=0.39

which makes sense. With that, the SM IPC of the 2080Ti will be 0.18*4=0.72.
Isn't that strange? It sounds like MM on the 2080Ti is memory bound while on the TitanV it is compute bound. There may be some performance difference between these two devices, but for MM I doubt it is that large.
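For clarity, the arithmetic above can be sketched as follows (the factor of 4 is the number of SM sub-partitions on Volta/Turing, which is what the smsp-to-sm conversion relies on):

```python
# Convert a per-sub-partition (smsp) IPC value to a per-SM IPC value.
# Volta/Turing SMs contain 4 sub-partitions, so the whole-SM rate is 4x.
SUBPARTITIONS_PER_SM = 4

def smsp_to_sm_ipc(smsp_ipc: float) -> float:
    """Scale smsp__inst_executed.avg.per_cycle_active up to the SM level."""
    return smsp_ipc * SUBPARTITIONS_PER_SM

print(smsp_to_sm_ipc(0.39))  # TitanV: ~1.56, close to nvprof's ipc of 1.5
print(smsp_to_sm_ipc(0.18))  # 2080Ti: ~0.72
```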

@felix_dt

I see that some other metrics have different values for one device when using nvprof and nsight. Here are two examples:

1- For the atomic_transactions metric, I ran two commands on the TitanV.

nvprof --kernels "KERNEL_NAME" --metrics atomic_transactions \
-f -o titanv.atomics.nvvp --log-file nvvp.log GMX_COMMAND

and

nv-nsight-cu-cli --quiet --kernel-regex "KERNEL_NAME" --metrics \
l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum,l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum,l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum,l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum \
-f -o titanv.atomics.nsight GMX_COMMAND

The results are

nvprof:
atomic transactions = 964611

nsight:
l1tex__t_set_accesses_pipe_lsu_mem_global_op_atom.sum = 0
l1tex__t_set_accesses_pipe_lsu_mem_global_op_red.sum = 2034905
l1tex__t_set_accesses_pipe_tex_mem_surface_op_atom.sum = 0
l1tex__t_set_accesses_pipe_tex_mem_surface_op_red.sum = 0

The picture is available at https://pasteboard.co/IGM7xhl.png
I cannot find any relation between ~2M and ~960K.

2- For the shared_transactions metric, I ran two commands on the TitanV.

nvprof --kernels "KERNEL_NAME" \
--metrics shared_store_transactions,shared_load_transactions \
-f -o titanv.atomics.nvvp --log-file nvvp.log GMX_COMMAND

and

nv-nsight-cu-cli --quiet --kernel-regex "KERNEL_NAME" --metrics \
smsp__inst_executed_op_shared_ld.sum,smsp__inst_executed_op_shared_st.sum \
-f -o titanv.atomics.nsight GMX_COMMAND

Please note that smsp is used for nsight. The results are

nvprof:
shared load transactions = 848045
shared store transactions = 426544

nsight:
smsp__inst_executed_op_shared_ld.sum = 809424
smsp__inst_executed_op_shared_st.sum = 520344

The picture is available at https://pasteboard.co/IGM71OC.png

The differences are not as large as for the previous metric. Still, I want to know whether there is a reason for them, or whether this is a tolerable measurement error.

Any thought?

@felix_dt
I found that L2 read/write transactions have large differences between nsight and nvprof. The commands are

nvprof --kernels "KERNEL_NAME" \
--metrics l2_read_transactions,l2_write_transactions \
-f -o titanv.l2.nvvp --log-file nvvp.log

and

nv-nsight-cu-cli --quiet --kernel-regex "KERNEL_NAME" \
--metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum \
-f -o titanv.l2.nsight

The picture can be seen at https://pasteboard.co/IGNsSmN.png

Results are:

nvprof:
L2 read transactions = 2611446
L2 write transactions = 3330288

nsight
lts__t_sectors_op_read.sum = 63048
lts__t_sectors_op_write.sum = 780534

If you have any idea please let me know.
Working with nsight isn't as easy as I thought…

One notable difference between nvprof and Nsight Compute is that the latter automatically flushes all caches for each kernel replay iteration, in order to guarantee deterministic and consistent results. This impacts particularly L2 cache metrics. The next version of Nsight Compute will have a control to disable this cache flushing, at the cost of reduced result reproducibility.
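For reference, a minimal sketch of how the new control can be used once available (the option name and values are the ones that appear later in this thread; the kernel filter, metric list, and output file are just example placeholders):

```shell
# Sketch: disable Nsight Compute's automatic cache flushing between
# kernel replay passes. This reduces result reproducibility but avoids
# perturbing L2 cache metrics. KERNEL_NAME is a placeholder.
nv-nsight-cu-cli --cache-control none \
  --kernel-regex "KERNEL_NAME" \
  --metrics lts__t_sectors_op_read.sum,lts__t_sectors_op_write.sum \
  -f -o titanv.l2.nsight ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
```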

As for the shared load transactions, you are comparing mismatching metrics, since smsp__inst_executed_op_shared_ld will give you the number of instructions executed, not the number of transactions. I think the metrics you are looking for are l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum for load transactions and l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum for store transactions, respectively.
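To keep the names straight, the nvprof-to-Nsight metric correspondences confirmed so far in this thread can be collected in one place (only these entries are backed by the replies above; treat anything else as unverified):

```python
# nvprof metric -> equivalent Nsight Compute metric, per the replies above.
NVPROF_TO_NSIGHT = {
    "ipc": "sm__inst_executed.avg.per_cycle_active",
    "shared_load_transactions":
        "l1tex__data_pipe_lsu_wavefronts_mem_shared_op_ld.sum",
    "shared_store_transactions":
        "l1tex__data_pipe_lsu_wavefronts_mem_shared_op_st.sum",
}

# Print the mapping as a quick reference table.
for old, new in NVPROF_TO_NSIGHT.items():
    print(f"{old} -> {new}")
```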

Regarding the atomic transactions, I have no immediate answer, other than that the metrics you collected look to be the correct ones.

Do you mean 2019.5?
May I know what the option for that is?

Changes are described in the release notes: https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html

I don't see any effect for --cache-control all and --cache-control none.
Please see the picture at https://pasteboard.co/IIvtrvd.png

--cache-control all
lts__t_sectors_op_read.sum = 62283
lts__t_sectors_op_write.sum = 780534

--cache-control none
lts__t_sectors_op_read.sum = 62247
lts__t_sectors_op_write.sum = 780534

I tried the atomic metrics with 2019.5 and got the same difference.
It is really bizarre. I cannot find any reason for the 2.1x difference.