I see that Nsight Compute results are very different from nvprof's and, to be honest, they don't look reliable to me.
I ran three tests with the matrixMul example. In all of them, I used the following command to multiply fairly large matrices; the devices are a 2080Ti and a TitanV.
./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
For NsightCompute-2019.4, I ran the following commands to measure the IPC.
2080Ti:
$ CUDA_VISIBLE_DEVICES=1 /mnt/local/mnaderan/tools/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --quiet --metrics smsp__inst_executed.avg.per_cycle_active -f -o 2080ti.ipc ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce RTX 2080 Ti" with compute capability 7.5
MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 30.05 GFlop/s, Time= 142.942 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
TitanV:
$ CUDA_VISIBLE_DEVICES=0 /mnt/local/mnaderan/tools/NVIDIA-Nsight-Compute-2019.4/nv-nsight-cu-cli --quiet --metrics smsp__inst_executed.avg.per_cycle_active -f -o titanv.ipc ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "TITAN V" with compute capability 7.0
MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 9.47 GFlop/s, Time= 453.459 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
Since nvprof is not compatible with the 2080Ti, I ran it only on the TitanV, using this command:
$ CUDA_VISIBLE_DEVICES=0 ~/cuda-10.1.168/bin/nvprof --metrics ipc -f -o titanv.ipc.nvvp ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
[Matrix Multiply Using CUDA] - Starting...
==2939== NVPROF is profiling process 2939, command: ./matrixMul -wA=2048 -hA=1024 -wB=1024 -hB=2048
GPU Device 0: "TITAN V" with compute capability 7.0
MatrixA(2048,1024), MatrixB(1024,2048)
Computing result using CUDA Kernel...
done
Performance= 36.08 GFlop/s, Time= 119.056 msec, Size= 4294967296 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performancemeasurements. Results may vary when GPU Boost is enabled.
==2939== Generated result file: /home/mnaderan/sdk/0_Simple/matrixMul/titanv.ipc.nvvp
The results are shown below (each one has a screenshot attached):
nsight → 2080Ti → IPC=0.18
nsight → TitanV → IPC=0.39
nvprof → TitanV → IPC=1.5
The three result files have been uploaded to Gofile.
I also see low IPC values for the 2080Ti in other programs, which is odd.
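One thing that may be worth ruling out is a unit mismatch: the smsp__ prefix in the Nsight Compute metric suggests it is counted per SM sub-partition, whereas nvprof's ipc is reported per SM. The sketch below, in Python, rescales the Nsight Compute values under the assumption of 4 sub-partitions per SM on both GPUs. That count and the helper name per_sm_ipc are assumptions for illustration, not something taken from the reports.

```python
# Assumption (not from the measurements above): each SM on Volta (TitanV)
# and Turing (2080Ti) has 4 sub-partitions (SMSPs), and nvprof's `ipc`
# metric is reported per SM rather than per SMSP.
SMSP_PER_SM = 4


def per_sm_ipc(per_smsp_ipc: float, smsp_per_sm: int = SMSP_PER_SM) -> float:
    """Scale a per-sub-partition IPC value to a per-SM IPC value."""
    return per_smsp_ipc * smsp_per_sm


# Values reported in this post.
nsight_titanv = per_sm_ipc(0.39)  # Nsight Compute, TitanV
nsight_2080ti = per_sm_ipc(0.18)  # Nsight Compute, 2080Ti
nvprof_titanv = 1.5               # nvprof, TitanV (already per SM)

print(f"TitanV: nsight scaled = {nsight_titanv:.2f}, nvprof = {nvprof_titanv:.2f}")
print(f"2080Ti: nsight scaled = {nsight_2080ti:.2f}")
```

If that assumption holds, the scaled TitanV value (about 1.56) would be close to the nvprof value of 1.5, so part of the gap could be a per-SMSP versus per-SM difference rather than a measurement error. The metric sm__inst_executed.avg.per_cycle_active may be the closer counterpart to nvprof's ipc, but that has not been verified here.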
Any comment?