Hi, we’re trying to profile some of our kernels to improve their performance. I’ve invested quite a bit of time in nvprof/nvvp in the past, and have passively watched the Nsight tools progress over the years. Since our workflow is all containers in Kubernetes, the only way we can really profile visually is to dump the output using the CLI tools and import it into the graphical tools on another machine.
I’m not sure when this started happening, but I ran the usual nvprof profiling we do with:
__PREFETCH=off nvprof --analysis-metrics -f -o output.prof cmd
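For context on the workflow: the report is generated in the pod and copied out for import into nvvp on a workstation, roughly like this (the namespace, pod, and path here are placeholders for our setup):
kubectl cp profiling/profiler-pod:/workspace/output.prof ./output.prof
and then loaded via File > Import in nvvp.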
The nvprof command now hits a SIGPIPE while running and never completes a profiling session. Since nvprof/nvvp are deprecated, I started playing with Nsight Systems and Nsight Compute. I began with “nsys profile”:
nsys profile --stats=true cmd
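For completeness, here is the more explicit variant of the same run, with the trace sources and output name spelled out (cmd is our binary, as above):
nsys profile -t cuda,nvtx -o report --stats=true cmd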
The nsys run completes successfully and drops a qdrep file that I can load in nsys-ui. Once the file is loaded and I click to analyze the kernel, it asks where ncu is located, so I pointed it at the directory with the binary, and it says:
Both are the most recent versions (Nsight Compute 2020.1.2 and Nsight Systems 2020.3.4), so I’d expect the integration to be supported. Next, I tried the Nsight Compute command line directly, loading the resulting file in the UI:
ncu -o profile --metrics "regex:.*" cmd
It drops an ncu-rep file, which I then load into ncu-ui. However, the Details view shows nothing profiled:
This appears to be a bug someone reported last year, and they said on the forums it was fixed:
However, I’m running the latest version and it still doesn’t work. I also tried a subset of the metrics, and that too showed nothing in the Details view. Next, I looked at which metrics are available on this GPU, and the list seems far too short for a V100:
root@fi-gcomp016:/# /opt/nvidia/nsight-compute/2020.1.2/ncu --list-metrics
sm__warps_active.avg.per_cycle_active
sm__warps_active.avg.pct_of_peak_sustained_active
sm__throughput.avg.pct_of_peak_sustained_elapsed
sm__maximum_warps_per_active_cycle_pct
sm__maximum_warps_avg_per_active_cycle
sm__cycles_active.avg
lts__throughput.avg.pct_of_peak_sustained_elapsed
launch__waves_per_multiprocessor
launch__thread_count
launch__shared_mem_per_block_static
launch__shared_mem_per_block_dynamic
launch__shared_mem_per_block_driver
launch__shared_mem_per_block
launch__shared_mem_config_size
launch__registers_per_thread
launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
launch__occupancy_per_block_size
launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
launch__occupancy_limit_blocks
launch__grid_size
launch__block_size
l1tex__throughput.avg.pct_of_peak_sustained_active
gpu__time_duration.sum
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:75:80:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
gpc__cycles_elapsed.max
gpc__cycles_elapsed.avg.per_second
dram__cycles_elapsed.avg.per_second
-arch:75:80:dram__cycles_elapsed.avg.per_second
-arch:40:70:dram__cycles_elapsed.avg.per_second
breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed
breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
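If I’m reading the docs right, --list-metrics only prints the metrics referenced by the currently configured sections, while --query-metrics enumerates everything the chip itself exposes, so this is probably the better sanity check (same binary path as above):
/opt/nvidia/nsight-compute/2020.1.2/ncu --query-metrics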
Even running one of the simple CUDA samples behaves the same way, and ncu fails with an internal error:
root@fi-gcomp016:/nfs/samples/7_CUDALibraries/simpleCUBLAS# /opt/nvidia/nsight-compute/2020.1.2/ncu -o test --set full ./simpleCUBLAS
==PROF== Connected to process 1382 (/nfs/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS)
GPU Device 0: "Volta" with compute capability 7.0
simpleCUBLAS test running...
==PROF== Profiling "volta_sgemm_32x32_sliced1x4_nn" - 1: 0%...50%...100% - 3 passes
==ERROR== Error: InternalError
simpleCUBLAS test passed.
==PROF== Disconnected from process 1382
==ERROR== An error occurred while trying to profile.
==PROF== Report: /nfs/samples/7_CUDALibraries/simpleCUBLAS/test.ncu-rep
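In case collecting the full set is what trips the internal error, the smallest thing I can think of to try next is a single section (section name taken from --list-sections; flag spelling per this version’s help):
/opt/nvidia/nsight-compute/2020.1.2/ncu --section SpeedOfLight -o minimal ./simpleCUBLAS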
Any idea what’s wrong here?