Can't Get NCU GUI To Import Properly

Hi, we're attempting to profile some of our kernels to improve performance a bit. In the past I've invested quite a bit of time in nvprof/nvvp, and have passively watched the Nsight tools progress over the years. Since our workflow is all containers in Kubernetes, the only way we can realistically profile visually is to dump the output using the CLI tools and import it into the graphical tools on another machine.

I'm not sure when this started happening, but I ran our usual nvprof profiling with:

__PREFETCH=off nvprof --analysis-metrics -f -o output.prof cmd

This command now dies with a SIGPIPE while running and never completes a profiling session successfully. Given that nvprof/nvvp are slated for deprecation, I started playing with Nsight Systems and Nsight Compute. I began with "nsys profile":

nsys profile --stats=true cmd

This completes successfully and drops a qdrep file that I can load in nsys-ui. Once this file is loaded and I click to analyze the kernel, it asks where ncu is located, so I pointed it to the binary, and it says:

[screenshot: pic1]

Both are the most recent versions (Nsight Compute 2020.1.2 and Nsight Systems 2020.3.4), so I'd expect the integration to be supported. Next, I tried the Nsight Compute command line directly, along with the UI to load the file:

ncu -o profile --metrics "regex:.*" cmd

It drops an ncu-rep file, which I then load into ncu-ui. However, the Details view shows nothing profiled:

This appears to be a bug someone reported last year, and they said on the forums it was fixed:

However, I'm running the latest version and it still doesn't work. I also tried a subset of the metrics, and that didn't show anything in the Details view either. Next, I looked at which metrics were listed on this GPU, and the list seems far too short for a V100:

root@fi-gcomp016:/# /opt/nvidia/nsight-compute/2020.1.2/ncu --list-metrics
sm__warps_active.avg.per_cycle_active
sm__warps_active.avg.pct_of_peak_sustained_active
sm__throughput.avg.pct_of_peak_sustained_elapsed
sm__maximum_warps_per_active_cycle_pct
sm__maximum_warps_avg_per_active_cycle
sm__cycles_active.avg
lts__throughput.avg.pct_of_peak_sustained_elapsed
launch__waves_per_multiprocessor
launch__thread_count
launch__shared_mem_per_block_static
launch__shared_mem_per_block_dynamic
launch__shared_mem_per_block_driver
launch__shared_mem_per_block
launch__shared_mem_config_size
launch__registers_per_thread
launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
launch__occupancy_per_block_size
launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
launch__occupancy_limit_blocks
launch__grid_size
launch__block_size
l1tex__throughput.avg.pct_of_peak_sustained_active
gpu__time_duration.sum
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:75:80:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
gpc__cycles_elapsed.max
gpc__cycles_elapsed.avg.per_second
dram__cycles_elapsed.avg.per_second
-arch:75:80:dram__cycles_elapsed.avg.per_second
-arch:40:70:dram__cycles_elapsed.avg.per_second
breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed
breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed

Even running a simple sample gives the same result, and ncu fails with an internal error:

root@fi-gcomp016:/nfs/samples/7_CUDALibraries/simpleCUBLAS# /opt/nvidia/nsight-compute/2020.1.2/ncu -o test --set full ./simpleCUBLAS
==PROF== Connected to process 1382 (/nfs/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS)
GPU Device 0: "Volta" with compute capability 7.0
simpleCUBLAS test running…
==PROF== Profiling "volta_sgemm_32x32_sliced1x4_nn" - 1: 0%…50%…100% - 3 passes
==ERROR== Error: InternalError
simpleCUBLAS test passed.
==PROF== Disconnected from process 1382
==ERROR== An error occurred while trying to profile.
==PROF== Report: /nfs/samples/7_CUDALibraries/simpleCUBLAS/test.ncu-rep

Any idea what’s wrong here?

Note that nvprof does not support GPU performance metric collection on the GeForce GTX 1660 (which is a Turing architecture GPU), so "nvprof --analysis-metrics" is not expected to work. Refer to the "Migrating to Nsight Tools from Visual Profiler and nvprof" section in the profiler documentation.
However, nvprof should report an error and not crash. Can you please provide the nvprof version (output of "nvprof --version")?

Hi @Sanjiv.Satoor, I think we can ignore the nvprof part of my post since we're 100% Volta and Ampere at this point, and I don't want to spend time on deprecated tools. I don't have a GTX card; that paste was from a forum link to someone else with the same problem who had a GTX card.

I followed the instructions in that document and still could not get the profiler to work. Note that I'm running this in a container; all CUDA applications work normally on the GPU, but the profiler does not.

I would recommend against trying to collect all possible metrics at once. Not only is the incurred overhead very high, it also increases the chance that a single broken sub-metric breaks the whole profiling run. Instead, choose one of the curated sets and/or sections (as you tried later), or select smaller groups of individual metrics.
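For example, assuming `ncu` is on the PATH and `cmd` stands in for your application (as in the commands above), a curated-set run could look like this sketch:

```shell
# Profile with a built-in curated set instead of "regex:.*"
# (far lower overhead than collecting every metric):
ncu -o profile --set default cmd

# Or combine a built-in section with a couple of individual metrics
# (both metric names appear in the --list-metrics output above):
ncu -o profile --section SpeedOfLight \
    --metrics sm__cycles_active.avg,gpu__time_duration.sum cmd
```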

--list-metrics will not show you which metrics are available on the GPU; it lists the metrics currently selected for profiling. Without other options, this includes the metrics of the default set. When choosing a different combination, e.g. "--section SpeedOfLight --metrics inst_executed --list-metrics", it lists the metrics included in that combination.
To see the metrics available on the current (or any supported) GPU, use the --query-metrics option. It can be combined with --chips to select a specific chip (GPU) rather than the current device.
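Putting the two options side by side (a sketch; `gv100` is used as an example chip name):

```shell
# What WILL BE COLLECTED with the current selection
# (not what the GPU supports):
ncu --section SpeedOfLight --metrics inst_executed --list-metrics

# What the current GPU actually supports:
ncu --query-metrics

# What a specific chip supports, without that GPU being installed:
ncu --chips gv100 --query-metrics
```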

That is certainly not the expected behavior. I profiled this sample with the same Nsight Compute version and parameters on a GV100, and I am not seeing the error you are reporting. It is also unexpected that running the "full" set would require only 3 replay passes of the profiled kernel ("3 passes"); on this GPU, it should require ~75 passes.

Can you please check the following:

  • Which display driver version are you using? (provide e.g. the output of nvidia-smi) If it’s old, consider upgrading to the latest version.

  • If possible, can you profile the same sample outside of the container?

  • Does your system have profiling permissions enabled in the driver?
    For both questions above, you can refer to https://developer.nvidia.com/blog/using-nsight-compute-in-containers/ for more details.

  • Are you using a “clean” installation of Nsight Compute? E.g., I would expect the following (shortened) output

    /opt/nvidia/nsight-compute/2020.1.2/ncu --set full --list-metrics
    thread_inst_executed_true
    smsp__warps_eligible.avg.per_cycle_active
    smsp__warps_active.avg.per_cycle_active
    smsp__warps_active.avg.peak_sustained
    smsp__thread_inst_executed_pred_on_per_inst_executed.ratio
    smsp__thread_inst_executed_per_inst_executed.ratio
    smsp__sass_thread_inst_executed_op_fmul_pred_on.sum.per_cycle_elapsed
    smsp__sass_thread_inst_executed_op_ffma_pred_on.sum.per_cycle_elapsed
    smsp__sass_thread_inst_executed_op_fadd_pred_on.sum.per_cycle_elapsed
    smsp__sass_thread_inst_executed_op_dmul_pred_on.sum.per_cycle_elapsed
    smsp__sass_thread_inst_executed_op_dfma_pred_on.sum.per_cycle_elapsed
    smsp__sass_thread_inst_executed_op_dadd_pred_on.sum.per_cycle_elapsed
    smsp__pcsamp_warps_issue_stalled_wait_not_issued

    group:smsp__pcsamp_warp_stall_reasons_not_issued
    group:smsp__pcsamp_warp_stall_reasons
    group:memory__shared_table
    group:memory__l2_cache_table
    group:memory__first_level_cache_table
    group:memory__dram_table
    group:memory__chart
    gpu__time_duration.sum
    gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
    -arch:75:80:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
    -arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
    gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
    gpu__compute_memory_request_throughput.avg.pct_of_peak_sustained_elapsed
    gpu__compute_memory_access_throughput.avg.pct_of_peak_sustained_elapsed
    gpc__cycles_elapsed.max
    gpc__cycles_elapsed.avg.per_second
    dram__cycles_elapsed.avg.per_second

    dram__bytes.sum.peak_sustained
    breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed
    breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed

Thanks @felix_dt, our lab is down at the moment, but when it comes back up I will try your suggestions.

@felix_dt I was able to get it working better on an A100. I will go back and see what the V100 was having issues with. The A100 driver version is 450.51.09.

@felix_dt I tried using the default section of metrics to get something close to nvprof's defaults, but the profiler seems to run endlessly, printing this for the same kernel:

==PROF== Profiling "Complex64ToPlanar" - 939: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 940: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 941: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 942: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 943: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 944: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 945: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 946: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 947: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 948: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 949: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 950: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 951: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 952: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 953: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 954: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 955: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 956: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 957: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 958: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 959: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 960: 0%…50%…100% - 13 passes
==PROF== Profiling "Complex64ToPlanar" - 961: 0%…50%…100% - 13 passes

Have you seen this before?

Note that nvprof by default only traces the application's activities, i.e. API calls, GPU kernels, memcpys, etc. It captures a low-overhead timeline of these, which is very different from Nsight Compute's default of collecting performance metrics for individual CUDA kernels. If you want the same CUDA tracing nvprof provided, you need to switch to Nsight Systems.
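A minimal Nsight Systems invocation for that kind of timeline capture might look like this (`cmd` again standing in for your application):

```shell
# Trace CUDA API calls, kernels, and memcpys on a timeline
# (nvprof-style tracing, not per-kernel metric collection):
nsys profile --trace=cuda --stats=true -o timeline cmd
```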

As for the output you are seeing: the tool is not profiling the same kernel launch multiple times, but every instance of this kernel (i.e. the Complex64ToPlanar kernel appears to be launched at least 961 times in your application). You can use various flags to limit what is captured, e.g. -c for the kernel launch count, -s to skip instances, -k or --kernel-id to filter by kernel name, etc.
For example

-s 10 -c 100 -k Complex64ToPlanar

would profile 100 instances of all kernels matching "Complex64ToPlanar", after skipping the first 10 instances.

--kernel-id :::1

would profile the first instance of all kernel types in the application.

The documentation and --help have further details.
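Combining the filtering flags above into full commands (a sketch; `./app` is a placeholder for the profiled binary):

```shell
# Skip the first 10 launches of Complex64ToPlanar, then profile
# the next 100 launches:
ncu -s 10 -c 100 -k Complex64ToPlanar -o profile ./app

# Alternatively, profile only the first launch of every kernel type:
ncu --kernel-id :::1 -o profile ./app
```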

Thanks @felix_dt, I will try Nsight Systems. From reading the documentation, my understanding was that Nsight Systems was more for CPU profiling with a small amount of GPU emphasis, but it sounds like it's a good starting point for replacing nvprof/nvvp, so I'll start there.