Ncu-ui not profiling some sections


I’m a newbie in CUDA.

I start Nsight Compute as root using ncu-ui.
My code has 2 kernels.
The first one profiles correctly for all (!) sections, but the second one doesn’t profile for the following sections:

  • Instruction Statistics
  • Occupancy
  • Source Counters

With any of the above sections selected, I get:
The profiler returned an error code: 1 (0x1)

The first errors in the report are:

[Error] Rule Bottleneck returned an error:
Metric launch__waves_per_multiprocessor not found

[Error] <built-in function IAction_metric_by_name> returned a result with an error set

[Error] Rule Roofline Analysis returned an error: Metric
sm__sass_thread_inst_executed_op_ffma_pred_on.sum.peak_sustained not found

Is there something in my code that prevents these sections from being profiled?


Can you please provide details on your setup?

  • Nsight Compute version
  • which GPU
  • CUDA driver version
  • OS version

Sorry for the late reply.

  • Nsight Compute 2020.2.1 (build 29181059)
  • RTX 2080 Super
  • Build cuda_11.1.TC455_06.29190527_0
  • Ubuntu 20.04.1

… and a correction to post #1: the Occupancy section works fine, so only Instruction Statistics and Source Counters produce error code 1.

Unfortunately, it’s not clear from your description why this wouldn’t work. If you can share your code with us for testing, we can debug the issue internally. Otherwise, please try the following steps:

  • Profile the application with the ncu command line interface with one of the offending sections, e.g.

    ncu --section InstructionStats -s 1 -c 1 my-app

to profile only the second kernel with this section.

  • If this also fails, you can try the individual metrics listed in this section's file. The file can be found in the “sections” directory of your ncu installation path. Select one or more of the metric names from this file and collect them via the command line:

    ncu --metrics smsp__inst_executed.sum,inst_executed -s 1 -c 1 my-app

  • If you identify one specific metric that causes problems, you can remove it from the section file as a workaround (WAR) to get unblocked.

  • As an alternative, try collecting the data with application replay

    ncu --replay-mode application -s 1 -c 1 my-app

  • Are there any specific properties of this kernel that could cause issues? Does it run especially long? Have you checked the kernel for correctness, e.g. with compute-sanitizer?
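
The metric-by-metric step in the list above can be scripted. The following is a minimal Python sketch, not an official tool: it only builds `ncu` command lines from the options already shown in this thread (`--metrics`, `-s`, `-c`), one metric per invocation, so that a failing metric can be isolated. The helper name `build_ncu_commands` and the example metric list are illustrative assumptions; the real list should come from the section file of your installation.

```python
import shlex


def build_ncu_commands(metrics, app, skip=1, count=1, ncu="ncu"):
    """Return one ncu argv list per metric (hypothetical helper).

    skip/count map to the -s/-c options used earlier in the thread,
    so skip=1, count=1 profiles only the second kernel launch.
    """
    commands = []
    for metric in metrics:
        commands.append(
            [ncu, "--metrics", metric, "-s", str(skip), "-c", str(count), app]
        )
    return commands


# Example metric names copied from the command above; replace them with
# the names found in the section file of your installation.
example_metrics = ["smsp__inst_executed.sum", "inst_executed"]

for argv in build_ncu_commands(example_metrics, "my-app"):
    # Print instead of executing, so the sketch runs without a GPU.
    # To actually run a command, pass argv to subprocess.run(argv).
    print(shlex.join(argv))
```

Running each printed command in turn shows which single metric, if any, triggers the error.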


Thanks a lot for the extensive reply.

I saw the LaunchFailed only now (see below). I have to address this one.

Sorry, but I’m not allowed to post any code. I know … it would make it easier for me as well. I will work through all the points in your post above. Very useful!

# /usr/local/cuda-11.1/bin/ncu --section InstructionStats -s 1 -c 1 test_cuda
==PROF== Profiling "kernel_2" - 1 of 1: 0%....50%....100% - 2 passes
==ERROR== Error: LaunchFailed
==PROF== Disconnected from process 12114
==ERROR== An error occurred while trying to profile.
[12114] test_cuda@
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler                                                                      (!) n/a
Executed Instructions                                                                                         (!) n/a
Avg. Issued Instructions Per Scheduler                                                                        (!) n/a
Issued Instructions                                                                                           (!) n/a
---------------------------------------------------------------------- --------------- ------------------------------

Is it possible that this is due to a coding error?

# /usr/local/cuda-11.1/bin/compute-sanitizer test_cuda
========= ERROR SUMMARY: 0 errors

Regarding LaunchFailed

cudaErrorLaunchFailure An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. The device cannot be used until cudaThreadExit() is called. All existing device memory allocations are invalid and must be reconstructed if the program is to continue using CUDA.

Fishing in the dark … :(

… later …

I have more info.

It seems that everything revolves around these values.

dim3 blocks  (9*1024,1,1);
dim3 threads (1024);

If I change the block count to 8*1024, it always works.
If I change the block count to 10*1024, it always fails.
With 9*1024, it sometimes works and sometimes fails.

This must be somehow an indication of a very specific problem.
I checked the code up and down … didn’t find anything wrong.
This must be CUDA related.
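
For scale, the launch configurations above can be put into numbers. This is a small Python sketch of the arithmetic only. It assumes one `curandStateXORWOW` per thread (the kernel signature in the passed run below takes a pointer to that type) and 48 bytes per state; neither assumption is confirmed in this thread, and the numbers do not explain the failure by themselves.

```python
THREADS_PER_BLOCK = 1024   # from: dim3 threads (1024)
STATE_BYTES = 48           # assumed sizeof(curandStateXORWOW)


def launch_footprint(blocks, threads_per_block=THREADS_PER_BLOCK):
    """Return (total threads, assumed curand state bytes) for a 1D launch."""
    total_threads = blocks * threads_per_block
    return total_threads, total_threads * STATE_BYTES


for factor in (8, 9, 10):
    threads, state_bytes = launch_footprint(factor * 1024)
    print(f"{factor}*1024 blocks: {threads:,} threads, "
          f"{state_bytes / 2**20:.0f} MiB of RNG state (assumed)")
```

All three grid sizes are far below the x-dimension grid limit, so the launch configuration itself is legal; what changes between them is the amount of work and, under the assumption above, the memory footprint.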

A passed run looks like this:

# /usr/local/cuda-11.1/bin/ncu --section InstructionStats -s 1 -c 1 test_cuda
==PROF== Profiling "kernel_test" - 1 of 1: 0%....50%....100% - 3 passes
==PROF== Disconnected from process 34479
[34479] test_cuda@
kernel_test(curandStateXORWOW*, int, res_struct*, int*, unsigned long long*), 2020-Nov-27 15:33:33, Context 1, Stream 7
Section: Instruction Statistics
---------------------------------------------------------------------- --------------- ------------------------------
Avg. Executed Instructions Per Scheduler                                          inst                     80.148.480
Executed Instructions                                                             inst                 15.388.508.160
Avg. Issued Instructions Per Scheduler                                            inst                  80.148.553,89
Issued Instructions                                                               inst                 15.388.522.347
---------------------------------------------------------------------- --------------- ------------------------------
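
As a sanity check on the passed run, the per-scheduler averages can be compared with the totals. The sketch below (Python, helper name hypothetical) parses the locale-formatted numbers from the output above, where `.` is the thousands separator and `,` the decimal separator. The scheduler count is an assumption: 48 SMs with 4 schedulers each for an RTX 2080 Super.

```python
def parse_ncu_number(text):
    """Parse a number printed with '.' as thousands and ',' as decimal separator."""
    return float(text.replace(".", "").replace(",", "."))


# Values copied from the Instruction Statistics section above.
executed_total = parse_ncu_number("15.388.508.160")
executed_avg = parse_ncu_number("80.148.480")
issued_total = parse_ncu_number("15.388.522.347")
issued_avg = parse_ncu_number("80.148.553,89")

ASSUMED_SCHEDULERS = 48 * 4  # assumption: 48 SMs x 4 schedulers per SM

# Ratio of total to per-scheduler average; should match the scheduler count.
print(executed_total / executed_avg)

# Recompute the issued average from the total and compare with the report.
print(round(issued_total / ASSUMED_SCHEDULERS, 2), issued_avg)
```

If the ratio matches the assumed scheduler count, the report is internally consistent, which suggests the counters themselves are collected correctly whenever the launch succeeds.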