I am trying to profile my application using ncu with the SourceCounters section enabled, as I am interested in the stall reasons. However, running the command sudo ncu --section SourceCounters myApplication shows the following errors:
ERR Rule PC sampling data returned an error: Metric smsp__pcsamp_sample_count not found
----- --------------------------------------------------------------------------------------------------------------
ERR <built-in function IAction_metric_by_name> returned a result with an exception set
/home/myuser/Documents/NVIDIA Nsight Compute/2023.2.1/Sections/PCSamplingData.py:47
/usr/local/NVIDIA-Nsight-Compute-2023.2/target/linux-desktop-glibc_2_11_3-x64/../../sections/NvRules.py:2017
----- --------------------------------------------------------------------------------------------------------------
ERR Rule Uncoalesced Global Accesses returned an error: Metric memory_l2_theoretical_sectors_global not found
----- --------------------------------------------------------------------------------------------------------------
ERR <built-in function IAction_metric_by_name> returned a result with an exception set
/home/myuser/Documents/NVIDIA Nsight Compute/2023.2.1/Sections/UncoalescedAccess.py:70
/usr/local/NVIDIA-Nsight-Compute-2023.2/target/linux-desktop-glibc_2_11_3-x64/../../sections/NvRules.py:2017
----- --------------------------------------------------------------------------------------------------------------
ERR Rule Uncoalesced Shared Accesses returned an error: Metric memory_l1_wavefronts_shared not found
----- --------------------------------------------------------------------------------------------------------------
ERR <built-in function IAction_metric_by_name> returned a result with an exception set
/home/myuser/Documents/NVIDIA Nsight Compute/2023.2.1/Sections/UncoalescedSharedAccess.py:70
/usr/local/NVIDIA-Nsight-Compute-2023.2/target/linux-desktop-glibc_2_11_3-x64/../../sections/NvRules.py:2017
Interestingly, the command sudo ncu --section SourceCounters myApplication --print-summary per-gpu results in a segmentation fault (of ncu itself). In addition, sudo ncu --section SourceCounters myApplication --print-summary per-gpu --graph-profiling graph also segfaults, but only after a while (around 80 kernel launches). My application does not use CUDA graphs.
I am running ncu Version 2023.2.1.0 (build 33050884) (public-release) on Linux Mint 21 x86_64 with kernel 5.15.0-78-generic. The device is an RTX 3070.
I have attached the complete output of the first command. output_of_ncu.txt (4.9 KB)
Any tips on how to resolve this issue? I have already tried reinstalling ncu a few times, but I fear some files are still lingering on my system and getting mixed up.
The errors you are showing in your description are only a symptom of the problems encountered while profiling the third kernel in your log, which subsequently caused several metrics expected by these rules to be unavailable. The underlying error is
==ERROR== An error was reported by the driver
which indicates that the kernel hit a GPU exception (like an illegal memory access) while being replayed by the tool. There can be multiple reasons for this, with the most likely being:
An existing bug in the kernel that is triggered when it is run under the profiler. You would want to run the different tools provided by compute-sanitizer on this kernel to make sure that's not the case (see the example commands after this list).
An issue introduced by the software-patching metrics that are part of the SourceCounters section.
Another problem in the combination of GPU, CUDA driver and tool. It would be useful if you could let us know your exact CUDA driver version (e.g. by providing the output of nvidia-smi).
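For reference, the individual compute-sanitizer sub-tools could be run along these lines (the application name is taken from your command line; adjust paths and arguments as needed):

compute-sanitizer --tool memcheck myApplication    # default: invalid/out-of-bounds memory accesses
compute-sanitizer --tool racecheck myApplication   # shared-memory data race hazards
compute-sanitizer --tool initcheck myApplication   # use of uninitialized device global memory
compute-sanitizer --tool synccheck myApplication   # invalid use of synchronization primitives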
Since you are looking for the stall reasons, which are collected using a different internal metric provider and not using software patching, you could request these independently from the rest to work around (WAR) the issue, e.g. by creating a new section file in the user's documents directory at /home/user/Documents/NVIDIA Nsight Compute/<version>/Sections with the following content, and collect only that:
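As an illustration only, such a section file could look roughly like the following (the identifier, labels and metric group names are assumptions modeled on the PC sampling metrics from your log, not the exact file from this reply):

Identifier: "PCSamplingOnly"
DisplayName: "PC Sampling Only"
Metrics {
  Metrics {
    Label: "Sampled Warp Stall Reasons"
    Name: "group:smsp__pcsamp_warp_stall_reasons"
  }
  Metrics {
    Label: "Sampled Warp Stall Reasons (Not Issued)"
    Name: "group:smsp__pcsamp_warp_stall_reasons_not_issued"
  }
}

The section could then be collected on its own into a report file, e.g. with ncu --section PCSamplingOnly -o myReport myApplication (using the hypothetical identifier from the sketch above).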
You would then open this report in the UI. This will give you sampled warp stalls, correlated with your source code. Since you used --print-summary in your original command, it's also possible you are only looking for values aggregated across the runtime of your kernel, in which case you would collect
ncu --section WarpStateStats …
instead. If you want to see the chart generated by this section on the command line, you’ll also have to use --print-details all.
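Putting those options together, the summary workflow from your original command line could look something like this (same application invocation as before, with all options placed before the application name):

sudo ncu --section WarpStateStats --print-summary per-gpu --print-details all myApplication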
If your application doesn't use CUDA graphs, the option --graph-profiling graph has no impact and is unlikely to trigger any errors. It's possible, however, that the issue is non-deterministic and occurs at different kernel instances on different runs.
The application returns no errors when run without the profiler. I check all API call return values and compute sanitizer returns 0 errors. Is it possible to have Nsight Compute produce a core dump of the failed application? It does not appear to do so by default, or coredumpctl cannot find it.
Thank you for your help with getting the counters out regardless. I can at least move forward :)
compute sanitizer returns 0 errors
Please make sure to also check its --tool racecheck sub-tool (by default, only memcheck is run, which checks for a different problem class).
This also does not return any errors or warnings. So the error must be introduced by the profiler, but not by the sanitizer? I assume they work in somewhat similar ways to observe the execution.