Nsight Compute not finding kernels

I’m encountering a problem similar to other threads where no definitive resolution was ever found: my system will find and profile CUDA kernels (Python or otherwise) with nsys, but not with ncu. For example, following the ALCF stream benchmark tutorial, nsys works and reports that kernels exist:

$ nsys -v
NVIDIA Nsight Systems version 2023.2.3.1004-33186433v0
$ nsys profile --stats=true -t cuda ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.081068 s (=9933.704586 MBytes/sec)
Read: 0.000795 s (=1012682.471917 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1378238.901 0.00039     0.00039     0.00039
Mul         1343356.075 0.00040     0.00041     0.00040
Add         1365405.096 0.00059     0.00059     0.00059
Triad       1374085.843 0.00059     0.00059     0.00059
Dot         1338072.742 0.00040     0.00041     0.00041
Generating '/var/tmp/pbs.327750.sc5pbs-001-ib/nsys-report-27a8.qdstrm'
[1/6] [========================100%] report3.nsys-rep
[2/6] [========================100%] report3.sqlite
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)    Max (ns)   StdDev (ns)                Name
 --------  ---------------  ---------  -----------  -----------  ---------  ----------  -----------  ---------------------------------
     58.9      196,051,613        401    488,906.8    524,707.0    385,636     588,752     96,540.4  cudaDeviceSynchronize
     36.5      121,424,138        103  1,178,875.1    403,723.0    396,596  27,189,162  4,497,385.9  cudaMemcpy
      2.0        6,814,547          2  3,407,273.5  3,407,273.5  3,314,304   3,500,243    131,478.7  cudaGetDeviceProperties_v2_v12000
      1.7        5,660,376          4  1,415,094.0    942,962.5    251,390   3,523,061  1,448,320.3  cudaMalloc
      0.6        1,872,954        501      3,738.4      2,978.0      2,627     264,191     11,741.0  cudaLaunchKernel
      0.3          930,521          4    232,630.3    223,240.5     70,699     413,341    140,318.0  cudaFree
      0.0            1,354          1      1,354.0      1,354.0      1,354       1,354          0.0  cuModuleGetLoadingMode

[4/6] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                             Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------
     25.0       58,246,173        100  582,461.7  582,527.0   581,279   583,775        479.3  void add_kernel<double>(const T1 *, const T1 *, T1 *)
     24.8       57,865,436        100  578,654.4  578,623.5   577,663   579,838        475.0  void triad_kernel<double>(T1 *, const T1 *, const T1 *)
     16.9       39,249,761        100  392,497.6  392,479.0   391,008   394,495        638.4  void mul_kernel<double>(T1 *, const T1 *)
     16.6       38,734,716        100  387,347.2  387,375.0   384,256   391,519      1,509.0  void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
     16.4       38,270,655        100  382,706.6  382,624.0   381,087   393,472      1,234.3  void copy_kernel<double>(const T1 *, T1 *)
      0.2          521,759          1  521,759.0  521,759.0   521,759   521,759          0.0  void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)

[5/6] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)  Min (ns)   Max (ns)   StdDev (ns)      Operation
 --------  ---------------  -----  ---------  --------  --------  ----------  -----------  ------------------
    100.0       80,716,337    103  783,653.8   2,304.0     1,792  27,002,161  4,533,035.9  [CUDA memcpy DtoH]

[6/6] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
    805.652    103     7.822     0.003     0.003   268.435       45.360  [CUDA memcpy DtoH]
 [...]

… but ncu does not. Here I’m using a freshly downloaded copy of ncu (the system-level install, version 2023.2.2.0, behaves the same way):

$ ~/nsight/ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
$ ~/nsight/ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079844 s (=10086.003737 MBytes/sec)
Read: 0.000726 s (=1109384.116908 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1405071.807 0.00038     0.00039     0.00039
Mul         1361593.605 0.00039     0.00040     0.00040
Add         1364852.022 0.00059     0.00060     0.00059
Triad       1371100.648 0.00059     0.00060     0.00059
Dot         1351499.246 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.

System device information (note that this is a shared, centrally-managed system, and I have no ready access to either sudo or driver updates):

$ nvidia-smi
Tue Sep 24 15:29:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0              79W / 400W |      4MiB / 40960MiB |     86%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                    0 |
| N/A   39C    P0              53W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B1:00.0 Off |                    0 |
| N/A   55C    P0             343W / 400W |  30144MiB / 40960MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   47C    P0             113W / 400W |  38834MiB / 40960MiB |     80%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
$ echo $CUDA_VISIBLE_DEVICES
GPU-74927176-aee1-5f59-0810-869856abe095
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-787fd111-7bc2-a11d-fab3-c02ed8a14e17)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-74927176-aee1-5f59-0810-869856abe095)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-0f8b9778-7976-890c-cc17-ef350b6e72de)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-c8b3fcd7-1317-bd52-8ba0-ce108833882a)

My ultimate goal is to compute flops (or, better yet, roofline plots) for a particular JAX model under various sets of hyperparameter options. Unfortunately, JAX doesn’t provide valid flops estimates for kernels that use cuBLAS (so all of the interesting ones), so I’m left with runtime/practical monitoring via CUPTI events.
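
For reference, once ncu does attach, the kind of measurement I have in mind looks roughly like this (a sketch only: train.py is a hypothetical JAX driver script, and the metrics listed are the double-precision FLOP and DRAM-traffic counters commonly used for A100 roofline analysis, to be adjusted for whatever precision the model actually runs in):

$ ncu --target-processes all -o jax_roofline \
      --metrics sm__sass_thread_inst_executed_op_dadd_pred_on.sum,sm__sass_thread_inst_executed_op_dmul_pred_on.sum,sm__sass_thread_inst_executed_op_dfma_pred_on.sum,dram__bytes.sum \
      python train.py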

Hi, @csubich

Sorry for the issue you’ve met.
Can you confirm whether ncu works for another simple CUDA sample?

No, ncu does not work for other CUDA samples. My final intended use is a more complex Python code, but I picked up the ALCF tutorial code for this case to reduce the possibility of user error.

Hi, @csubich

$ ~/nsight/ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)

Your output shows the NCU version is 2024.3.1.0. Can you double-confirm that 2023.2.2.0 has the same issue?

Yes, I can double-confirm that 2023.2.2.0 behaves the same way:

$ which ncu
/usr/local/cuda/bin/ncu
$ ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.2.0 (build 33188574) (public-release)
$ ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079728 s (=10100.622983 MBytes/sec)
Read: 0.000773 s (=1041366.435757 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1404954.144 0.00038     0.00039     0.00038
Mul         1361262.176 0.00039     0.00040     0.00040
Add         1363959.717 0.00059     0.00060     0.00059
Triad       1369047.111 0.00059     0.00060     0.00059
Dot         1357266.692 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I checked against the currently-newest 2024.3.1.0 to help rule out any already-fixed bugs that might have been triggered.

Hi, @csubich

Thanks for the update. NCU 2023.2.2.0 + driver 535.104.05 + A100 works in our internal environment; it has been tested multiple times. So it is hard to tell why it fails in your environment.

Is there any other special setting? How do you share the machine? Are you running in Docker or a VM? Has it ever worked with any other version?

You can create a file with the content below and set the NVLOG_CONFIG_FILE environment variable to point to it.
Then run ncu; you’ll get a log recorded in /tmp/nvlog.log.

$ /tmp/nvlog.log
UseStdout
ForceFlush
Format |$time|$sev:${level:-3}|$proc|$tid|$name>> $text

  • 0i 0w 100ef 0IW 100EF global
    -100i 100w 100ef 0IW 100EF regop_tgt_dta

Is there any other special setting? How do you share the machine?

The machine is shared via a PBS Professional queuing system. The examples I’ve shown so far have all been from interactive jobs. I’m not aware of any system configuration out of the ordinary.

Are you running in Docker or a VM?

No, the execution environment is not virtualized.
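
A quick cross-check, assuming systemd is available on the compute node, is systemd-detect-virt, which prints “none” on bare metal and otherwise names the hypervisor or container runtime:

$ systemd-detect-virt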

Has it ever worked with any other version?

I’ve not tried with other versions, so I don’t know if there was a historic time that this might have worked.

You can create a file with the content below and set the NVLOG_CONFIG_FILE environment variable to point to it. Then run ncu; you’ll get a log recorded in /tmp/nvlog.log.

Your formatting is garbled here; please use Markdown’s literal representation for code-like items such as this. Knowing now to search for nvlog.config, I found a template in the nsight-compute distribution and used that; its contents are included in the following transcript.

$ cp ~/nsight/host/target-linux-x64/nvlog.config.template ./nvlog.config
$ export NVLOG_CONFIG_FILE=$(pwd)/nvlog.config
$ cat nvlog.config
# Rename this file to nvlog.config and put into one of the two places:
#   * Next to the nsys-ui binary.
#   * Into your current directory.
                                       
# Enable all loggers:
+ 75iw 75ef 0IW 0EF   global
# Except for too verbose ones                           
- quadd_verbose_                           
                                             
# Append logs to a file:                               
$ nsys-ui.log                                          
                                                       
# Flush the log file after every log message:          
ForceFlush                                             
                                                       
# On Windows, use OutputDebugString():
OutputDebugString                                                                                    
                                             
# Log into stderr:                                                                                                  
UseStderr                    
      
# Specify logging format:                                                                      
# Simple format                                                                                
# Format $time $tid $name $text                                                                                                       

# A more verbose variant of logging format:
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text
$ ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079682 s (=10106.451581 MBytes/sec)
Read: 0.000721 s (=1116398.673031 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1395241.814 0.00038     0.00039     0.00039
Mul         1357345.617 0.00040     0.00040     0.00040
Add         1370426.335 0.00059     0.00059     0.00059
Triad       1379641.443 0.00058     0.00059     0.00059
Dot         1355594.274 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
$ cat nsys-ui.log
I12:14:11:465|TPS_Comms|1793889|:10[]: Creating AsioAsyncActionProcessor - Background 0x7c108b0
I12:14:11:465|TPS_Comms|1793889|:10[]: Creating AsioAsyncActionProcessor - Background 0x7c41120
I12:14:11:466|CmdlineProfiler|1793889|:2518[]: Using temporary file /var/tmp/pbs.347468.sc5pbs-001-ib/nsight-compute-174f-4024.ncu-rep
W12:14:11:508|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:508|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:509|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:509|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:509|NvRules|1793889|:511[]: Failed to add python file /home/csu001/Documents/NVIDIA Nsight Compute/2023.2.2/Sections/NvRules.py as rule
W12:14:11:510|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:510|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:512|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:512|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:513|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:513|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:515|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:515|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:517|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:517|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:518|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:518|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:519|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:519|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:521|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:521|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:522|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:522|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:524|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:524|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:526|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:526|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:527|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:527|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:529|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:529|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:532|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:532|NvRules|1793889|:245[]: QueryKind returned error 3
I12:14:11:532|CmdlineProfiler|1793889|:228[]: Expanded 0 macros in file name
I12:14:11:532|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:11:532|pipe_utilities|1793889|:36[]: Contruct Pipe
I12:14:11:532|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:11:532|pipe_utilities|1793889|:57[]: Create Pipe from name (isOwner: 1)
I12:14:11:532|pipe_utilities|1793889|:26[]: Contruct PipeDescriptor (isOwner: 1)
I12:14:11:532|pipe_utilities|1793889|:67[]: Creating pipe /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0
I12:14:11:532|pipe_utilities|1793889|:91[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerReadingStream for reading (8)
I12:14:11:532|pipe_utilities|1793889|:96[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerWritingStream for writing (9)
I12:14:11:533|pipe_utilities|1793911|:36[]: Contruct Pipe
I12:14:11:533|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:11:533|pipe_utilities|1793911|:57[]: Create Pipe from name (isOwner: 0)
I12:14:11:533|pipe_utilities|1793911|:26[]: Contruct PipeDescriptor (isOwner: 0)
I12:14:11:533|pipe_utilities|1793911|:91[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerWritingStream for reading (10)
I12:14:11:533|pipe_utilities|1793911|:96[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerReadingStream for writing (11)
I12:14:11:534|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:11:535|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:12:673|CmdlineProfiler|1793889|:1307[]: End
I12:14:12:673|CmdlineProfiler|1793889|:1331[]: Profiler shutdown requested
I12:14:12:673|pipe_utilities|1793889|:110[]: Close Pipe (isOwner: 1)

I12:14:12:673|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:12:673|pipe_utilities|1793889|:45[]: Reset Pipe (closePipe: 1)
I12:14:12:674|CmdlineProfiler|1793889|:39[]: Destroying KernelReplayHelper
I12:14:12:674|CmdlineProfiler|1793889|:271[]: Destroying ProfileHelper
I12:14:12:674|TPS_Session|1793889|:151[]: ===== SessionManager async close all sessions. NumSessions = 0 =====
I12:14:12:674|TPS_Session|1793889|:177[]: SessionManager running the ActionProcessor waiting for sessions to complete AsyncClose.
I12:14:12:674|TPS_Session|1793889|:183[]: ===== SessionManager async close completed =====
I12:14:12:674|TPS_Comms|1793889|:17[]: Destroying AsioAsyncActionProcessor - Background - 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:27[]: AsioAsyncActionProcessor - Background - Stopping
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:17[]: Destroying AsioAsyncActionProcessor - Background - 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:27[]: AsioAsyncActionProcessor - Background - Stopping
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:28[]: Stopping foreground action processor.
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c38150
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c38150

Another attempt with my reconstruction of your suggested nvlog.config file gives no logging output:

$ cat nvlog2.config # Log file listing
$ /tmp/nvlog.log
UseStdout
ForceFlush
Format |$time|$sev:${level:-3}|$proc|$tid|$name>> $text
+ 0i 0w 100ef 0IW 100EF global
-100i 100w 100ef 0IW 100EF regop_tgt_dta
$ export NVLOG_CONFIG_FILE=$(pwd)/nvlog2.config                                                                 
$ ncu --target-processes all ./cuda-stream                                                                      
BabelStream                                                 
Version: 5.0    
Implementation: CUDA
Running kernels 100 times
Precision: double                                      
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)          
Using CUDA device NVIDIA A100-SXM4-40GB                                                                                                                  
Driver: 12020                                               
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079660 s (=10109.295369 MBytes/sec)
Read: 0.000725 s (=1110064.618400 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1404237.559 0.00038     0.00039     0.00038
Mul         1360389.494 0.00039     0.00040     0.00040
Add         1369261.268 0.00059     0.00060     0.00059
Triad       1376688.568 0.00058     0.00060     0.00059
Dot         1352772.874 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.                   
$ ls -ls /tmp/nvlog.log
0 -rw-r--r-- 1 csu001 eccc_rpnatm 0 Sep 27 12:21 /tmp/nvlog.log
$ cat /tmp/nvlog.log # Blank

It’s now been a week of radio silence on this topic; is there any suggestion for further investigation or debugging?

Hi, @csubich

Sorry for the late response; we had a public holiday.
I have forwarded your log to our dev team to check and will let you know if there is any response.

Hi, @csubich

Which Linux distribution are you running, and do you know if you run in Secure Execution Mode?
LD_PRELOAD is known not to work in Secure Execution Mode, which may cause this problem.

The distribution is Red Hat Enterprise Linux 8.3, and as far as I can tell Secure Execution Mode is not enabled.
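
For reference, the sort of check I mean (a sketch; cuda-stream is the same benchmark binary as above): secure-execution mode is triggered by setuid/setgid bits or file capabilities on the executable, and glibc’s dynamic linker can dump the auxiliary vector that records it. (If secure-execution mode were active, LD_SHOW_AUXV itself would be ignored, so a missing AT_SECURE line would also be a hint.)

$ ls -l ./cuda-stream                             # look for setuid/setgid bits
$ getcap ./cuda-stream                            # look for file capabilities
$ LD_SHOW_AUXV=1 ./cuda-stream | grep AT_SECURE   # AT_SECURE should be 0 outside secure-execution mode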