Nsight Compute not finding kernels

I’m encountering a problem similar to other threads where no definitive resolution was ever found: my system will find and profile CUDA kernels (Python or otherwise) with nsys, but not with ncu. For example, following the ALCF stream benchmark tutorial, nsys works and reports that kernels exist:

$ nsys -v
NVIDIA Nsight Systems version 2023.2.3.1004-33186433v0
$ nsys profile --stats=true -t cuda ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.081068 s (=9933.704586 MBytes/sec)
Read: 0.000795 s (=1012682.471917 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1378238.901 0.00039     0.00039     0.00039
Mul         1343356.075 0.00040     0.00041     0.00040
Add         1365405.096 0.00059     0.00059     0.00059
Triad       1374085.843 0.00059     0.00059     0.00059
Dot         1338072.742 0.00040     0.00041     0.00041
Generating '/var/tmp/pbs.327750.sc5pbs-001-ib/nsys-report-27a8.qdstrm'
[1/6] [========================100%] report3.nsys-rep
[2/6] [========================100%] report3.sqlite
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls   Avg (ns)     Med (ns)    Min (ns)    Max (ns)   StdDev (ns)                Name
 --------  ---------------  ---------  -----------  -----------  ---------  ----------  -----------  ---------------------------------
     58.9      196,051,613        401    488,906.8    524,707.0    385,636     588,752     96,540.4  cudaDeviceSynchronize
     36.5      121,424,138        103  1,178,875.1    403,723.0    396,596  27,189,162  4,497,385.9  cudaMemcpy
      2.0        6,814,547          2  3,407,273.5  3,407,273.5  3,314,304   3,500,243    131,478.7  cudaGetDeviceProperties_v2_v12000
      1.7        5,660,376          4  1,415,094.0    942,962.5    251,390   3,523,061  1,448,320.3  cudaMalloc
      0.6        1,872,954        501      3,738.4      2,978.0      2,627     264,191     11,741.0  cudaLaunchKernel
      0.3          930,521          4    232,630.3    223,240.5     70,699     413,341    140,318.0  cudaFree
      0.0            1,354          1      1,354.0      1,354.0      1,354       1,354          0.0  cuModuleGetLoadingMode

[4/6] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                             Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------
     25.0       58,246,173        100  582,461.7  582,527.0   581,279   583,775        479.3  void add_kernel<double>(const T1 *, const T1 *, T1 *)
     24.8       57,865,436        100  578,654.4  578,623.5   577,663   579,838        475.0  void triad_kernel<double>(T1 *, const T1 *, const T1 *)
     16.9       39,249,761        100  392,497.6  392,479.0   391,008   394,495        638.4  void mul_kernel<double>(T1 *, const T1 *)
     16.6       38,734,716        100  387,347.2  387,375.0   384,256   391,519      1,509.0  void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
     16.4       38,270,655        100  382,706.6  382,624.0   381,087   393,472      1,234.3  void copy_kernel<double>(const T1 *, T1 *)
      0.2          521,759          1  521,759.0  521,759.0   521,759   521,759          0.0  void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)

[5/6] Executing 'cuda_gpu_mem_time_sum' stats report

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)  Min (ns)   Max (ns)   StdDev (ns)      Operation
 --------  ---------------  -----  ---------  --------  --------  ----------  -----------  ------------------
    100.0       80,716,337    103  783,653.8   2,304.0     1,792  27,002,161  4,533,035.9  [CUDA memcpy DtoH]

[6/6] Executing 'cuda_gpu_mem_size_sum' stats report

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
    805.652    103     7.822     0.003     0.003   268.435       45.360  [CUDA memcpy DtoH]
 [...]

… but ncu does not. Here I’m using a freshly downloaded copy of ncu (the system-level install, version 2023.2.2.0, behaves the same way):

$ ~/nsight/ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
$ ~/nsight/ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079844 s (=10086.003737 MBytes/sec)
Read: 0.000726 s (=1109384.116908 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1405071.807 0.00038     0.00039     0.00039
Mul         1361593.605 0.00039     0.00040     0.00040
Add         1364852.022 0.00059     0.00060     0.00059
Triad       1371100.648 0.00059     0.00060     0.00059
Dot         1351499.246 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.

System device information (note that this is a shared, centrally-managed system, and I have no ready access to either sudo or driver updates):

$ nvidia-smi
Tue Sep 24 15:29:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0              79W / 400W |      4MiB / 40960MiB |     86%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          On  | 00000000:31:00.0 Off |                    0 |
| N/A   39C    P0              53W / 400W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          On  | 00000000:B1:00.0 Off |                    0 |
| N/A   55C    P0             343W / 400W |  30144MiB / 40960MiB |     98%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   47C    P0             113W / 400W |  38834MiB / 40960MiB |     80%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
$ echo $CUDA_VISIBLE_DEVICES
GPU-74927176-aee1-5f59-0810-869856abe095
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-787fd111-7bc2-a11d-fab3-c02ed8a14e17)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-74927176-aee1-5f59-0810-869856abe095)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-0f8b9778-7976-890c-cc17-ef350b6e72de)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-c8b3fcd7-1317-bd52-8ba0-ce108833882a)

My ultimate goal is to compute flops (or, better yet, roofline plots) for a particular JAX model under various sets of hyperparameter options. Unfortunately, JAX doesn’t provide valid flops estimates for kernels that use cuBLAS (so all of the interesting ones), so I’m left with runtime/practical monitoring via CUPTI events.
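
For reference, once ncu does attach, the kind of measurement I have in mind looks roughly like this (a sketch only: train.py is a hypothetical JAX driver script, and the metrics listed are the double-precision FLOP and DRAM-traffic counters commonly used for A100 roofline analysis, to be adjusted for whatever precision the model actually runs in):

$ ncu --target-processes all -o jax_roofline \
      --metrics sm__sass_thread_inst_executed_op_dadd_pred_on.sum,sm__sass_thread_inst_executed_op_dmul_pred_on.sum,sm__sass_thread_inst_executed_op_dfma_pred_on.sum,dram__bytes.sum \
      python train.py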

Hi, @csubich

Sorry for the issue you’ve met.
Can you confirm whether ncu works for another simple CUDA sample?

No, ncu does not work for other CUDA samples. My final intended use is a more complex Python code, but I picked up the ALCF tutorial code for this case to reduce the possibility of user error.

Hi, @csubich

$ ~/nsight/ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)

Your output shows the NCU version is 2024.3.1.0. Can you double-confirm that 2023.2.2.0 has the same issue?

Yes, I can double-confirm that 2023.2.2.0 behaves the same way:

$ which ncu
/usr/local/cuda/bin/ncu
$ ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.2.0 (build 33188574) (public-release)
$ ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079728 s (=10100.622983 MBytes/sec)
Read: 0.000773 s (=1041366.435757 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1404954.144 0.00038     0.00039     0.00038
Mul         1361262.176 0.00039     0.00040     0.00040
Add         1363959.717 0.00059     0.00060     0.00059
Triad       1369047.111 0.00059     0.00060     0.00059
Dot         1357266.692 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

I checked against the currently-newest 2024.3.1.0 to help rule out any already-fixed bugs that might have been triggered.

Hi, @csubich

Thanks for the update. NCU 2023.2.2.0 + driver 535.104.05 + A100 works in our internal environment; it has been tested multiple times. So it is hard to tell why it fails in your environment.

Is there any other special setting? How do you share the machine? Are you running in Docker or a VM? Has it ever worked with any other version?

You can create a file with the content below and set the NVLOG_CONFIG_FILE environment variable to point to it.
Then run ncu; you’ll get a log recorded in /tmp/nvlog.log.

$ /tmp/nvlog.log
UseStdout
ForceFlush
Format |$time|$sev:${level:-3}|$proc|$tid|$name>> $text

  • 0i 0w 100ef 0IW 100EF global
    -100i 100w 100ef 0IW 100EF regop_tgt_dta

Is there any other special setting? How do you share the machine?

The machine is shared via a PBS Professional queuing system. The examples I’ve shown so far have all been from interactive jobs. I’m not aware of any system configuration out of the ordinary.

Are you running in Docker or a VM?

No, the execution environment is not virtualized.
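
A quick cross-check, assuming systemd is available on the compute node, is systemd-detect-virt, which prints “none” on bare metal and otherwise names the hypervisor or container runtime:

$ systemd-detect-virt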

Has it ever worked with any other version?

I’ve not tried with other versions, so I don’t know if there was a historic time that this might have worked.

You can create a file with the content below and set the NVLOG_CONFIG_FILE environment variable to point to it. Then run ncu; you’ll get a log recorded in /tmp/nvlog.log.

Your formatting is garbled here; please use Markdown’s literal representation for code-like items such as this. Knowing now to search for nvlog.config, I found a template in the nsight-compute distribution and used that; its contents are included in the following transcript.

$ cp ~/nsight/host/target-linux-x64/nvlog.config.template ./nvlog.config
$ export NVLOG_CONFIG_FILE=$(pwd)/nvlog.config
$ cat nvlog.config
# Rename this file to nvlog.config and put into one of the two places:
#   * Next to the nsys-ui binary.
#   * Into your current directory.
                                       
# Enable all loggers:
+ 75iw 75ef 0IW 0EF   global
# Except for too verbose ones                           
- quadd_verbose_                           
                                             
# Append logs to a file:                               
$ nsys-ui.log                                          
                                                       
# Flush the log file after every log message:          
ForceFlush                                             
                                                       
# On Windows, use OutputDebugString():
OutputDebugString                                                                                    
                                             
# Log into stderr:                                                                                                  
UseStderr                    
      
# Specify logging format:                                                                      
# Simple format                                                                                
# Format $time $tid $name $text                                                                                                       

# A more verbose variant of logging format:
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text
$ ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079682 s (=10106.451581 MBytes/sec)
Read: 0.000721 s (=1116398.673031 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1395241.814 0.00038     0.00039     0.00039
Mul         1357345.617 0.00040     0.00040     0.00040
Add         1370426.335 0.00059     0.00059     0.00059
Triad       1379641.443 0.00058     0.00059     0.00059
Dot         1355594.274 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
$ cat nsys-ui.log
I12:14:11:465|TPS_Comms|1793889|:10[]: Creating AsioAsyncActionProcessor - Background 0x7c108b0
I12:14:11:465|TPS_Comms|1793889|:10[]: Creating AsioAsyncActionProcessor - Background 0x7c41120
I12:14:11:466|CmdlineProfiler|1793889|:2518[]: Using temporary file /var/tmp/pbs.347468.sc5pbs-001-ib/nsight-compute-174f-4024.ncu-rep
W12:14:11:508|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:508|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:509|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:509|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:509|NvRules|1793889|:511[]: Failed to add python file /home/csu001/Documents/NVIDIA Nsight Compute/2023.2.2/Sections/NvRules.py as rule
W12:14:11:510|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:510|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:512|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:512|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:513|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:513|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:515|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:515|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:517|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:517|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:518|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:518|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:519|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:519|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:521|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:521|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:522|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:522|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:524|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:524|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:526|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:526|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:527|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:527|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:529|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:529|NvRules|1793889|:245[]: QueryKind returned error 3
W12:14:11:532|NvRules|1793910|:327[]: Failed to find function: get_kind
W12:14:11:532|NvRules|1793889|:245[]: QueryKind returned error 3
I12:14:11:532|CmdlineProfiler|1793889|:228[]: Expanded 0 macros in file name
I12:14:11:532|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:11:532|pipe_utilities|1793889|:36[]: Contruct Pipe
I12:14:11:532|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:11:532|pipe_utilities|1793889|:57[]: Create Pipe from name (isOwner: 1)
I12:14:11:532|pipe_utilities|1793889|:26[]: Contruct PipeDescriptor (isOwner: 1)
I12:14:11:532|pipe_utilities|1793889|:67[]: Creating pipe /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0
I12:14:11:532|pipe_utilities|1793889|:91[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerReadingStream for reading (8)
I12:14:11:532|pipe_utilities|1793889|:96[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerWritingStream for writing (9)
I12:14:11:533|pipe_utilities|1793911|:36[]: Contruct Pipe
I12:14:11:533|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:11:533|pipe_utilities|1793911|:57[]: Create Pipe from name (isOwner: 0)
I12:14:11:533|pipe_utilities|1793911|:26[]: Contruct PipeDescriptor (isOwner: 0)
I12:14:11:533|pipe_utilities|1793911|:91[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerWritingStream for reading (10)
I12:14:11:533|pipe_utilities|1793911|:96[]: Opened pipe fd /var/tmp/pbs.347468.sc5pbs-001-ib/NVIDIA-treetracker-d2qNpA/pipe-0-ownerReadingStream for writing (11)
I12:14:11:534|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:11:535|pipe_utilities|1793911|:19[]: Contruct PipeDescriptor
I12:14:12:673|CmdlineProfiler|1793889|:1307[]: End
I12:14:12:673|CmdlineProfiler|1793889|:1331[]: Profiler shutdown requested
I12:14:12:673|pipe_utilities|1793889|:110[]: Close Pipe (isOwner: 1)

I12:14:12:673|pipe_utilities|1793889|:19[]: Contruct PipeDescriptor
I12:14:12:673|pipe_utilities|1793889|:45[]: Reset Pipe (closePipe: 1)
I12:14:12:674|CmdlineProfiler|1793889|:39[]: Destroying KernelReplayHelper
I12:14:12:674|CmdlineProfiler|1793889|:271[]: Destroying ProfileHelper
I12:14:12:674|TPS_Session|1793889|:151[]: ===== SessionManager async close all sessions. NumSessions = 0 =====
I12:14:12:674|TPS_Session|1793889|:177[]: SessionManager running the ActionProcessor waiting for sessions to complete AsyncClose.
I12:14:12:674|TPS_Session|1793889|:183[]: ===== SessionManager async close completed =====
I12:14:12:674|TPS_Comms|1793889|:17[]: Destroying AsioAsyncActionProcessor - Background - 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:27[]: AsioAsyncActionProcessor - Background - Stopping
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c41120
I12:14:12:674|TPS_Comms|1793889|:17[]: Destroying AsioAsyncActionProcessor - Background - 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:27[]: AsioAsyncActionProcessor - Background - Stopping
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c108b0
I12:14:12:674|TPS_Comms|1793889|:28[]: Stopping foreground action processor.
I12:14:12:674|TPS_Comms|1793889|:17[]: Destructing the action processor 0x7c38150
I12:14:12:674|TPS_Comms|1793889|:29[]: Stopping the action processor 0x7c38150

Another attempt with my reconstruction of your suggested nvlog.config file gives no logging output:

$ cat nvlog2.config # Log file listing
$ /tmp/nvlog.log
UseStdout
ForceFlush
Format |$time|$sev:${level:-3}|$proc|$tid|$name>> $text
+ 0i 0w 100ef 0IW 100EF global
-100i 100w 100ef 0IW 100EF regop_tgt_dta
$ export NVLOG_CONFIG_FILE=$(pwd)/nvlog2.config                                                                 
$ ncu --target-processes all ./cuda-stream                                                                      
BabelStream                                                 
Version: 5.0    
Implementation: CUDA
Running kernels 100 times
Precision: double                                      
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)          
Using CUDA device NVIDIA A100-SXM4-40GB                                                                                                                  
Driver: 12020                                               
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079660 s (=10109.295369 MBytes/sec)
Read: 0.000725 s (=1110064.618400 MBytes/sec)
Function    MBytes/sec  Min (sec)   Max         Average
Copy        1404237.559 0.00038     0.00039     0.00038
Mul         1360389.494 0.00039     0.00040     0.00040
Add         1369261.268 0.00059     0.00060     0.00059
Triad       1376688.568 0.00058     0.00060     0.00059
Dot         1352772.874 0.00040     0.00041     0.00040
==WARNING== No kernels were profiled.                   
$ ls -ls /tmp/nvlog.log
0 -rw-r--r-- 1 csu001 eccc_rpnatm 0 Sep 27 12:21 /tmp/nvlog.log
$ cat /tmp/nvlog.log # Blank

It’s now been a week of radio silence on this topic; is there any suggestion for further investigation or debugging?

Hi, @csubich

Sorry for the late response; we had a public holiday.
I have forwarded your log to our dev team to check and will let you know if there is any response.

Hi, @csubich

Which Linux distribution are you running, and do you know if you run in Secure Execution Mode?
LD_PRELOAD is known not to work in Secure Execution Mode, which may cause this problem.

The distribution is Red Hat Enterprise Linux 8.3, and as far as I can tell Secure Execution Mode is not enabled.
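
For reference, the sort of check I mean (a sketch; cuda-stream is the same benchmark binary as above): secure-execution mode is triggered by setuid/setgid bits or file capabilities on the executable, and glibc’s dynamic linker can dump the auxiliary vector that records it. (If secure-execution mode were active, LD_SHOW_AUXV itself would be ignored, so a missing AT_SECURE line would also be a hint.)

$ ls -l ./cuda-stream                             # look for setuid/setgid bits
$ getcap ./cuda-stream                            # look for file capabilities
$ LD_SHOW_AUXV=1 ./cuda-stream | grep AT_SECURE   # AT_SECURE should be 0 outside secure-execution mode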