I’m encountering a problem similar to other threads where no definitive resolution was ever found: my system will find and profile CUDA kernels (Python or otherwise) with nsys, but not with ncu. For example, following the ALCF stream benchmark tutorial, nsys works and reports that kernels exist:
$ nsys -v
NVIDIA Nsight Systems version 2023.2.3.1004-33186433v0
$ nsys profile --stats=true -t cuda ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.081068 s (=9933.704586 MBytes/sec)
Read: 0.000795 s (=1012682.471917 MBytes/sec)
Function MBytes/sec Min (sec) Max Average
Copy 1378238.901 0.00039 0.00039 0.00039
Mul 1343356.075 0.00040 0.00041 0.00040
Add 1365405.096 0.00059 0.00059 0.00059
Triad 1374085.843 0.00059 0.00059 0.00059
Dot 1338072.742 0.00040 0.00041 0.00041
Generating '/var/tmp/pbs.327750.sc5pbs-001-ib/nsys-report-27a8.qdstrm'
[1/6] [========================100%] report3.nsys-rep
[2/6] [========================100%] report3.sqlite
[3/6] Executing 'cuda_api_sum' stats report
Time (%) Total Time (ns) Num Calls Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- ----------- ----------- --------- ---------- ----------- ---------------------------------
58.9 196,051,613 401 488,906.8 524,707.0 385,636 588,752 96,540.4 cudaDeviceSynchronize
36.5 121,424,138 103 1,178,875.1 403,723.0 396,596 27,189,162 4,497,385.9 cudaMemcpy
2.0 6,814,547 2 3,407,273.5 3,407,273.5 3,314,304 3,500,243 131,478.7 cudaGetDeviceProperties_v2_v12000
1.7 5,660,376 4 1,415,094.0 942,962.5 251,390 3,523,061 1,448,320.3 cudaMalloc
0.6 1,872,954 501 3,738.4 2,978.0 2,627 264,191 11,741.0 cudaLaunchKernel
0.3 930,521 4 232,630.3 223,240.5 70,699 413,341 140,318.0 cudaFree
0.0 1,354 1 1,354.0 1,354.0 1,354 1,354 0.0 cuModuleGetLoadingMode
[4/6] Executing 'cuda_gpu_kern_sum' stats report
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------------------------------------------------------
25.0 58,246,173 100 582,461.7 582,527.0 581,279 583,775 479.3 void add_kernel<double>(const T1 *, const T1 *, T1 *)
24.8 57,865,436 100 578,654.4 578,623.5 577,663 579,838 475.0 void triad_kernel<double>(T1 *, const T1 *, const T1 *)
16.9 39,249,761 100 392,497.6 392,479.0 391,008 394,495 638.4 void mul_kernel<double>(T1 *, const T1 *)
16.6 38,734,716 100 387,347.2 387,375.0 384,256 391,519 1,509.0 void dot_kernel<double>(const T1 *, const T1 *, T1 *, int)
16.4 38,270,655 100 382,706.6 382,624.0 381,087 393,472 1,234.3 void copy_kernel<double>(const T1 *, T1 *)
0.2 521,759 1 521,759.0 521,759.0 521,759 521,759 0.0 void init_kernel<double>(T1 *, T1 *, T1 *, T1, T1, T1)
[5/6] Executing 'cuda_gpu_mem_time_sum' stats report
Time (%) Total Time (ns) Count Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Operation
-------- --------------- ----- --------- -------- -------- ---------- ----------- ------------------
100.0 80,716,337 103 783,653.8 2,304.0 1,792 27,002,161 4,533,035.9 [CUDA memcpy DtoH]
[6/6] Executing 'cuda_gpu_mem_size_sum' stats report
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
805.652 103 7.822 0.003 0.003 268.435 45.360 [CUDA memcpy DtoH]
[...]
… but ncu does not. This is with a freshly downloaded copy of ncu (the system-level install, version 2023.2.2.0, behaves the same way):
$ ~/nsight/ncu -v
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
$ ~/nsight/ncu ./cuda-stream
BabelStream
Version: 5.0
Implementation: CUDA
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using CUDA device NVIDIA A100-SXM4-40GB
Driver: 12020
Memory: DEFAULT
Reduction kernel config: 432 groups of (fixed) size 1024
Init: 0.079844 s (=10086.003737 MBytes/sec)
Read: 0.000726 s (=1109384.116908 MBytes/sec)
Function MBytes/sec Min (sec) Max Average
Copy 1405071.807 0.00038 0.00039 0.00039
Mul 1361593.605 0.00039 0.00040 0.00040
Add 1364852.022 0.00059 0.00060 0.00059
Triad 1371100.648 0.00059 0.00060 0.00059
Dot 1351499.246 0.00040 0.00041 0.00040
==WARNING== No kernels were profiled.
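If it helps narrow things down, I can also rerun with more explicit scoping. A sketch of what I have in mind, assuming the standard ncu CLI flags (the kernel-name regex is just illustrative):

$ ~/nsight/ncu --target-processes all ./cuda-stream
$ ~/nsight/ncu --devices 0 --kernel-name "regex:.*kernel.*" --launch-count 1 ./cuda-stream

I’m happy to post the output of any of these if it would be useful.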
System device information (note that this is a shared, centrally-managed system, and I have no ready access to either sudo or driver updates):
$ nvidia-smi
Tue Sep 24 15:29:34 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:17:00.0 Off | 0 |
| N/A 36C P0 79W / 400W | 4MiB / 40960MiB | 86% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:31:00.0 Off | 0 |
| N/A 39C P0 53W / 400W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-40GB On | 00000000:B1:00.0 Off | 0 |
| N/A 55C P0 343W / 400W | 30144MiB / 40960MiB | 98% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-40GB On | 00000000:CA:00.0 Off | 0 |
| N/A 47C P0 113W / 400W | 38834MiB / 40960MiB | 80% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
$ echo $CUDA_VISIBLE_DEVICES
GPU-74927176-aee1-5f59-0810-869856abe095
$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-787fd111-7bc2-a11d-fab3-c02ed8a14e17)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-74927176-aee1-5f59-0810-869856abe095)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-0f8b9778-7976-890c-cc17-ef350b6e72de)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-c8b3fcd7-1317-bd52-8ba0-ce108833882a)
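(Note that CUDA_VISIBLE_DEVICES is set to the UUID of GPU 1. To rule out a device-selection issue, one thing I could try is selecting the device by index instead, e.g.:

$ CUDA_VISIBLE_DEVICES=1 ~/nsight/ncu ./cuda-stream

though I have no particular reason to believe that is the cause.)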
My ultimate goal is to compute FLOPs (or, better yet, roofline plots) for a particular JAX model under various hyperparameter settings. Unfortunately, JAX doesn’t provide valid FLOP estimates for kernels that use cuBLAS (i.e., all of the interesting ones), so I’m left with practical, runtime monitoring via CUPTI events.
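For reference, once ncu does pick up the kernels, the plan is to collect per-kernel FLOP and DRAM-byte counters and build the roofline from those. A rough sketch of the intended invocation, using the counter names from NVIDIA’s roofline methodology (the script name is just a placeholder, and the metric list would need extending for the single- and half-precision paths the model actually uses):

$ ncu --target-processes all \
    --metrics sm__sass_thread_inst_executed_op_dadd_pred_on.sum,\
sm__sass_thread_inst_executed_op_dmul_pred_on.sum,\
sm__sass_thread_inst_executed_op_dfma_pred_on.sum,\
dram__bytes.sum,gpu__time_duration.sum \
    -o jax_roofline python my_jax_model.py

Per kernel, double-precision FLOPs would then be dadd + dmul + 2*dfma, and arithmetic intensity that total divided by dram__bytes.sum.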