Nsys is not collecting kernel data

Hi,
When I try to profile with nsys, no CUDA kernel data is collected. When I open the report in Nsight Systems, the kernels don't show up in the timeline.

I'm running the following command in WSL2 on Windows 11: sudo nsys profile --stats=true -t cuda <app path>. The output is:

Generating '/tmp/nsys-report-48d3.qdstrm'
[1/6] [========================100%] report6.nsys-rep
[2/6] [========================100%] report6.sqlite
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)                Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ---------------------------------
     86.4       5726048558       1182  4844372.7   614840.0      3099  88526493   11888877.3  cuLibraryLoadData
      9.4        621915573        633   982489.1   745363.0       498  15934364    2063962.5  cudaDeviceSynchronize
      1.7        111028852        458   242421.1   115570.5       615   5818108     546821.2  cudaStreamSynchronize
      0.7         45797514         50   915950.3     3745.5       970  29626149    4344456.9  cudaFree
      0.6         42045311       4828     8708.6     6155.0      2567    245540      11405.5  cudaLaunchKernel
      0.3         23154932        167   138652.3    73144.0     39458   2466145     259722.2  cudaMemcpy
      0.3         20672259        637    32452.5    40169.0      6110    238175      27230.1  cudaMemcpyAsync
      0.2         11414782          6  1902463.7     7085.0       978   5858052    2941052.4  cudaStreamCreateWithFlags
      0.2         10872657       1383     7861.6     4521.0      1598    346083      14211.1  cudaMemsetAsync
      0.1          9897352         44   224939.8    14274.0      2459   1493941     400263.1  cudaMalloc
      0.0          2221999       2469      900.0      572.0       383     52343       1860.8  cudaEventRecord
      0.0          1075344          1  1075344.0  1075344.0   1075344   1075344          0.0  cuLibraryUnload
      0.0           784357          9    87150.8    69571.0     27117    206559      61072.6  cudaMemcpyToSymbol
      0.0           335479        104     3225.8      694.5       316     54593       7733.6  cudaEventCreateWithFlags
      0.0           217769       1149      189.5      159.0        91      4141        168.1  cuGetProcAddress_v2
      0.0            95910        107      896.4      404.0       271     13236       1812.6  cudaEventDestroy
      0.0            36526          7     5218.0     3884.0       883     17516       5929.3  cudaStreamDestroy
      0.0            28425         28     1015.2      852.5       333      2766        684.1  cudaEventQuery
      0.0            15508          3     5169.3     2724.0      1197     11587       5610.1  cudaEventCreate
      0.0             4239          3     1413.0     1491.0      1190      1558        196.0  cuInit
      0.0             3002          1     3002.0     3002.0      3002      3002          0.0  cudaStreamCreate
      0.0             2918          1     2918.0     2918.0      2918      2918          0.0  cudaGetDeviceProperties_v2_v12000
      0.0             1993          4      498.3      204.0       151      1434        624.4  cuModuleGetLoadingMode

[4/6] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: <path>/report6.sqlite does not contain CUDA kernel data.
[5/6] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: <path>/report6.sqlite does not contain GPU memory data.
[6/6] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: <path>/report6.sqlite does not contain GPU memory data.
Generated:
    <path>/report6.nsys-rep
    <path>/report6.sqlite

As you can see, there is no CUDA kernel data at all.

Nvidia-smi output:

nvidia-smi
Thu Mar  2 11:41:30 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A200...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P8     7W /  35W |   1244MiB /  8192MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Nsys status:

 sudo nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.90.1-microsoft-standard-WSL2: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

Nsys version:

nsys --version
NVIDIA Nsight Systems version 2023.1.2.43-32377213v0

Hope you are able to help.

Can you attach the .nsys-rep file you are working with?

I sent you the .nsys-rep file as a PM.

Looping in @rknight to assist with this.

Hi orjan.grefstad,

I looked at the report1.nsys-rep file. I assume this is the file that you provided.

I noticed a couple of entries in the Diagnostics section of the nsys-rep file that could be hints about the issue. See the following two entries:

Installed CUDA driver version (12.0) is not supported by this build of Nsight Systems. CUDA trace will be collected using libraries for driver version 12.1

CUDA device 0: Unified Memory cannot be traced on devices that don’t support peer-to-peer transfers. Please verify that SLI/NVLink is functioning properly.

Is it possible that you have a prototype driver installed? Since this is running in the WSL2 environment, @jasoncohen might also have some insight into this issue.

Sorry, collecting CUDA kernel launches under WSL2 is not yet supported. I am actively working on that right now and expect it to ship in the upcoming Nsight Systems release.

OK, looking forward to the next release. It would be nice if you could keep your guide up to date: 1. NVIDIA GPU Accelerated Computing on WSL 2 — CUDA on WSL 12.3 documentation. As far as I can tell, it says this should be supported:

Developer tools - Profilers - Volta and later (Using driver r525+ and Windows 11)
but there is also a contradicting note further down:
Developers who require profiling support are encouraged to find alternatives in the meanwhile.
I updated to Windows 11 to get this feature working. Unfortunately I’m stuck with Windows and WSL due to work restrictions.

@jasoncohen did this ship in the new Nsight Systems release (2023.2.1)? I'm still getting a similar error to the OP's:

$ nsys profile --stats=true -t cuda python profiler_test2.py
/home/<me>/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:319: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
Initializing NVTX monkey patches
Done with NVTX monkey patching
Ready!
Generating '/tmp/nsys-report-a987.qdstrm'
[1/6] [========================100%] report2.nsys-rep
[2/6] [========================100%] report2.sqlite
[3/6] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)    Max (ns)   StdDev (ns)                Name
 --------  ---------------  ---------  ------------  ------------  ----------  ----------  -----------  ---------------------------------
     88.0       3345133423          2  1672566711.5  1672566711.5  1425984287  1919149136  348720209.0  cudaDeviceSynchronize
      6.8        259351440        404      641959.0       11596.5        6472   135218111    8925118.5  cudaLaunchKernel
      5.2        196952259          2    98476129.5    98476129.5    92401193   104551066    8591257.6  cudaMalloc
      0.0            25199          7        3599.9        1853.0         772        9127       3362.3  cudaStreamIsCapturing_v10000
      0.0            11872          1       11872.0       11872.0       11872       11872          0.0  cudaGetDeviceProperties_v2_v12000
      0.0             1704          1        1704.0        1704.0        1704        1704          0.0  cuModuleGetLoadingMode

[4/6] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: <path>/report2.sqlite does not contain CUDA kernel data.
[5/6] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: <path>/report2.sqlite does not contain GPU memory data.
[6/6] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: <path>/report2.sqlite does not contain GPU memory data.
Generated:
    <path>/report2.nsys-rep
    <path>/report2.sqlite

nsys --version gives NVIDIA Nsight Systems version 2023.2.1.122-32598524v0

I’m also seeing similar issues in WSL2. I’m using NVIDIA Nsight Systems version 2023.1.2.43-32377213v0 and am seeing this output from a minimal example.

/usr/local/cuda-12.1/bin/nsys profile --stats=true -t cuda -c cudaProfilerApi python test2.py
Capture range started in the application.
Generating '/tmp/nsys-report-ff83.qdstrm'
Capture range ended in the application.
[1/6] [========================100%] report9.nsys-rep
[2/6] [========================100%] report9.sqlite
[3/6] Executing 'cuda_api_sum' stats report
SKIPPED: <path>/report9.sqlite does not contain CUDA trace data.
[4/6] Executing 'cuda_gpu_kern_sum' stats report
SKIPPED: <path>/report9.sqlite does not contain CUDA kernel data.
[5/6] Executing 'cuda_gpu_mem_time_sum' stats report
SKIPPED: <path>/report9.sqlite does not contain GPU memory data.
[6/6] Executing 'cuda_gpu_mem_size_sum' stats report
SKIPPED: <path>/report9.sqlite does not contain GPU memory data.
Generated:
    <path>/report9.nsys-rep
    <path>/report9.sqlite

The code I am testing is just a simple Python script using CuPy version 12.0.0, shown below.

import cupy as cp
import cupyx as cpx


add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
''', 'my_add')

# cpx.profiler.profile() marks the capture range (cudaProfilerStart/Stop)
# that nsys -c cudaProfilerApi keys on.
with cpx.profiler.profile():
    num = 1024
    # The kernel expects float*, so the arrays must be float32.
    a = cp.random.normal(0, 1.0, num, dtype=cp.float32)
    b = cp.random.normal(0, 1.0, num, dtype=cp.float32)
    c = cp.zeros(num, dtype=cp.float32)
    add_kernel((1,), (1024,), (a, b, c))
    out = cp.sum(c)
    print(out)
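
Even without GPU kernel traces under WSL2, the CUDA API calls still show up on the CPU timeline, and NVTX ranges can mark where the launches happen. A minimal sketch along the same lines with explicit NVTX ranges (assuming cupy.cuda.nvtx is available and nsys is run with -t cuda,nvtx) would look like this; the ranges at least show where the launches sit on the CPU timeline even while the GPU rows are empty:

import cupy as cp
import cupyx as cpx
from cupy.cuda import nvtx

add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float* y) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    y[tid] = x1[tid] + x2[tid];
}
''', 'my_add')

with cpx.profiler.profile():
    num = 1024
    nvtx.RangePush('alloc')            # shows up on the NVTX row in the timeline
    a = cp.random.normal(0, 1.0, num, dtype=cp.float32)
    b = cp.random.normal(0, 1.0, num, dtype=cp.float32)
    c = cp.zeros(num, dtype=cp.float32)
    nvtx.RangePop()

    nvtx.RangePush('my_add + sum')     # brackets the kernel launches on the CPU side
    add_kernel((1,), (1024,), (a, b, c))
    out = cp.sum(c)
    nvtx.RangePop()
    print(out)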

@jasoncohen Hi, any updates on when there will be WSL support?

Hello, Mr. jasoncohen, is there any update available for WSL2?

I’ll ping him directly.

Hello, it seems nsys still cannot collect kernel data under WSL2.
Is there any good news?

Sorry for the slow response here. WSL2 support didn’t make it into 2023.3, but I am planning to get it into 2023.4.

If you’re curious what the hold-up is: the major challenge is synchronizing timestamps acquired in user mode with timestamps acquired in kernel mode. In WSL2 the kernel-mode drivers run outside the VM while user-mode code runs inside it, and the hypervisor applies different TSC offsets inside and outside the VM. For security, we do CPU/GPU time synchronization from the kernel-mode driver, so in WSL this means we now also have to do user/kernel synchronization of the CPU timestamps. There’s no direct way to access these offsets, so we measure them as best we can.

Also, Nsight Systems performs much better when using TSC timestamps (one or two CPU instructions, taking only a few ns) instead of platform-API timestamps (taking hundreds of ns). We prefer TSC mode, but we check whether the CPU supports a reliable TSC and fall back to platform-API mode if not. WSL’s VM unfortunately clears the CPUID bit that claims the TSC is reliable, despite the fact that it actually is reliable as long as at least one user process keeps running in WSL, which the nsys daemon does during data collection. So we’ve had to build a somewhat more complex method of detecting TSC reliability given WSL/Hyper-V’s behavior.

Finally, platform-API mode still needs to work, and that involves measuring not just an offset between user mode and kernel mode but a linear mapping, so acquiring the synchronization points in platform-API mode is a bit more involved. Most of this work is done now, so I expect it to ship in 2023.4.
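
To illustrate the "linear mapping, not just an offset" point, here is a rough Python sketch. This is not how Nsight Systems does it internally; two ordinary Python clocks simply stand in for the user-mode and kernel-mode sources so the fitting step can be shown:

import time

# Collect paired samples from two clock sources. In the real problem the two
# sources are user-mode and kernel-mode timestamps; here two Python clocks
# merely stand in.
pairs = []
for _ in range(1000):
    t_user = time.perf_counter_ns()   # stand-in for the user-mode clock
    t_kern = time.time_ns()           # stand-in for the kernel-mode clock
    pairs.append((t_user, t_kern))

# Work relative to the first sample to keep the numbers small, then do a
# least-squares fit of t_kern ~= slope * t_user + offset. The slope term is
# the point: the two sources may tick at slightly different rates, so a
# single constant offset is not enough.
x0, y0 = pairs[0]
xs = [x - x0 for x, _ in pairs]
ys = [y - y0 for _, y in pairs]
n = len(pairs)
sx, sy = sum(xs), sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
offset = (sy - slope * sx) / n

def to_kernel_time(t_user_ns):
    # Map a user-mode timestamp onto the kernel-mode timeline.
    return y0 + slope * (t_user_ns - x0) + offset

print('slope:', slope, 'offset (ns):', offset)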

Did this not make it into 2023.4 either? I'm still getting the same errors with NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0.

@jasoncohen Hi Jason, could you please provide further updates on CUDA kernel data collection support in WSL? Thanks.

Hi, Jason.

Are you able to provide an update on when you roughly expect nsys to be able to provide CUDA traces under WSL2? I’m running NVIDIA Nsight Systems version 2023.4.1.97-234133557503v0, but I’m still seeing the warnings about missing CUDA data. Thank you.

Does profiling work with nsys version 2024.1.1?

Me too, I need help with this.

Hello, is there any good news now? Hoping for some progress.