Nsight Systems does not collect CUDA events

Hi everyone,

I am puzzled as to why I cannot get Nsight Systems to work properly. This is my first time using the profiler and posting here, so excuse me if the question turns out to be basic. I would be very glad if I could get some help.

I am trying to profile a Julia application I wrote using CUDA. I get the following error:

julia> CUDA.@profile #'some expression here using CUDA.jl' 
[ Info: Running under Nsight Systems, CUDA.@profile will automatically start the profiler

WARNING: CUDA tracing is required for cudaProfilerStart/Stop API support. Turning it on by default.
There are no active sessions.
ERROR: failed process: Process(/usr/local/bin/nsys stop, ProcessExited(1)) [1]

Stacktrace:...

caused by: Failed to compile PTX code (ptxas received signal 11)
If you think this is a bug, please file an issue and attach /tmp/jl_DLp64D.ptx
Stacktrace: ...

I’ve left out the stack traces as these are specific to Julia. Can post them if needed.

When launching with the profile command:

~$ nsys profile julia
End of file

I can get the profile session to start using the UI, but no CUDA events are recorded: “No CUDA events collected. Does the process use CUDA?”


I have a GeForce GTX 1050 Ti GPU.

This is the output of uname -a

~$ uname -a
Linux copenhagen 5.13.0-7620-generic #20~1634827117~21.04~874b071-Ubuntu SMP Fri Oct 29 15:06:55 UTC  x86_64 x86_64 x86_64 GNU/Linux

Output of cat /proc/sys/kernel/perf_event_paranoid

~$ cat /proc/sys/kernel/perf_event_paranoid
1
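For reference, my understanding is that Nsight Systems' CPU sampling requires perf_event_paranoid to be at most 2 (please verify the exact threshold against the docs for your version; the value 2 below is my assumption). A quick self-check:

```shell
# Check whether unprivileged CPU sampling is likely to be permitted.
# The threshold of 2 is an assumption based on my reading of the
# Nsight Systems docs; confirm it for your version.
paranoid=$(cat /proc/sys/kernel/perf_event_paranoid)
if [ "$paranoid" -le 2 ]; then
  echo "perf_event_paranoid=$paranoid: sampling should be allowed"
else
  echo "perf_event_paranoid=$paranoid: lower it temporarily, e.g."
  echo "  sudo sysctl kernel.perf_event_paranoid=2"
fi
```

With the value 1 shown above, sampling permissions should not be the problem here.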

This is the output of nvidia-smi

~$ nvidia-smi
Mon Nov 22 08:51:19 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 30%   39C    P0    N/A /  75W |    965MiB /  4034MiB |      8%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Output of /usr/local/bin/nsys --version

~$ /usr/local/bin/nsys --version
NVIDIA Nsight Systems version 2021.5.1.77-4a17e7d

By the way, Nsight Systems doesn’t work for CUDA C either. I compiled an example under /usr/lib/cuda/samples/0_Simple/vectorAdd and still get the same error:

~:/usr/lib/cuda/samples/0_Simple/vectorAdd$ sudo make
~:/usr/lib/cuda/samples/0_Simple/vectorAdd$ ls
Makefile  NsightEclipse.xml  readme.txt  vectorAdd  vectorAdd.cu  vectorAdd.o
~:/usr/lib/cuda/samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
~:/usr/lib/cuda/samples/0_Simple/vectorAdd$ nsys profile vectorAdd
End of file

Just to rule out this being an error coming from the Julia side of things.

@liuyis can you take a look at this?

Hi @cozmaden, could you check whether the following command works?

~:/usr/lib/cuda/samples/0_Simple/vectorAdd$ nsys profile -t none -s none --cpuctxsw=none vectorAdd

Unfortunately, I had to resolve the problem quickly to keep working on a project. I reinstalled my operating system, since I was only testing Pop!_OS for a limited time.

I am currently back on an Arch-based distro (EndeavourOS) with the latest drivers and toolkit versions from pacman, and I did not encounter this problem there.

So I can only speculate now. It might have been a problem with the older drivers available via apt on Pop!_OS in combination with older toolkit versions.

Thanks for getting back anyway @hwilper @liuyis

I had similar issues with both Julia and C, and this command works for me. Why does it work?

But in the generated “report1.nsys-rep” file, the timeline is empty when I open it in the Nsight Systems UI.

This is caused by the -t none option.

From the manual:

If the none option is selected, no APIs are traced and no other API can be selected.
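In other words, at least one trace has to be selected for the timeline to contain anything. A minimal sketch of the difference (the command is echoed rather than executed, so this runs even where nsys is not installed; vectorAdd is the sample app from earlier in the thread):

```shell
# '-t none' disables every API trace, so the report's timeline is empty:
#   nsys profile -t none ./vectorAdd
# Passing an actual trace list (here just CUDA) records events again:
cmd="nsys profile -t cuda ./vectorAdd"
echo "$cmd"
```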

Hi @Alexandre_Chen, the command line I provided in Nsight Systems does not collect CUDA events - #4 by liuyis was just an attempt to narrow down the issue, not a solution. It disables all trace and sampling options, so the report will be empty.

Which Nsys version were you using? Do you hit the same “End of file” error even for the simple vectorAdd app?

I updated my WSL version and installed the latest NVIDIA driver for WSL on windows 11 and it seems this problem has gone. I am using Driver version 510.06, CUDA version 11.6, and NVIDIA Nsight Systems version 2021.2.4.12-a25c8fd now.


I was having the same issue.

➜  Poisson_Julia git:(master) nvidia-smi
Thu Dec 16 02:49:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 23%   42C    P3    26W / 120W |   2671MiB /  5910MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2432      G   /usr/lib/xorg/Xorg               1719MiB |
|    0   N/A  N/A      2579      G   /usr/bin/gnome-shell              199MiB |
|    0   N/A  N/A      2683      G   ...mviewer/tv_bin/TeamViewer        1MiB |
|    0   N/A  N/A      3267      G   ...AAAAAAAAA= --shared-files      109MiB |
|    0   N/A  N/A      4372      G   ...AAAAAAAAA= --shared-files      309MiB |
|    0   N/A  N/A      6392      G   ...AAAAAAAAA= --shared-files       92MiB |
|    0   N/A  N/A    172635      G   ...AAAAAAAAA= --shared-files      167MiB |
|    0   N/A  N/A    207779      G   ...ost-linux-x64/nsys-ui.bin       65MiB |
+-----------------------------------------------------------------------------+

I can’t use nsys to profile either Julia or a normal binary executable compiled with nvcc via nsys launch without specifying --trace=cuda; I see the “End of file” message as well. A lot of examples in online video tutorials just use nsys launch without specifying --trace=cuda, since that is supposed to be the default. Is this a bug in nsys?

This is an example

➜  CUDA_code nsys profile ./kernel_abc
End of file
➜  CUDA_code nsys profile --trace=cuda,cublas ./kernel_abc
Generating '/tmp/nsys-report-1d37.qdstrm'
[1/1] [========================100%] report6.nsys-rep
Generated:
    /home/alexandre/Code/CUDA_code/report6.nsys-rep
➜  CUDA_code nsys stats report6.nsys-rep 
Generating SQLite file report6.sqlite from report6.nsys-rep
Exporting 9521 events: [===================================================100%]
Using report6.sqlite for SQL queries.
Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/nvtxsum.py report6.sqlite]... SKIPPED: report6.sqlite does not contain NV Tools Extension (NVTX) data.

Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/osrtsum.py report6.sqlite]... SKIPPED: report6.sqlite does not contain OS Runtime trace data.

Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/cudaapisum.py report6.sqlite]... 

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)           Name         
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  ------------  ---------------------
     94.1    2,138,206,904          1  2,138,206,904.0  2,138,206,904.0  2,138,206,904  2,138,206,904           0.0  cudaDeviceSynchronize
      5.9      135,082,224          2     67,541,112.0     67,541,112.0        159,657    134,922,567  95,291,767.5  cudaMalloc           
      0.0           21,788          2         10,894.0         10,894.0          5,067         16,721       8,240.6  cudaMemset           
      0.0           14,896          3          4,965.3          3,223.0            666         11,007       5,386.2  cudaLaunchKernel     

Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/gpukernsum.py report6.sqlite]... 

 Time (%)  Total Time (ns)  Instances     Avg (ns)         Med (ns)        Min (ns)       Max (ns)     StdDev (ns)                   Name                  
 --------  ---------------  ---------  ---------------  ---------------  -------------  -------------  -----------  ---------------------------------------
     99.9    2,136,014,745          1  2,136,014,745.0  2,136,014,745.0  2,136,014,745  2,136,014,745          0.0  kernel_A(double *, int, int)           
      0.1        1,086,041          1      1,086,041.0      1,086,041.0      1,086,041      1,086,041          0.0  kernel_C(double *, const double *, int)

Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/gpumemtimesum.py report6.sqlite]... 

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)    Operation  
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  -------------
    100.0        1,121,392      2  560,696.0  560,696.0   543,084   578,308     24,907.1  [CUDA memset]

Running [/opt/nvidia/nsight-systems/2021.5.1/target-linux-x64/reports/gpumemsizesum.py report6.sqlite]... 

The binary executable was compiled from this code using nvcc.

@Alexandre_Chen Thanks for the information. nsys profile without any switches will turn on CUDA, NVTX, OSRT and OpenGL traces. There may be an issue with the OSRT (most likely), NVTX or OpenGL trace that caused the “End of file” error, which is why you don’t hit it when explicitly specifying --trace=cuda,cublas.

Are you still able to reproduce it? If so could you try nsys profile --trace=osrt -s none --cpuctxsw=none ./kernel_abc to confirm if it’s an OSRT issue?

Thanks

➜  CUDA_code nsys profile --trace=osrt -s none --cpuctxsw=none ./kernel_abc
Generating '/tmp/nsys-report-9994.qdstrm'
[1/1] [========================100%] report7.nsys-rep
Generated:
    /home/alexandre/Code/CUDA_code/report7.nsys-rep

Does the issue still happen with just nsys profile ./kernel_abc?

Could you also try the following:

nsys profile -t none ./kernel_abc
nsys profile -t nvtx -s none --cpuctxsw=none ./kernel_abc
nsys profile -t opengl -s none --cpuctxsw=none ./kernel_abc

Thanks


I just tried this, and --trace=opengl is the one causing the problem (nothing happens, and “End of file” is printed). I have:

  • NVIDIA Nsight Systems version 2022.1.1.61-1d07dc0
  • NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4

I’ve attached the strace log of this process and its forked children. You can find the “End of file” write there too. I hope this trace will be useful in figuring out what goes wrong. I would debug further myself, but without access to the code it’s hard. You can see that the “End of file” print is done by a forked process.
stracelog.txt (1.6 MB)

You can see at the beginning the dynamic loader loading libraries from my CUDA 11.4 install folder; hopefully those are not a problem. It should be noted that for this use case, I’m only interested in tracing OpenGL.

@courteauxmartijn Thanks for the update. We were able to reproduce this on our side. An internal ticket has been opened to track and fix it, but we don’t have a specific estimate yet.

For now there are some workarounds:

  1. If you don’t need to trace OpenGL, remove opengl from --trace (or -t) (note that the default value for --trace is cuda,nvtx,osrt,opengl when you do not explicitly specify it).

  2. If you do need to trace OpenGL, try:

    • Set VK_ICD_FILENAMES to an empty value, or to specific ICD files only, e.g. export VK_ICD_FILENAMES=

    • Rename /usr/share/vulkan/icd.d/lvp_icd.x86_64.json
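A sketch of the first variant (the profiling command is left as a comment with a placeholder app name, so the snippet itself runs anywhere):

```shell
# Clear the Vulkan ICD list so the loader does not enumerate driver
# JSON files while Nsight's OpenGL tracing is injected.
export VK_ICD_FILENAMES=
# Then profile with OpenGL tracing, e.g.:
#   nsys profile -t opengl ./my_gl_app    # 'my_gl_app' is a placeholder
echo "VK_ICD_FILENAMES is now '${VK_ICD_FILENAMES}'"
```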

@liuyis Dang! That VK_ICD_FILENAMES variable did it. Am I missing out on some features now, or does it work fully correctly? I would think it works fully correctly, as the VK_ prefix suggests Vulkan, and I’m profiling OpenGL. Thanks for coming back to me so quickly!

Hi @courteauxmartijn, glad to hear it works for you. Yes it works correctly with this workaround.

This also helped fix an issue on Ubuntu 20.04 with Nsight Systems and the -t vulkan flag. The same command line would report “End of file”, and nsys-ui would not even start the application or Vulkan tracing.

After renaming that file, both nsys -t vulkan and nsys-ui started working.

I’m curious, what is lvp_icd.x86_64.json for exactly? There are other files for Intel, AMD and NVIDIA.
https://packages.debian.org/sid/amd64/mesa-vulkan-drivers/filelist