Nsys hangs when profiling CUDA applications

When I run nsys profile on any CUDA application, it hangs forever, e.g.:

> nsys profile -t cuda cuda-samples/bin/x86_64/linux/release/deviceQuery
> /home/qtfan/cuda-samples/bin/x86_64/linux/release/deviceQuery Starting...
> 
>  CUDA Device Query (Runtime API) version (CUDART static linking)

It seems to get stuck at the first CUDA API call. Does anyone know the possible reasons?

I’m using Ubuntu 22.04. Here is more information about my system:
qtfan@legion:~$ nsys -v
NVIDIA Nsight Systems version 2023.2.3.1001-32894139v0

qtfan@legion:~$ nvidia-smi
Wed Feb 28 16:43:44 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 … On | 00000000:01:00.0 On | N/A |
| N/A 41C P8 19W / 80W | 1438MiB / 6144MiB | 8% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1921 G /usr/lib/xorg/Xorg 1006MiB |
| 0 N/A N/A 2146 G /usr/bin/gnome-shell 93MiB |
| 0 N/A N/A 15502 G …9901597,15340099953604003785,131072 286MiB |
| 0 N/A N/A 16520 C+G …tems/2023.2.3/target-linux-x64/nsys 4MiB |
+---------------------------------------------------------------------------------------+

qtfan@legion:~$ nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 6.5.0-21-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

qtfan@legion:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

What is the command you are using to run Nsys?

Assuming that you are running from the command line, can you try adding "--trace=cuda,nvtx" to your command line? This bypasses the OS runtime trace, which sometimes has issues on complicated systems.

I would also suggest updating Nsys to the latest release, 2024.1, from developer.nvidia.com/nsight-systems

Hi @hwilper,
I used the '--trace=cuda' option, for example:
nsys profile --trace=cuda deviceQuery

In this deviceQuery example, I confirmed that it gets stuck at the first CUDA API call, cudaGetDeviceCount().
It is very strange: I could actually run nsight-systems a few weeks ago, but now it hangs. I have no idea what changed on my system, and there is no clue about what the problem is.

I just tried 2024.1.1; it has the same problem.

Do you have any idea what the cause might be?

Thanks

Can you check top and see if you have any zombie Nsys (or CUPTI) processes on the system?
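
For reference, here is one way to do that check from a shell instead of scanning top by eye. This is only a sketch: the function names find_suspects and list_suspects are made up for illustration, and matching on "nsys" or "cupti" in the command name is an assumption about how leftover processes would show up.

```shell
# Keep only lines that look like zombies (state starts with Z) or whose
# command name mentions nsys/cupti. Expects "PID PPID STAT COMMAND" on stdin.
# The name patterns are an assumption, not an official list.
find_suspects() {
    awk '$3 ~ /^Z/ || tolower($4) ~ /nsys|cupti/ { print }'
}

# Hypothetical convenience wrapper: feed the live process table to the filter.
list_suspects() {
    ps -eo pid,ppid,stat,comm | find_suspects
}
```

Calling list_suspects with no arguments prints any matching processes. Empty output means nothing matched these patterns.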

@liuyis can you triage this if @qtfan needs more assistance?

I checked with nvidia-smi: it shows the Xorg and gnome-shell processes before I run the nsys command. I also tried rebooting and updating the system to the latest packages from the repo, but neither helped.

@qtfan Could you try the latest Nsys release, 2024.1, from https://developer.nvidia.com/nsight-systems/get-started?

If it still does not work, could you try collecting logs with the following steps:

  1. Save the following content to nvlog.config:
+ 75iwef global

- quadd_verbose_

$ /tmp/nsight-sys.log

ForceFlush

Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text
  2. Add NVLOG_CONFIG_FILE=<path to 'nvlog.config'> to your Nsys CLI command line, for example NVLOG_CONFIG_FILE=/tmp/nvlog.config nsys profile --trace=cuda ...
  3. Run the command as usual. If logging works as expected, there should be a log file at /tmp/nsight-sys.log. Share the file with us and we will try to figure out why it hangs.
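
The steps above can be sketched as a small shell snippet. The config body is copied from step 1. The helper name write_nvlog_config and the /tmp/nvlog.config location are just illustrative choices, and the application path in the final comment is a placeholder.

```shell
# Write the logging config from step 1 to the given path.
# The heredoc delimiter is quoted so the shell leaves the $ signs untouched.
write_nvlog_config() {
    cat > "$1" <<'EOF'
+ 75iwef global

- quadd_verbose_

$ /tmp/nsight-sys.log

ForceFlush

Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text
EOF
}

write_nvlog_config /tmp/nvlog.config

# Then run the profiler with the variable set for that one command
# (replace <your-app> with the application being profiled):
#   NVLOG_CONFIG_FILE=/tmp/nvlog.config nsys profile --trace=cuda <your-app>
```

Setting the variable as a prefix on the command keeps it scoped to that single nsys run instead of exporting it for the whole session.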

Also, could you try the option --trace=none? This is not a solution; it just helps us confirm whether the issue is related to CUDA trace.

@liuyis I tried 2024.1; it still does not work. I uploaded logs following your steps. (The log was collected after I terminated the hanging process with Ctrl+C.)

nsight-sys.log (840.2 KB)

I tried the option --trace=none; with it, the run completes.
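
Since --trace=none completes while --trace=cuda hangs, a rough way to narrow this down further is to try each trace domain on its own under a time limit. The following is a sketch only: probe and bisect_traces are hypothetical helper names, and the 60-second limit and the /tmp output prefix are arbitrary assumptions.

```shell
# Run a command under a time limit and report how it ended.
# $1 = limit in seconds, remaining arguments = command to run.
# coreutils `timeout` exits with status 124 when the limit is hit.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang"
    elif [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "error($status)"
    fi
}

# Profile the given application once per trace domain and print the outcome.
# The 60 s limit and the output prefix are assumptions for illustration.
bisect_traces() {
    app=$1
    for t in none nvtx osrt cuda; do
        printf '%s: ' "$t"
        probe 60 nsys profile --trace="$t" --force-overwrite=true \
            -o "/tmp/nsys_probe_$t" "$app"
    done
}
```

For example, bisect_traces ./deviceQuery prints one line per domain. Note that timeout only signals the nsys process it started, so it is worth checking for leftover processes after a run reported as "hang".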

@qtfan Thank you. Unfortunately it’s still not clear what went wrong from the log. Is it possible for me to access your system to debug the issue?

@liuyis The system is on a local network, and I am not sure how to make it accessible from outside.
I will try reinstalling everything from scratch. If you have any other way to debug this, please let me know.

One thing we can try is for you to attach GDB to the application when it hangs and capture the backtrace. But we cannot share debugging versions of the libraries with you, so I’m not sure whether a meaningful backtrace can be captured on your side.
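
A minimal sketch of capturing that backtrace non-interactively, assuming gdb is installed. The function name capture_backtrace and the default output path are made up for illustration.

```shell
# Attach gdb to a running process, dump backtraces for all threads, detach.
# $1 = PID of the hung application, $2 = output file (optional).
# The default output path is an arbitrary choice.
capture_backtrace() {
    pid=$1
    out=${2:-/tmp/nsys-hang-backtrace.txt}
    # -batch exits after running the commands given with -ex.
    gdb -p "$pid" -batch -ex "thread apply all bt" > "$out" 2>&1
}
```

A possible invocation while the application is hung would be capture_backtrace "$(pgrep -n deviceQuery)". On Ubuntu, attaching to a process that is not a child of the debugger is restricted by default (kernel.yama.ptrace_scope), so the gdb call may need to be run with sudo. Frames inside libraries without debug symbols will probably carry little detail, as noted above.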