Nsys hangs frequently when run in parallel

After switching to nsight-systems-2022.4.2 (from a 2022.1 version) on Ubuntu 20.04.5 LTS, installed via NVIDIA’s apt repo, I’ve noticed that my tests, which run several instances of nsys in parallel (from a typical cmake/make target), now sometimes leave some of the nsys instances hanging: they run forever at close to 100% CPU utilization. This happens about 50% of the time:

(output of top)

4078426 user-+ 20 0 5436512 134672 24776 S 100.3 0.1 45:54.47 nsys
3881769 user-+ 20 0 5526624 211676 26144 S 100.0 0.2 100:00.58 nsys

  • Never had this issue with 2022.1.
  • Another 20.04 LTS system that has a similar configuration with nsys 2022.3.4.34-133b775 doesn’t have this issue either.
  • I also tried the latest available version as of today from the “Getting Started with Nsight Systems” page on NVIDIA Developer, downloaded and installed via the run file (NsightSystems-linux-public-2022.4.1.21-0db2c85.run), and it exhibits the same issue.
  • I noticed that the issue persists even when running a single instance; running concurrently just increases the chances of encountering it, so it shows up less frequently with a single instance.

This appears to be an issue with nsys 2022.4. I like the fact that nsys 2022.4 now supports reporting thread block sizes with GPU kernel summary rows, and I’d like to use it if it’s free of any issues.

Package: nsight-systems-2022.4.2
Version: 2022.4.2.1-df9881f

| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |

"Ubuntu 20.04.5 LTS"

Product Name                          : NVIDIA GeForce RTX 3090

@liuyis can you take a look at this when you get a chance?

Hi @uday1, what’s the Nsys command you used? Do you have the terminal outputs from Nsys when it’s hanging?

It’s an nsys profile followed by an nsys stats for each of those instances:

$ nsys profile --force-overwrite=true -o gpu_ <command> && echo "Perf: ... kernel takes `nsys stats --format csv --report gpukernsum --timeunit=msec gpu_.nsys-rep  | grep copy_global_memref_kernel | cut -f 2 -d ','` ms"
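The extraction step in that pipeline (grep for the kernel, then cut the second comma-separated field) can be sketched as a small helper. This is only an illustration: kernel_field is a hypothetical name, and it assumes, as the original cut -f 2 does, that the value of interest sits in the second CSV column. Unlike grep, it stops at the first matching row, so a second kernel with a similar name cannot add an extra line to the output.

```shell
# kernel_field (hypothetical helper): read CSV on stdin and print the second
# field of the first row that contains the given kernel name.
# $1: kernel name, matched as a plain substring (like grep in the original)
kernel_field() {
    awk -F',' -v name="$1" 'index($0, name) { print $2; exit }'
}
```

It would be used in place of the grep/cut pair, for example: nsys stats --format csv --report gpukernsum --timeunit=msec gpu_.nsys-rep | kernel_field copy_global_memref_kernel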

Could you try the following to see whether the issue reproduces (this could help us locate which part went wrong):

  1. Remove the subsequent nsys stats
  2. nsys profile -t cuda -s none --cpuctxsw=none --force-overwrite=true -o gpu_ <command>
  3. nsys profile -t osrt -s none --cpuctxsw=none --force-overwrite=true -o gpu_ <command>
  4. nsys profile -t nvtx,opengl -s none --cpuctxsw=none --force-overwrite=true -o gpu_ <command>
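Since a hang here means nsys never returns, it may help to run each variant under a time limit so that a hung run is reported instead of blocking the test. The following is a minimal sketch, not something from Nsight Systems itself: run_variant is a hypothetical helper built on the coreutils timeout command, and the time limit is an assumption you would tune to your workload.

```shell
# run_variant (hypothetical helper): run a command under a time limit and
# report whether it finished or had to be killed.
# $1: label for the report, $2: limit in seconds, remaining args: the command
run_variant() {
    label=$1
    limit=$2
    shift 2
    # Send TERM after $limit seconds, then KILL 5 seconds later if still alive.
    timeout -k 5 "$limit" "$@"
    status=$?
    # timeout exits with 124 when the limit was hit, or 137 (128+9) if KILL
    # was needed. A command that itself exits with one of these codes would be
    # misreported as a hang.
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        echo "$label: HANG (killed after ${limit}s)"
    else
        echo "$label: finished with exit status $status"
    fi
}
```

For example, variant 2 with an arbitrary 300-second limit: run_variant cuda-only 300 nsys profile -t cuda -s none --cpuctxsw=none --force-overwrite=true -o gpu_ <command>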

Also, is it possible to share a report (nsys-rep file) that you captured on a successful run?

Sure, I am happy to share the report. I’ll have to switch back to the 2022.4 version and experiment with your suggestions - I should be able to get back in a day.

Tracing the CUDA APIs is all I’m interested in, and I can confirm that:

  1. The issue is reproducible with -t cuda -s none added as well.
  2. The issue is not reproducible with -t cuda -s none --cpuctxsw=none.

I assume the latter set of flags, which works, is sufficient for my purpose.

A report of a successful run is attached.
gpu_.nsys-rep (312.8 KB)

Thanks for sharing the information, glad we’ve got a workaround (WAR) for your use case. For further investigation, is it possible to set up the application on our side, so we can reproduce the issue and debug?

If that’s not possible, could you help collect debugging logs with the following steps:

  1. Save the following content to nvlog.config:
+ 75iwef global

- quadd_verbose_

$ /tmp/nsight-sys.log


Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text
  2. Add NVLOG_CONFIG_FILE=<path to 'nvlog.config'> to your Nsys CLI command line, for example NVLOG_CONFIG_FILE=/tmp/nvlog.config nsys profile --force-overwrite=true -o gpu_ <command>.
  3. Run the command as usual, and if it works as expected, there should be a log file at /tmp/nsight-sys.log. Share the file with us and we will try to figure out why it could hang.
  4. If you are running multiple instances, it is best to add NVLOG_CONFIG_FILE=<path to 'nvlog.config'> to only one of the instances; otherwise the logs will be mixed and harder to investigate. Also, make sure the log is collected from an instance where the hang actually happened.
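One way to follow the last step when the instances are launched from a script is sketched below. launch_instances is a hypothetical helper, not part of Nsys: it starts several copies of a command in parallel, sets NVLOG_CONFIG_FILE only for the first one, and explicitly unsets it for the rest. It assumes env -u is available (it is in GNU coreutils). If the logged instance does not hang in a given round, the run has to be repeated until it does.

```shell
# launch_instances (hypothetical helper): start several copies of a command
# in parallel, with NVLOG_CONFIG_FILE set for the first copy only.
# $1: number of instances, $2: path to nvlog.config, remaining args: the command
launch_instances() {
    count=$1
    config=$2
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ "$i" -eq 0 ]; then
            # Only this instance writes the debug log.
            env NVLOG_CONFIG_FILE="$config" "$@" &
        else
            # Make sure an inherited value does not leak into other instances.
            env -u NVLOG_CONFIG_FILE "$@" &
        fi
        i=$((i + 1))
    done
    # Block until every instance has exited (this waits forever on a hang).
    wait
}
```

For example: launch_instances 4 /tmp/nvlog.config nsys profile --force-overwrite=true -o gpu_ <command>. In practice each instance would also need its own output name so the reports do not overwrite one another.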