After switching to nsight-systems-2022.4.2 (from a 2022.1 version) on an Ubuntu 20.04.5 LTS system, installed via NVIDIA’s apt repo, I’ve noticed that my tests, which run several instances of nsys in parallel (from a typical cmake/make target), now sometimes leave some of the nsys instances hanging: they run forever at close to 100% CPU utilization. This happens about 50% of the time.
Another 20.04 LTS system with a similar configuration and nsys 2022.3.4.34-133b775 doesn’t have this issue.
I also tried the latest version available as of today from the Getting Started with Nsight Systems page on NVIDIA Developer, downloaded and installed via the run file (NsightSystems-linux-public-2022.4.1.21-0db2c85.run), and it exhibits the same issue.
I noticed that the issue persists even when running a single instance; running several instances concurrently just increases the chances of encountering it, so a single instance hangs less frequently.
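To put a number on how often the hang occurs, a small harness along the following lines may help. This is only a sketch: the helper names `run_with_timeout` and `count_hangs`, the timeout value, and the `./my_app` placeholder in the usage comment are assumptions for illustration, not part of my actual test setup.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return True if it exited within timeout_s seconds.

    On timeout, subprocess.run kills the direct child and we return False.
    Processes spawned by that child may survive and need separate cleanup.
    """
    try:
        subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


def count_hangs(cmd, runs, timeout_s):
    """Run cmd `runs` times sequentially and count how many runs timed out."""
    hangs = 0
    for _ in range(runs):
        if not run_with_timeout(cmd, timeout_s):
            hangs += 1
    return hangs


# Hypothetical usage (./my_app stands in for the profiled application):
#   cmd = ["nsys", "profile", "--force-overwrite=true", "-o", "report", "./my_app"]
#   print(count_hangs(cmd, runs=20, timeout_s=300), "of 20 runs hung")
```

The timeout should be chosen well above the normal run time of the profiled application, so that a slow run is not counted as a hang.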
This appears to be an issue with nsys 2022.4. I like the fact that nsys 2022.4 now supports reporting thread block sizes with GPU kernel summary rows, and I’d like to use it if it’s free of any issues.
Sure, I am happy to share the report. I’ll have to switch back to the 2022.4 version and experiment with your suggestions - I should be able to get back in a day.
Thanks for sharing the information, glad we’ve got a workaround for your use case. For further investigation, is it possible to set up the application on our side, so we can reproduce the issue and debug it?
If that’s not possible, could you help collect debug logs with the following steps:
Save the following content to nvlog.config:
+ 75iwef global
- quadd_verbose_
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text
Add NVLOG_CONFIG_FILE=<path to 'nvlog.config'> to your Nsys CLI command line, for example NVLOG_CONFIG_FILE=/tmp/nvlog.config nsys profile --force-overwrite=true -o gpu_ <command>.
Run the command as usual. If logging works as expected, there should be a log file at /tmp/nsight-sys.log. Share the file with us and we will try to figure out why it hangs.
If you are running multiple instances, it is best to set NVLOG_CONFIG_FILE=<path to 'nvlog.config'> for only one of the instances; otherwise the logs will be mixed and harder to investigate. Also, make sure the log is collected from an instance where the hang actually happened.
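If the instances are launched from a script, the steps above could be sketched roughly as follows. This is an illustration under assumptions: the helper names (`nvlog_config_text`, `write_nvlog_config`, `instance_envs`) are hypothetical, and only the config contents and the NVLOG_CONFIG_FILE variable come from the steps above.

```python
import os
from pathlib import Path


def nvlog_config_text(log_path="/tmp/nsight-sys.log"):
    """Return the nvlog.config contents listed above, logging to log_path."""
    lines = [
        "+ 75iwef global",
        "- quadd_verbose_",
        "$ " + str(log_path),
        "ForceFlush",
        "Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]:$text",
    ]
    return "\n".join(lines) + "\n"


def write_nvlog_config(config_path, log_path="/tmp/nsight-sys.log"):
    """Write the config file and return its path as a string."""
    Path(config_path).write_text(nvlog_config_text(log_path))
    return str(config_path)


def instance_envs(n_instances, config_path, logged_index=0):
    """Build one environment per instance.

    Only the instance at logged_index gets NVLOG_CONFIG_FILE, so the log
    file is written by a single nsys process and is not interleaved.
    """
    envs = []
    for i in range(n_instances):
        env = dict(os.environ)
        env.pop("NVLOG_CONFIG_FILE", None)  # make sure the others do not log
        if i == logged_index:
            env["NVLOG_CONFIG_FILE"] = str(config_path)
        envs.append(env)
    return envs
```

Each environment can then be passed to the launcher of the corresponding instance, for example `subprocess.Popen(cmd, env=envs[i])`. If the logged instance does not hang in a given run, the run has to be repeated until it does.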