I have an application where multiple CUDA streams are used to achieve more concurrency.
Nsight Systems doesn’t show any information about the kernels running in those streams (or in the default stream); it reports that memory operations (memset in my case) take 100% of the time, which is obviously wrong.
Nsight Systems shows the following (note that there are no warnings or errors): [screenshot]
The output is the same whether I profile through the Nsight Systems GUI or run “nsys profile -t cuda ./myapp” and then import the report file into the GUI.
Versions, hardware:
Ubuntu 18.04, GeForce RTX 2070 (the same situation is on Tesla V100), Driver Version: 418.67, CUDA Version: 10.1.
What’s wrong?
UPDATE: the same thing happens with an older version of the app that uses the default stream for all calculations. So multiple streams are not the cause; kernels are simply not traced.
The app used dynamic parallelism; after I removed it, everything worked fine. Question 1: Is this a bug?
Still, there were no warnings, errors, or any other messages indicating issues or restrictions on profiling my app. Question 2: How can I suggest an improvement to Nsight Systems, or file a bug if this is one?
Thank you for drawing our attention to this! It looks like Nsight Systems currently doesn’t trace CDP (CUDA Dynamic Parallelism) kernels correctly. We’ll get a bug filed internally, and will update this thread once we have more information.
Unfortunately, due to another issue, you will need to create a file named config.ini, containing the line “HandleInvalidEvents=true”, in the directory where you launch the nsys command.
For example:
% cat config.ini
HandleInvalidEvents=true
% nsys profile -t cuda ./yourApp
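A minimal sketch of creating that file from the shell, assuming a POSIX shell and that the current directory is the one nsys will be launched from. The grep check is added here for illustration only and is not part of the workaround itself; yourApp remains a placeholder.

```shell
# Assumption: the current directory is where nsys will be launched.
# Write the single setting quoted above into config.ini.
printf 'HandleInvalidEvents=true\n' > config.ini

# Illustrative sanity check: confirm the file contains exactly that line.
grep -qx 'HandleInvalidEvents=true' config.ini && echo "config.ini ready"

# Then profile as shown above (yourApp is a placeholder):
# nsys profile -t cuda ./yourApp
```

Writing the file with printf instead of an editor avoids stray whitespace or a missing trailing newline in the setting.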
I downloaded the new version and created config.ini with the content you posted. Several other issues…
When I run nsys on the app with dynamic parallelism present in the code, even though that code path is never executed, I get an “an illegal memory access was encountered” error. Without nsys the app finishes without any errors.
After that, the Nsight Compute GUI doesn’t show anything meaningful. Check out the screenshot: [screenshot]
When I comment out the code with dynamic parallelism, everything works as expected.
I can’t say that a solution requiring the creation of an additional file is convenient or user-friendly. How would people know that the file has to be created without reading this post?
Hi Timofei, unfortunately the screenshot doesn’t seem to be available to us (410 Gone). Can you please attach it directly in the reply so that we can see it? Thanks!