I’m going to make a suggestion and then pass you to our gpu-metrics expert.
First of all, I would recommend you remove --cudabacktrace from the configuration file. That option causes Nsys to take a backtrace on every CUDA call, which adds exceedingly high overhead and may perturb the application. This is a change we recommended be made to Ray's initial configuration, and it will be in the next Ray version. Basically, you shouldn't use this option until you have a finer view of where you actually need those backtraces. Since this also means less buffering behind the scenes, it may help with some of your other issues as well.
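(If you later find you do need backtraces, nsys can restrict where they are collected instead of taking one on every call; the exact accepted values depend on your nsys version, so treat the following line as an illustrative sketch only:

nsys profile --cudabacktrace=kernel <your existing options> <your application>

This limits backtrace collection to kernel launch calls rather than every traced CUDA call.)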
Looping in @pkovalenko for the gpu-metrics question.
I then ran my program separately on 2 T4s and 2 A100s. Here are the profile and log results, which include two zip files:
P.S.
On the A100, CUDA:HW is not visible in the timeline view, and there are occasional gaps in the GPU metrics rows.
On the T4, CUDA:HW is visible, but data goes missing once the program reaches a specific point.
The zip files include multiple nsys-rep files, all generated by the same program launched by Ray. I usually open them in the multi-report view for review.
The A100 reports have no missing data ranges, and there is no log for the T4 reports. Please attach a log for a T4 report that shows missing data in the GPU Metrics rows.
Looking at the log, it seems you're experiencing a known issue that is currently being investigated. At this time there is no known workaround. @liuyis should be able to comment on the CUDA HW part.
Hi @lssyes_shuai , the options you used generally look correct. The only question I have is about the --trace-fork-before-exec option - is there a reason you have to enable it? Does the application indeed have processes that fork without calling exec? If not, try removing this option and see if there is any difference.
If the issue persists, could you try collecting some additional logs for analysis? I've checked the logs you collected, but somehow they are missing the injection part that is related to CUDA tracing.
You can follow these steps:
Save the following content to /tmp/nvlog.config:
+ 100iwef global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|PID${pid:0}|TID${tid:0}|${file:0}:${line:0}[${sfunc:0}]:$text
Add the option --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config to the Nsys CLI command line. Based on the config structure you shared, it might look like the command below.
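(The following is only an illustrative sketch; keep the trace options you are already using and substitute your real launch command - the only relevant addition is the --env-var option.)

nsys profile --trace=cuda,nvtx,osrt --gpu-metrics-device=all --trace-fork-before-exec=true --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config python your_app.py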
Run the profiling session. After it finishes, there should be a log file at /tmp/nsight-sys.log. Share this file with us, together with the report from the profiling session.
Please also make sure any existing content in /tmp/nsight-sys.log is removed before you run a new profiling session; otherwise, logs not belonging to that session will make the analysis harder.
Also, it would be best if you could collect logs on both the T4 and the A100, so that by comparing them we might find the root cause more easily.
Hi, here’s my response regarding the issue you raised:
is there a reason you have to enable it? Does the application indeed have processes that fork without calling exec?
My program is launched from Python and includes modifications to XLA code written using the Pybind11 library. I'm not entirely sure how XLA is executed, so I added the --trace-fork-before-exec option just to be safe. However, when I run the exact same Docker container on a T4 with the same nsys options, I can see cuda:hw in the T4's profile result, but not in the A100's.
I followed your suggestion and removed the --trace-fork-before-exec option, then profiled my program again, but I still did not see cuda:hw.
I then re-added the --trace-fork-before-exec option and collected the log as you suggested. The results are as follows: nsightA100_NO_cudahw_log.zip (12.4 MB)
Thanks for collecting the new logs. I do notice some weirdness in the log, but I'm not entirely sure why it is happening. One thing I noticed is that the atexit handler we registered in the injection library was not invoked, so the cleanup routines on application exit were not executed. That could leave the injection buffer unflushed when the app exits, which can result in missing events. Do you know if the application was exiting gracefully? Or was it forcibly killed or terminated?
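A minimal, self-contained Python sketch of why this matters (purely illustrative; this is not the actual injection code): atexit handlers only run when the process exits through the normal teardown path, not when it is hard-killed.

import atexit
import os
import signal
import sys

def cleanup():
    # Stand-in for the profiler's cleanup routine: flush buffered events to disk.
    print("atexit handler ran: buffers flushed")

atexit.register(cleanup)

if len(sys.argv) > 1 and sys.argv[1] == "kill":
    # Hard kill: the process dies immediately, atexit never runs,
    # and anything still buffered is lost.
    os.kill(os.getpid(), signal.SIGKILL)
else:
    # Normal exit: the interpreter tears down and the atexit handler runs.
    sys.exit(0)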
Is it possible to collect the same log on the T4 GPU system, so we can compare them and see if there are any additional insights?
Also, is it possible for us to run this application and reproduce the issue on our side?
Thank you so much for your help! This issue has been bothering me for a long time!!!
What you mentioned about the atexit handler gave me an idea. I found that I was using ray.kill to terminate my application, and based on the Ray documentation I changed it to actor.__ray_terminate__.remote(). After that, the cuda:hw metrics appeared. Thanks to your help, my problem has been perfectly resolved. I'm extremely grateful to all of you.
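For anyone hitting the same problem, here is a minimal sketch of the two termination styles (the actor itself is illustrative; ray.kill is Ray's forceful path and actor.__ray_terminate__.remote() is the graceful one described in the Ray documentation):

import ray

ray.init()

@ray.remote
class Worker:
    def step(self):
        return "ok"

actor = Worker.remote()
ray.get(actor.step.remote())

# Forceful termination: the actor process is killed without the normal
# teardown, so atexit handlers (including the profiler's cleanup) do not run.
# ray.kill(actor)

# Graceful termination: queued like a regular task, so the actor exits
# through the normal path and atexit handlers can flush their buffers.
actor.__ray_terminate__.remote()

ray.shutdown()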
Also, as for why the T4 could still display cuda:hw, my guess is that the overhead on the T4 is relatively higher, which led to the atexit handler being at least partially executed.