Show CUDA HW in Nsight Systems

Hi, I am using Nsight Systems with Ray (version 2.9.0).

I ran a data-parallel workload on two A100s and on two T4s separately, with the same Nsight config:

self.nsight_config = {
    "t": "cuda,nvtx,osrt,python-gil",
    "o": f"'{nsight_output_file}'",
    "cudabacktrace": "true",
    "cuda-memory-usage": "true",
    "x": "true",
    "f": "true",
    "trace-fork-before-exec": "true",
    "gpu-metrics-devices": "all",
}
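
For context, a rough sketch of how a dict like this is typically attached to a worker through Ray's runtime_env "nsight" plugin (added in Ray 2.9). The actor and option values below are illustrative placeholders, not my exact code; the keys map to nsys CLI flags:

    import ray

    ray.init()

    nsight_config = {
        "t": "cuda,nvtx,osrt,python-gil",
        "o": "'worker_profile'",
        "cuda-memory-usage": "true",
        "x": "true",
        "f": "true",
        "gpu-metrics-devices": "all",
    }

    # Ray launches each worker process under an "nsys profile ..." command
    # built from these options, so one report is written per worker.
    @ray.remote(num_gpus=1, runtime_env={"nsight": nsight_config})
    class TrainWorker:
        def train_step(self):
            ...  # GPU work to be traced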

However, when I open the result in the timeline view, there are two problems:

  1. In the timeline view of the T4 run there is a CUDA HW row, but with the same config there is no CUDA HW row in the A100 run.

     2×T4 timeline view: [screenshot]

     2×A100 timeline view: [screenshot]

  2. There are many "MissingData" ranges in the GPU metrics rows. I also tried lowering the metrics sampling frequency to 10 Hz, but I still see "Missing Data".

I’m going to make a suggestion and then pass you to our gpu-metrics expert.

First of all, I would recommend removing the cudabacktrace option from the configuration. It causes Nsys to take a backtrace on every CUDA call, which adds exceedingly high overhead and may perturb the application. This is a change we recommended be made to Ray's initial configuration, and it will be in the next Ray version. Basically, you shouldn't use this option until you have a finer view of where you actually need those backtraces. Since removing it also means less buffering behind the scenes, it may help with some of your other issues as well.

Looping in @pkovalenko for the gpu-metrics question.

Could you please collect a log from an A100 profiling session that shows the missing data ranges?
https://docs.nvidia.com/nsight-systems/UserGuide/index.html#logging
or, more easily, comment out the line containing quadd_verbose_ in nvlog.config.template and launch:

NVLOG_CONFIG_FILE=/opt/nvidia/nsight-systems/2024.5.1/host-linux-x64/nvlog.config.template /opt/nvidia/nsight-systems/2024.5.1/host-linux-x64/nsys-ui

Thank you very much for your suggestion!

Hi, thank you very much for your response. I have modified my nsys profile arguments according to @hwilper's suggestion:

self.nsight_config = {
    "t": "cuda,nvtx,osrt,python-gil",
    "o": f"'{nsight_output_file}'",
    "cuda-memory-usage": "true",
    "x": "true",
    "f": "true",
    "trace-fork-before-exec": "true",
    "gpu-metrics-devices": "all",
}

I then ran my program separately on 2 T4s and 2 A100s. Here are the profile and log results, which include two zip files:

P.S.

  1. On the A100, CUDA HW is not visible in the timeline view, and there is occasional missing data in the GPU metrics rows.
  2. On the T4, CUDA HW is visible, but there is missing data once the program reaches a specific point.
  3. Each zip file includes multiple nsys-rep files, all of which were generated by the same program launched by Ray. I usually place them in a multi-report view for review.

A100 [ 3 filename.nsys-rep + nsys_agent.log ]
nsightA100_nsys-rep&log.zip (11.9 MB)

T4 [ 3 filename.nsys-rep]
nsightT4_nsys-rep.zip (52.2 MB)

P.S.

  1. I run the program in a Docker container with the --privileged option.
  2. Host: I’ve tried both Ubuntu 20.04 and CentOS 7.9 with the recommended kernel versions as the Docker host OS.
  3. Docker container info: Ubuntu 20.04, Python 3.8, Ray 2.9.0.

The A100 reports have no missing data ranges, and there is no log for the T4 reports. Please attach a log for a T4 report that has missing data in the GPU Metrics rows.

Here is the T4 nsys-agent.log.

nsightT4_nsys-rep_1.zip (16.6 MB)

Additionally, I would very much like to know how to enable CUDA HW in the timeline view on the A100 as well.

@liuyis

Looking at the log, it seems that you're experiencing a known issue that is currently being investigated. At this time there is no known workaround. @liuyis should be able to comment on the CUDA HW part.

Hi @lssyes_shuai, the options you used generally look correct. The only question I have is about the trace-fork-before-exec option - is there a reason you have to enable it? Does the application indeed have processes that are forked but not calling exec? If not, try removing this option and see if there is any difference.

If the issue persists, could you try collecting some additional logs for analysis? I've checked the logs you collected, but somehow they are missing the injection part related to CUDA tracing.

You can follow these steps:

  1. Save the following content to /tmp/nvlog.config:

    + 100iwef   global
    $ /tmp/nsight-sys.log
    ForceFlush
    Format $sevc$time|${name:0}|PID${pid:0}|TID${tid:0}|${file:0}:${line:0}[${sfunc:0}]:$text

  2. Add the option --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config to the Nsys CLI command line. Based on the config structure you shared, it might look like the following (see the complete sketch after this list):

        self.nsight_config = {
            ......,
            "env-var": "NVLOG_CONFIG_FILE=/tmp/nvlog.config"
        }

  3. Run the profiling session. After it finishes, there should be a log file at /tmp/nsight-sys.log. Share this file, together with the report from the profiling session, with us.
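
For reference, putting the logging option together with the options already used in this thread, the full dict might look like the sketch below (nsight_output_file is just a stand-in for whatever output path the application already uses; all option values are taken from earlier posts):

    # Illustrative only; adjust the output path to your setup.
    nsight_output_file = "/tmp/worker_profile"

    nsight_config = {
        "t": "cuda,nvtx,osrt,python-gil",
        "o": f"'{nsight_output_file}'",
        "cuda-memory-usage": "true",
        "x": "true",
        "f": "true",
        "trace-fork-before-exec": "true",
        "gpu-metrics-devices": "all",
        # Point the injected nsys libraries at the logging config from step 1.
        "env-var": "NVLOG_CONFIG_FILE=/tmp/nvlog.config",
    }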

Please also make sure any existing content in /tmp/nsight-sys.log is removed before you run a new profiling session, otherwise logs that do not belong to the session will make the analysis harder.

Also, it would be best if you could collect logs on both the T4 and the A100 so that, by comparing them, we might find the root cause more easily.

Thanks,
Liuyi

Hi, here’s my response regarding the issue you raised:

is there a reason you have to enable it? Does the application indeed have processes that are forked but not calling exec?

My program is launched from Python and includes modifications to XLA code written using the pybind11 library. I'm not entirely sure how XLA is executed, so I added the trace-fork-before-exec option just to be safe. However, when I run the exact same Docker container on a T4 with the same nsys options, I can see CUDA HW in the T4's profile result, but I can't see CUDA HW in the A100's result.

I followed your suggestion and removed the trace-fork-before-exec option, then profiled my program again, but I still did not see CUDA HW.

I then re-added the trace-fork-before-exec option and enabled the logging as you suggested. The results are as follows:
nsightA100_NO_cudahw_log.zip (12.4 MB)

Thanks,

AtLiang

Thanks for collecting the new logs. I do notice some weirdness in the log, but I'm not very clear on why it is happening. One thing I noticed is that the atexit handler we register in the injection library was not invoked, so the cleanup routines on application exit were not executed. That could leave the injection buffer unflushed when the app exits, which can result in missing events. Do you know if the application was exiting gracefully? Or was it forcibly killed or terminated?
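
To help answer the graceful-exit question, one quick check (a hypothetical probe, not part of Nsight Systems) is to register your own atexit handler inside the worker and see whether it fires:

    import atexit
    import os
    import time

    def _exit_probe():
        # If this line never shows up in the file, the worker was killed
        # before atexit handlers could run (e.g. SIGKILL or a forced kill),
        # which would also prevent the nsys injection from flushing.
        with open("/tmp/exit_probe.log", "a") as f:
            f.write(f"pid {os.getpid()} exited gracefully at {time.time()}\n")

    atexit.register(_exit_probe)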

Is it possible to collect the same log on the T4 GPU system so we can compare and see if there’s any additional insights?

Also, is it possible for us to run this application and reproduce the issue on our side?


Thank you so much for your help! This issue has been bothering me for a long time!!!

What you mentioned about the atexit handler gave me an idea. I found that I was using ray.kill to terminate my application, and I changed it to actor.__ray_terminate__.remote() based on the Ray documentation. After that, the CUDA HW row appeared. Thanks to your help, my problem has been completely resolved. I'm extremely grateful to all of you.
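
For anyone hitting the same thing, here is a minimal sketch of the difference with a hypothetical Worker actor (ray.kill skips atexit handlers, while __ray_terminate__ lets them run, per the Ray documentation):

    import ray

    ray.init()

    @ray.remote
    class Worker:
        def ping(self):
            return "pong"

    w = Worker.remote()
    print(ray.get(w.ping.remote()))

    # Forced kill: the worker process is terminated immediately, atexit
    # handlers do NOT run, and the nsys injection cannot flush its buffers,
    # so the report can end up missing the CUDA HW rows.
    # ray.kill(w)

    # Graceful shutdown: queues a termination task behind pending work,
    # the worker exits normally, atexit handlers run, and nsys can
    # finalize the report.
    w.__ray_terminate__.remote()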

Also, as for why the T4 could display CUDA HW, my guess is that the overhead on the T4 is relatively higher, which may have allowed the atexit handler to partially run.

