I’m going to make a suggestion and then pass you to our gpu-metrics expert.
First of all, I would recommend you remove --cudabacktrace from the configuration file. That option causes Nsys to take a backtrace on every CUDA call, which adds exceedingly high overhead and may perturb the application. This is a change we recommended be made to Ray's initial configuration, and it will be in the next Ray version. Basically, you shouldn't use this option until you have a finer view of where you actually need those backtraces. Since this also means less buffering behind the scenes, it may help with some of your other issues as well.
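(If you later find you do need backtraces, nsys can restrict where they are collected instead of taking one on every call; the exact accepted values depend on your nsys version, so treat the following line as an illustrative sketch only:

nsys profile --cudabacktrace=kernel <your existing options> <your application>

This limits backtrace collection to kernel launch calls rather than every traced CUDA call.)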
Looping in @pkovalenko for the gpu-metrics question.
I then ran my program separately on 2 T4s and 2 A100s. Here are the profile and log results, which include two zip files:
P.S.
On the A100, CUDA:HW is not visible in the timeline view, and there are occasional gaps in the GPU metrics rows.
On the T4, CUDA:HW is visible, but data goes missing once the program reaches a specific point.
The zip files include multiple nsys-rep files, all generated by the same program launched by Ray. I usually open them in the multi-report view for review.
The A100 reports have no missing data ranges, and there is no log for the T4 reports. Please attach a log for a T4 report that shows missing data in the GPU Metrics rows.
Looking at the log, it seems you're experiencing a known issue that is currently being investigated. At this time there is no known workaround. @liuyis should be able to comment on the CUDA HW part.
Hi @lssyes_shuai , the options you used generally look correct. The only question I have is about the --trace-fork-before-exec option - is there a reason you have to enable it? Does the application indeed have processes that fork without calling exec? If not, try removing this option and see if there is any difference.
If the issue persists, could you try collecting some additional logs for analysis? I've checked the logs you collected, but somehow they are missing the injection part that is related to CUDA tracing.
You can follow these steps:
Save the following content to /tmp/nvlog.config:
+ 100iwef global
$ /tmp/nsight-sys.log
ForceFlush
Format $sevc$time|${name:0}|PID${pid:0}|TID${tid:0}|${file:0}:${line:0}[${sfunc:0}]:$text
Add the option --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config to the Nsys CLI command line. Based on the config structure you shared, it might look like the command below.
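(The following is only an illustrative sketch; keep the trace options you are already using and substitute your real launch command - the only relevant addition is the --env-var option.)

nsys profile --trace=cuda,nvtx,osrt --gpu-metrics-device=all --trace-fork-before-exec=true --env-var=NVLOG_CONFIG_FILE=/tmp/nvlog.config python your_app.py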
Run the profiling session. After it finishes, there should be a log file at /tmp/nsight-sys.log. Share this file with us, together with the report from the profiling session.
Please also make sure any existing content in /tmp/nsight-sys.log is removed before you run a new profiling session; otherwise, logs not belonging to that session will make the analysis harder.
Also, it would be best if you could collect logs on both the T4 and the A100, so that by comparing them we might find the root cause more easily.
Hi, here’s my response regarding the issue you raised:
is there a reason you have to enable it? Does the application indeed have processes that fork without calling exec?
My program is launched from Python and includes modifications to XLA code written using the Pybind11 library. I'm not entirely sure how XLA is executed, so I added the --trace-fork-before-exec option just to be safe. However, when I run the exact same Docker container on a T4 with the same nsys options, I can see cuda:hw in the T4's profile result, but not in the A100's.
I followed your suggestion and removed the --trace-fork-before-exec option, then profiled my program again, but I still did not see cuda:hw.
I then re-added the --trace-fork-before-exec option and collected the log as you suggested. The results are as follows: nsightA100_NO_cudahw_log.zip (12.4 MB)
Thanks for collecting the new logs. I do notice some weirdness in the log, but I'm not entirely sure why it is happening. One thing I noticed is that the atexit handler we registered in the injection library was not invoked, so the cleanup routines on application exit were not executed. That could leave the injection buffer unflushed when the app exits, which can result in missing events. Do you know if the application was exiting gracefully? Or was it forcibly killed or terminated?
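A minimal, self-contained Python sketch of why this matters (purely illustrative; this is not the actual injection code): atexit handlers only run when the process exits through the normal teardown path, not when it is hard-killed.

import atexit
import os
import signal
import sys

def cleanup():
    # Stand-in for the profiler's cleanup routine: flush buffered events to disk.
    print("atexit handler ran: buffers flushed")

atexit.register(cleanup)

if len(sys.argv) > 1 and sys.argv[1] == "kill":
    # Hard kill: the process dies immediately, atexit never runs,
    # and anything still buffered is lost.
    os.kill(os.getpid(), signal.SIGKILL)
else:
    # Normal exit: the interpreter tears down and the atexit handler runs.
    sys.exit(0)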
Is it possible to collect the same log on the T4 GPU system, so we can compare them and see if there are any additional insights?
Also, is it possible for us to run this application and reproduce the issue on our side?
Thank you so much for your help! This issue has been bothering me for a long time!!!
What you mentioned about the atexit handler gave me an idea. I found that I was using ray.kill to terminate my application, and based on the Ray documentation I changed it to actor.__ray_terminate__.remote(). After that, the cuda:hw metrics appeared. Thanks to your help, my problem has been perfectly resolved. I'm extremely grateful to all of you.
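For anyone hitting the same problem, here is a minimal sketch of the two termination styles (the actor itself is illustrative; ray.kill is Ray's forceful path and actor.__ray_terminate__.remote() is the graceful one described in the Ray documentation):

import ray

ray.init()

@ray.remote
class Worker:
    def step(self):
        return "ok"

actor = Worker.remote()
ray.get(actor.step.remote())

# Forceful termination: the actor process is killed without the normal
# teardown, so atexit handlers (including the profiler's cleanup) do not run.
# ray.kill(actor)

# Graceful termination: queued like a regular task, so the actor exits
# through the normal path and atexit handlers can flush their buffers.
actor.__ray_terminate__.remote()

ray.shutdown()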
Also, as for why the T4 could still display cuda:hw, my guess is that the overhead on the T4 is relatively higher, which led to the atexit handler being at least partially executed.