A problem where the duration in the Nsys report and the actual application runtime are very different

Hi, I recently tried to profile an LLM-training application running in a multi-GPU environment, but I ran into a big problem.
The duration (148 s) in the nsys report is about three times longer than the actual application runtime (44 s).
Even accounting for profiling overhead, the difference is so large that I don’t know what the problem is.
Below is the command I ran for profiling.

accelerate launch --no_python nsys profile -t cuda,nvtx -o ./rname.%p python3 train.py [args…]

I think the most likely cause is that the current environment is a virtual machine (VM). Could this be the reason?
Thank you for reading.

I would recommend you go with:

nsys profile --trace=cuda,nvtx --sample=none --cpuctxsw=none

This will trace CUDA and NVTX, but turn off the CPU-side backtraces.
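
Adapted to your original accelerate launch line, that would look roughly like the following (the train.py entry point, output name, and [args…] are carried over from your post, so adjust them to your setup):

accelerate launch --no_python nsys profile --trace=cuda,nvtx --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]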

Let me know if that gets you what you need.

Thank you for your reply. There is no big difference with the options you advised. When I run the same application in the same environment on a local machine, there is no such large gap in runtime.
By the way, the problematic experimental environment runs as Docker-out-of-Docker (DooD); could that have had an effect?

When you open the run in the GUI, is there a lot of activity in the profiler overhead row?

@skottapalli can you help here?

@psh2018314072 - could you please share the nsys-rep file so that I can investigate?

Of course. Here it is.
report_1.zip (54.5 MB)

There is nothing obvious in the report file that suggests why you are seeing 3x overhead with profiling your application.

I can suggest a few things to help debug here:

  1. Please upgrade to the 2024.2 version of nsys (the latest release on Nsight Systems | NVIDIA Developer).
  2. Please remove --gpu-metric-device=all switch from your command line and see if it helps with the overhead
  3. Along with 2, please add --sample=none --cpuctxsw=none switches to your command line and see if it helps with the overhead
  4. Along with 2 and 3, please remove cudnn and nvtx from the -t switch in your command line and see if it helps with the overhead
  5. Along with 2 and 3, please use -t none in your command line and see if it helps with the overhead.

This should help pinpoint which feature is causing the overhead in your application.
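
As a rough sketch, suggestions 2 through 5 might translate to command lines like the following, each still wrapped in your accelerate launch --no_python invocation as before (this assumes your actual command line used -t cuda,cudnn,nvtx and --gpu-metric-device=all; adjust to match what you really ran):

nsys profile -t cuda,cudnn,nvtx -o ./rname.%p python3 train.py [args…]
nsys profile -t cuda,cudnn,nvtx --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]
nsys profile -t cuda --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]
nsys profile -t none --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]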

I’m sorry for the late response; I was tied up with another schedule.
As you advised, I added those options, but there was no big change. However, when I ran the application on my local server instead of the VM, the profiling overhead was significantly reduced! How should I understand this situation?

As you advised, I added those options, but there was no big change.

Which of the 5 suggestions did you try? Please share the report file for each of the 5 suggestions I made.

When I ran the application on my local server instead of the VM, the profiling overhead was significantly reduced! How should I understand this situation?

I don’t think I understand your two setups well enough to help you here. I have no visibility into your VM setup to advise you one way or another.