A problem where the duration in the Nsys report and the actual application runtime are very different

Hi, I recently tried to profile an LLM-training application running in a multi-GPU environment, but I ran into a big problem.
The duration (148 s) in the nsys report is about three times longer than the actual application runtime (44 s).
Even accounting for profiling overhead, the difference is so large that I don’t know what the problem is.
Below is the command I ran for profiling.

accelerate launch --no_python nsys profile -t cuda,nvtx -o ./rname.%p python3 train.py [args…]

I think the most likely cause is that the current environment is a virtual machine (VM). Could this be the reason?
Thank you for reading.

I would recommend you go with:

nsys profile --trace=cuda,nvtx --sample=none --cpuctxsw=none

This will trace CUDA and NVTX, but turn off the CPU-side backtraces.
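
Adapted to your original accelerate launch line, that would look roughly like the following (the train.py entry point, output name, and [args…] are carried over from your post, so adjust them to your setup):

accelerate launch --no_python nsys profile --trace=cuda,nvtx --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]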

Let me know if that gets you what you need.

Thank you for your reply. There is no big difference with the options you advised. When I run the same application in the same environment on a local machine, there is no such large gap in runtime.
By the way, the problematic experimental environment runs as Docker-out-of-Docker (DooD); could that have had an effect?

When you open the run in the GUI, is there a lot of activity in the profiler overhead row?

@skottapalli can you help here?

@psh2018314072 - could you please share the nsys-rep file so that I can investigate?

Of course. Here it is.
report_1.zip (54.5 MB)

There is nothing obvious in the report file that suggests why you are seeing 3x overhead with profiling your application.

I can suggest a few things to help debug here:

  1. Please upgrade to the 2024.2 version of nsys (the latest release on Nsight Systems | NVIDIA Developer).
  2. Please remove --gpu-metric-device=all switch from your command line and see if it helps with the overhead
  3. Along with 2, please add --sample=none --cpuctxsw=none switches to your command line and see if it helps with the overhead
  4. Along with 2 and 3, please remove cudnn and nvtx from the -t switch in your command line and see if it helps with the overhead
  5. Along with 2 and 3, please use -t none in your command line and see if it helps with the overhead.

This should help pinpoint which feature is causing the overhead in your application.
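
As a rough sketch, suggestions 2 through 5 might translate to command lines like the following, each still wrapped in your accelerate launch --no_python invocation as before (this assumes your actual command line used -t cuda,cudnn,nvtx and --gpu-metric-device=all; adjust to match what you really ran):

nsys profile -t cuda,cudnn,nvtx -o ./rname.%p python3 train.py [args…]
nsys profile -t cuda,cudnn,nvtx --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]
nsys profile -t cuda --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]
nsys profile -t none --sample=none --cpuctxsw=none -o ./rname.%p python3 train.py [args…]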

I’m sorry for the late response; I was tied up with another schedule.
As you advised, I added those options, but there was no big change. However, when I ran the application on my local server instead of the VM, the profiling overhead was significantly reduced! How should I understand this situation?

As you advised, I added those options, but there was no big change.

Which of the 5 suggestions did you try? Please share the report file for each of the 5 suggestions I made.

When I ran the application on my local server instead of the VM, the profiling overhead was significantly reduced! How should I understand this situation?

I don’t think I understand your two setups well enough to help you here. I have no visibility into your VM setup to advise you one way or another.