Multi-Node Profiling with Nsight Systems

Greetings,

I would like to know the best practices for profiling a PyTorch multi-node training job with SLURM. I am interested in:

  • Communication and synchronization between nodes
  • GPU memory utilization
  • NIC metrics and data exchange
  • Event timeline (something similar to the Horovod timeline)

For now I am using this command:

srun nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi --cudabacktrace=true -x true --gpu-metrics-device=all --sample=none --force-overwrite=true --nic-metrics=true \
    --duration=120 --capture-range=cudaProfilerApi --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd

Moreover, I watched a YouTube video that mentions this in the slides:

I am not sure what it means or how to write the command to avoid this.

Best Regards

Greetings,

The video is suggesting that you move your nsys execution into a script, so that you can modify the command-line arguments based on whether it is the “local rank 0”. The rationale is that if you have multiple ranks running on the same physical machine, the run will fail, because GPU metrics collection can only be done by one process at a time. NIC metrics are also a “system-wide” collection, so there is no reason for each rank on the machine to collect them.

Make a script (e.g. profile-app.sh) and put the nsys profile .... command in it, but without --nic-metrics or --gpu-metrics-device. Then include logic that appends those options back onto the command line when it is local rank 0, as in the sketch below.
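
A minimal sketch of what profile-app.sh could look like (the SLURM_LOCALID check and the exact option set are assumptions carried over from the command earlier in the thread):

#!/bin/bash
# profile-app.sh (sketch): re-add the system-wide collection options
# only for the node-local rank 0 process.

# SLURM_LOCALID is the node-local task ID that Slurm assigns to each process.
EXTRA_OPTS=""
if [ "${SLURM_LOCALID}" = "0" ]; then
    # GPU and NIC metrics are system-wide, so collect them once per node.
    EXTRA_OPTS="--gpu-metrics-device=all --nic-metrics=true"
fi

exec nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi \
    --cudabacktrace=true -x true --sample=none --force-overwrite=true \
    --duration=120 --capture-range=cudaProfilerApi ${EXTRA_OPTS} \
    --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd

Remember to make the script executable (chmod +x profile-app.sh) so that srun can launch it.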

Finally, change your srun command to simply be:

srun profile-app.sh

So if I only have one process per node, this doesn’t apply to my case, right?
Also, do you have any recommendations for the collection command I am using?

Thanks

Correct: if you only have one process per machine being launched, then there is no need to modify the command line.

With respect to your command line: it looks reasonable. Are you getting the results you are hoping for? I think you may need to add --cuda-memory-usage=true to get the memory allocation timeline. Also, NIC metrics are only available on NVIDIA network devices.
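
For reference, a sketch of the command with that flag added (everything else carried over unchanged from the command earlier in the thread):

srun nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi --cudabacktrace=true -x true \
    --gpu-metrics-device=all --sample=none --force-overwrite=true --nic-metrics=true \
    --cuda-memory-usage=true --duration=120 --capture-range=cudaProfilerApi \
    --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd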

I added this already, thanks so much!

Regarding the results, I am not sure, but I have two concerns:

  1. GPU metrics are no longer displayed after a few seconds.
  2. Is there any way to enhance the NIC metrics? Are they even being captured?

Best

One thing to keep in mind is that NIC metrics, like GPU metrics, are system-wide and will include activity that is unrelated to your application.

It appears that there is some IB traffic going on, but if you aren’t seeing communication where you expect it, then it could be that your application is using a different network interface (e.g. Ethernet). You may want to engage your friendly local sysadmin to help you confirm that your job is utilizing the appropriate communication channel.
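
If you want to check this from the application side, one option (assuming the job communicates via NCCL, which is typical for PyTorch distributed training) is to enable NCCL's logging and look at which network transport it reports:

# Assumption: the job uses NCCL as its communication backend.
export NCCL_DEBUG=INFO
srun $torchrun_cmd 2>&1 | grep 'NCCL INFO NET'
# Lines mentioning NET/IB indicate InfiniBand; NET/Socket indicates plain TCP.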

Now, with respect to the GPU metrics: that is a new one for me. I’m not sure what to suggest specifically, but here are some general steps we can follow:

  1. Start single-node: if you don’t run in parallel, how do things look? This can help rule out cases where the multiple profilers are somehow conflicting with one another.
  2. Shorter session, fewer features. Let’s confirm that collecting a smaller amount of data goes smoothly. Try something like -s none -t cuda,nvtx --gpu-metrics-device=all --gpu-metrics-frequency=100 (see the sketch after this list).
  3. Make sure you are using up-to-date versions of the CUDA driver and nsys.
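
For step 2, a minimal single-node sanity check could look like the following (the --duration value and the output name are illustrative assumptions):

nsys profile -s none -t cuda,nvtx \
    --gpu-metrics-device=all --gpu-metrics-frequency=100 \
    --duration=30 --force-overwrite=true -o gpu-metrics-check.nsys-rep \
    $torchrun_cmd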

Hello, this answer is very useful. I can obtain these metrics in the Docker environment, but only the GPU metrics for one node. How can I obtain metrics for multiple nodes?

This is the command in my shell:

nsys profile -s none -t cuda,nvtx,osrt --gpu-metrics-device=all --force-overwrite=true --nic-metrics=true \
    --capture-range=cudaProfilerApi --capture-range-end=stop -o logs/gpt2_%p.nsys-rep \
    $torchrun_cmd

Thanks

Hello @tfmxtgx0394,

The OP here was using Slurm, so the srun command is launching the same command on multiple nodes. If you have something else that is launching the parallel work, you will need to make sure that it is running nsys and not just the application binary. Please review the documentation regarding Application Launchers.
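
For example, with Slurm, launching one nsys instance per node could look like this (a sketch; the option set is borrowed from your command, and %q{SLURM_NODEID} reuses the per-node output naming from earlier in the thread):

srun --ntasks-per-node=1 nsys profile -s none -t cuda,nvtx,osrt \
    --gpu-metrics-device=all --nic-metrics=true --force-overwrite=true \
    --capture-range=cudaProfilerApi --capture-range-end=stop \
    -o logs/gpt2_%q{SLURM_NODEID}.nsys-rep \
    $torchrun_cmd

The key point is that nsys itself is the binary the launcher starts on each node, with the application command as its argument.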