Multi-Node Profiling with Nsight Systems

Greetings,

I would like to know the best practices for profiling a PyTorch multi-node training job with SLURM. I am interested in:

  • Communication and synchronization between nodes
  • GPU memory utilization
  • NIC metrics and data exchange
  • Event timeline (something similar to the Horovod timeline)

For now I am using this command:

srun nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi --cudabacktrace=true -x true --gpu-metrics-device=all --sample=none --force-overwrite=true --nic-metrics=true \
    --duration=120 --capture-range=cudaProfilerApi --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd

Moreover, I watched a YouTube video that mentions this in the slides:

I am not sure what it means or how to write the command to avoid this.

Best Regards

Greetings,

The video is suggesting that you move your nsys execution into a script, so that you can modify the command-line arguments based on whether it is the “local rank 0”. The rationale is that if you have multiple ranks running on the same physical machine, the run will fail, because GPU metrics collection can only be done by one process at a time. NIC metrics are also a “system-wide” collection, so there is no reason for each rank on the machine to collect them.

Make a script (e.g. profile-app.sh) and put the nsys profile .... command in it, but without --nic-metrics or --gpu-metrics-device. Then include logic that appends those options back onto the command line when it is local rank 0, as in the sketch below.
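
A minimal sketch of what profile-app.sh could look like (the SLURM_LOCALID check and the exact option set are assumptions carried over from the command earlier in the thread):

#!/bin/bash
# profile-app.sh (sketch): re-add the system-wide collection options
# only for the node-local rank 0 process.

# SLURM_LOCALID is the node-local task ID that Slurm assigns to each process.
EXTRA_OPTS=""
if [ "${SLURM_LOCALID}" = "0" ]; then
    # GPU and NIC metrics are system-wide, so collect them once per node.
    EXTRA_OPTS="--gpu-metrics-device=all --nic-metrics=true"
fi

exec nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi \
    --cudabacktrace=true -x true --sample=none --force-overwrite=true \
    --duration=120 --capture-range=cudaProfilerApi ${EXTRA_OPTS} \
    --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd

Remember to make the script executable (chmod +x profile-app.sh) so that srun can launch it.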

Finally, change your srun command to simply be:

srun profile-app.sh

So if I only have one process per node, this doesn’t apply to my case, right?
Also, do you have any recommendations for the collection command I am using?

Thanks

Correct: if you only have one process per machine being launched, then there is no need to modify the command line.

With respect to your command line: it looks reasonable. Are you getting the results you are hoping for? I think you may need to add --cuda-memory-usage=true to get the memory allocation timeline. Also, NIC metrics are only available on NVIDIA network devices.
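
For reference, a sketch of the command with that flag added (everything else carried over unchanged from the command earlier in the thread):

srun nsys profile -w true --trace=cuda,nvtx,osrt,cudnn,cublas,mpi --cudabacktrace=true -x true \
    --gpu-metrics-device=all --sample=none --force-overwrite=true --nic-metrics=true \
    --cuda-memory-usage=true --duration=120 --capture-range=cudaProfilerApi \
    --output=logs/${EXP_NAME}/%q{SLURM_NODEID}_%q{SLURM_PROCID}.nsys-rep \
    $torchrun_cmd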

I added this already, thanks so much!

Regarding the results, I am not sure, but I have two concerns:

  1. GPU metrics are no longer displayed after a few seconds.
  2. Is there any way to enhance the NIC metrics? Are they even being captured?

Best

One thing to keep in mind is that NIC metrics, like GPU metrics, are system-wide and will include activity that is unrelated to your application.

It appears that there is some IB traffic going on, but if you aren’t seeing communication where you expect it, then it could be that your application is using a different network interface (e.g. Ethernet). You may want to engage your friendly local sysadmin to help you confirm that your job is utilizing the appropriate communication channel.
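
If you want to check this from the application side, one option (assuming the job communicates via NCCL, which is typical for PyTorch distributed training) is to enable NCCL's logging and look at which network transport it reports:

# Assumption: the job uses NCCL as its communication backend.
export NCCL_DEBUG=INFO
srun $torchrun_cmd 2>&1 | grep 'NCCL INFO NET'
# Lines mentioning NET/IB indicate InfiniBand; NET/Socket indicates plain TCP.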

Now, with respect to the GPU metrics: that is a new one for me. I’m not sure what to suggest specifically, but here are some general steps we can follow:

  1. Start single-node: if you don’t run in parallel, how do things look? This can help rule out cases where the multiple profilers are somehow conflicting with one another.
  2. Shorter session, fewer features. Let’s confirm that collecting a smaller amount of data goes smoothly. Try something like -s none -t cuda,nvtx --gpu-metrics-device=all --gpu-metrics-frequency=100 (see the sketch after this list).
  3. Make sure you are using up-to-date versions of the CUDA driver and nsys.
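
For step 2, a minimal single-node sanity check could look like the following (the --duration value and the output name are illustrative assumptions):

nsys profile -s none -t cuda,nvtx \
    --gpu-metrics-device=all --gpu-metrics-frequency=100 \
    --duration=30 --force-overwrite=true -o gpu-metrics-check.nsys-rep \
    $torchrun_cmd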

Hello, this answer is very useful. I can obtain these metrics in the Docker environment, but only the GPU metrics for one node. How can I obtain metrics for multiple nodes?

This is the command in my shell:

nsys profile -s none -t cuda,nvtx,osrt --gpu-metrics-device=all --force-overwrite=true --nic-metrics=true \
    --capture-range=cudaProfilerApi --capture-range-end=stop -o logs/gpt2_%p.nsys-rep \
    $torchrun_cmd

Thanks

Hello @tfmxtgx0394,

The OP here was using Slurm, so the srun command is launching the same command on multiple nodes. If you have something else that is launching the parallel work, you will need to make sure that it is running nsys and not just the application binary. Please review the documentation regarding Application Launchers.
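
For example, with Slurm, launching one nsys instance per node could look like this (a sketch; the option set is borrowed from your command, and %q{SLURM_NODEID} reuses the per-node output naming from earlier in the thread):

srun --ntasks-per-node=1 nsys profile -s none -t cuda,nvtx,osrt \
    --gpu-metrics-device=all --nic-metrics=true --force-overwrite=true \
    --capture-range=cudaProfilerApi --capture-range-end=stop \
    -o logs/gpt2_%q{SLURM_NODEID}.nsys-rep \
    $torchrun_cmd

The key point is that nsys itself is the binary the launcher starts on each node, with the application command as its argument.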