The video suggests moving your nsys execution into a script so that you can modify the command-line arguments based on whether the process is the “local rank 0”. The rationale is that if multiple ranks are running on the same physical machine, the run will fail, because GPU metrics collection can only be done by one process at a time. NIC metrics are likewise a “system-wide” collection, so there is no reason for every rank on the machine to collect them.
Make a script (e.g. profile-app.sh) and put the nsys profile .... command in it, but without --nic-metrics or --gpu-metrics-device. Then add logic that appends those options back onto the command line when the process is the local rank 0; a sketch of such a script follows below.
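Here is a minimal sketch of what such a wrapper could look like. It assumes Slurm, which exposes the node-local rank in the SLURM_LOCALID environment variable (Open MPI's launcher uses OMPI_COMM_WORLD_LOCAL_RANK instead); the trace selection, report name, and application arguments are placeholders to adapt to your own run:

```bash
#!/bin/bash
# profile-app.sh -- wrap the application in nsys, enabling the
# system-wide GPU/NIC metric collections only on local rank 0.

EXTRA_ARGS=""
if [ "${SLURM_LOCALID:-0}" -eq 0 ]; then
    # Only one process per node may collect GPU or NIC metrics.
    EXTRA_ARGS="--gpu-metrics-device=all --nic-metrics=true"
fi

# nsys expands %q{SLURM_PROCID} to that environment variable's value,
# giving each rank a uniquely named report file.
exec nsys profile -t cuda,nvtx,mpi $EXTRA_ARGS \
    -o report_rank%q{SLURM_PROCID} "$@"
```

You would then launch it as, e.g., srun ./profile-app.sh ./my_app <args>, so that every rank runs under nsys but only one per node carries the system-wide collections.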
Correct; if only one process is launched per machine, there is no need to modify the command line.
With respect to your command line: it looks reasonable. Are you getting the results you are hoping for? I think you may need to add --cuda-memory-usage to get the memory allocation timeline. Also, NIC metrics are only available on NVIDIA network devices.
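For illustration, a command along these lines is one way to combine the options mentioned so far; the trace selection, report name, and application are assumptions, not your actual command:

```bash
# Sketch: --cuda-memory-usage records the memory allocation timeline;
# --nic-metrics only yields data on NVIDIA network devices.
nsys profile -t cuda,nvtx --cuda-memory-usage=true \
    --nic-metrics=true -o my_report ./my_app
```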
One thing to keep in mind is that NIC metrics, like GPU metrics, are system-wide and will include activity that is unrelated to your application.
It appears that there is some IB (InfiniBand) traffic going on, but if you aren’t seeing communication where you expect it, it could be that your application is using a different network interface (e.g. Ethernet). You may want to engage your friendly local sysadmin to confirm that your job is using the appropriate communication channel.
Now, with respect to the GPU metrics issue: that is a new one for me. I’m not sure what to suggest specifically, but here are some general steps we can follow:
1. Start single-node: if you don’t run in parallel, how do things look? This can help rule out the possibility that the multiple profilers are somehow conflicting with one another.
2. Shorter session, fewer features: let’s confirm that collecting a smaller amount of data goes smoothly. Try something like -s none -t cuda,nvtx --gpu-metrics-device=all --gpu-metrics-frequency=100 (see the full example command after this list).
3. Make sure you are using an up-to-date version of the CUDA driver and nsys.
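As an example, the reduced-scope collection from step 2 could be run as a full command like the following (report name and application are placeholders):

```bash
# Minimal collection: no CPU sampling (-s none), CUDA and NVTX
# tracing only, GPU metrics sampled at a reduced 100 Hz rate.
nsys profile -s none -t cuda,nvtx \
    --gpu-metrics-device=all --gpu-metrics-frequency=100 \
    -o minimal_test ./my_app
```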
Hello, this answer is very useful. I can obtain these metrics in a Docker environment, but only the GPU metrics for a single node. How can I obtain metrics for multiple nodes?
The OP here was using Slurm, so the srun command launches the same command on every node. If something else is launching the parallel work, you will need to make sure it runs nsys and not just the application binary. Please review the documentation regarding Application Launchers.
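As a sketch, with Slurm the key point is that srun launches the nsys wrapper (or nsys itself) rather than the bare binary; the node count, paths, and arguments below are placeholders:

```bash
# Every task on every node starts nsys via the wrapper script from
# earlier in this thread, so each node's local rank 0 also collects
# the system-wide GPU/NIC metrics.
srun -N 2 --ntasks-per-node=4 ./profile-app.sh ./my_app --my-args
```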