I am using Nsight Systems 2021.5.1 in the nvcr.io/nvidia/pytorch:20.12-py3 Docker container.
MLNX_OFED_LINUX-5.2-188.8.131.52 is installed in the container.
I was profiling a PyTorch DDP job on 2 NVIDIA DGX A100 machines.
There were no error messages during the run, but error messages appear in the .nsys-rep files when I open them in the GUI. Everything else is fine; only the NIC metric data was not collected.
Hello, could you give me the exact command line you used? (If you do not remember it, it should be in the analysis summary view.)
Also, would you feel comfortable sending us the .nsys-rep file?
@ytebeka you may want to take a look at this.
Yep, it would be nice to:
a. See the command line that you used
b. Have a look at a generated .nsys-rep file, if it is possible.
Hi @hwilper @ytebeka . Thanks for your quick response.
The command line I used was like this:
/opt/nvidia/nsight-systems/2021.5.1/bin/nsys profile \
-t cuda,cudnn,nvtx \
-o $reportName \
--force-overwrite true \
--gpu-metrics-device 0 \
--nic-metrics true \
python3 -m torch.distributed.launch \
--nproc_per_node 8 \
--nnodes $n_node \
--node_rank $rank \
--master_addr $addr \
--master_port $port \
I have sent the .nsys-rep file to @ytebeka by private message.
Thanks for your attention to this.
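In case it helps anyone reproducing this: since --nic-metrics depends on the NICs being visible to the profiled environment, one quick sanity check is whether the container can see any InfiniBand devices at all. The following is only a minimal sketch, assuming the standard Linux sysfs location /sys/class/infiniband; the helper name list_ib_devices is made up for illustration, and a non-empty result does not by itself mean nsys can sample the devices.

```python
from pathlib import Path


def list_ib_devices(sysfs_root="/sys/class/infiniband"):
    """Return {device_name: [port names]} for devices visible under sysfs.

    sysfs_root is a parameter only so the function can be pointed at a
    different directory for testing; the default is the usual Linux path.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        # Directory missing: no RDMA devices are exposed to this environment.
        return {}
    devices = {}
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if ports_dir.is_dir():
            ports = sorted(p.name for p in ports_dir.iterdir())
        else:
            ports = []
        devices[dev.name] = ports
    return devices


if __name__ == "__main__":
    found = list_ib_devices()
    if not found:
        print("No InfiniBand devices visible in this container")
    for name, ports in found.items():
        print(f"{name}: ports {', '.join(ports) or 'none'}")
```

Running it inside the same container that launches nsys shows which devices, if any, are exposed there.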
Hi ytebeka. Did you read my message? Should I upload the nsys-rep file here?
Sorry for not answering earlier, I was on a short vacation.
I received the .nsys-rep file, and it indeed does not contain NIC metrics.
I will use the same container you used and try to reproduce the problem.
I'll update here when I have results.
Hi @ytebeka thanks for your response. As for the details of the container, please refer to
superbenchmark/cuda11.1.1.dockerfile at main · microsoft/superbenchmark (github.com)
I hope this is helpful. Thanks for your attention and help.
A fix for the described problem is planned to be added to the next Nsight Systems release.