Probing NVLinks in HGX-2

Dear,

We are running experiments on an HGX-2 system (16 × V100 GPUs), of which only a subset (4 × V100 GPUs) is exposed to us through a Docker container.
We did observe NVLink activity (bytes sent and received) when profiling applications with NCU. However, the “nvlink_table” and “nvlink_topology” sections both reported “Not supported for NVSwitch.”
I believe the V100 GPUs in the HGX-2 are indeed connected through NVSwitches, so do you know what this message means?

Moreover, it seems that NCU profiles NVLink activity only at a very coarse level: it samples once per kernel and does not record the source and destination of packets. Is that true?

Best regards,
Z.

Hi, @ziyue.zhang

The output means that Nsight Compute does not support NVSwitch-related analysis. This is a current restriction of the tool.
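If the goal is only to see which links are up and what sits at the other end of each one, a query through NVML may help. Below is a minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed and that the container exposes NVML; `probe_nvlinks` is a hypothetical helper name, not part of any NVIDIA tool. On an NVSwitch system I would expect the remote endpoint of each link to be a switch rather than a peer GPU, so this shows link wiring, not which GPU is talking to which.

```python
# Sketch only: lists active NVLinks per GPU through NVML.
# Assumption: `nvml` is the pynvml module (from the nvidia-ml-py package).
# It is passed in as a parameter so the helper can be exercised without a GPU.

def probe_nvlinks(nvml):
    """Return {gpu_index: [(link_index, remote_pci_bus_id), ...]} for enabled links."""
    nvml.nvmlInit()
    try:
        result = {}
        for gpu in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(gpu)
            links = []
            for link in range(nvml.NVML_NVLINK_MAX_LINKS):
                try:
                    state = nvml.nvmlDeviceGetNvLinkState(handle, link)
                    if state != nvml.NVML_FEATURE_ENABLED:
                        continue
                    pci = nvml.nvmlDeviceGetNvLinkRemotePciInfo(handle, link)
                except nvml.NVMLError:
                    # Link index not present or query not supported on this GPU.
                    continue
                bus_id = pci.busId
                if isinstance(bus_id, bytes):
                    bus_id = bus_id.decode()
                links.append((link, bus_id))
            result[gpu] = links
        return result
    finally:
        nvml.nvmlShutdown()

# Usage on a machine where NVML is available:
#   import pynvml
#   for gpu, links in probe_nvlinks(pynvml).items():
#       print(gpu, links)
```

The bus IDs can then be compared with the device list from `nvidia-smi` or `lspci` to tell GPUs and switches apart.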

Hi, thanks for the reply.

Does this mean that Nsight Compute cannot trace the exact source and destination of data streams between GPUs? If so, do you know of any other tool that has this capability?

Thanks!

Yes, your understanding is right. This is currently a known limitation of NVLink profiling.

As for other tools, I checked internally and was told that Nsight Systems also provides NVLink metrics; see the Nsight Systems 2024.1 User Guide. However, this is only supported on Turing and newer GPUs, and the metrics are not per link but aggregated over all links on the GPU.
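Since the profiler metrics carry no source or destination, one possible workaround is to do the accounting in the application itself, at the places where it issues GPU-to-GPU copies. The following is only a sketch of that idea: `PeerTrafficLog` and its methods are hypothetical names, and it only captures transfers the application explicitly reports, not peer loads and stores issued from inside kernels.

```python
# Sketch only: application-level bookkeeping of GPU-to-GPU transfers.
# Assumption: the application calls record() next to each peer copy it issues.
from collections import defaultdict


class PeerTrafficLog:
    """Accumulates bytes transferred per (source GPU, destination GPU) pair."""

    def __init__(self):
        self._bytes = defaultdict(int)

    def record(self, src_gpu, dst_gpu, nbytes):
        """Add one transfer of nbytes from src_gpu to dst_gpu."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        self._bytes[(src_gpu, dst_gpu)] += nbytes

    def matrix(self, num_gpus):
        """Return a num_gpus x num_gpus list of lists; entry [s][d] is bytes s -> d."""
        return [
            [self._bytes.get((s, d), 0) for d in range(num_gpus)]
            for s in range(num_gpus)
        ]


# Example with the 4 GPUs visible in the container.
log = PeerTrafficLog()
log.record(0, 1, 4096)   # e.g. next to a peer copy from GPU 0 to GPU 1
log.record(0, 1, 4096)
log.record(2, 3, 1024)
for row in log.matrix(4):
    print(row)
```

The per-pair totals can then be compared against the aggregated send and receive bytes that the profiler reports for each GPU.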