How to measure NVLink performance while running HPL

Hello,

I’m currently evaluating a DGX A100 using the NVIDIA HPC-Benchmarks container.
While running HPL across multiple GPUs, I would like to measure the performance (bandwidth) of NVLink.

I tried “nvidia-smi nvlink -gt d -i 0” (with -i selecting the device index), but the throughput counters show no change between samples taken before and after the HPL run.
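
Concretely, I took a snapshot of the counters before and after the run, roughly like this (the file names are just for illustration):

nvidia-smi nvlink -gt d -i 0 > nvlink_before.txt
# ... launch HPL in the container and wait for it to finish ...
nvidia-smi nvlink -gt d -i 0 > nvlink_after.txt
diff nvlink_before.txt nvlink_after.txt   # reports no differences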

before HPL run ------------------------
2021. 07. 19. (mon) 17:45:00 KST
GPU 0: A100-SXM-80GB (UUID: )
Link 0: Data Tx: 991870746 KiB
Link 0: Data Rx: 994428615 KiB
Link 1: Data Tx: 991828209 KiB
Link 1: Data Rx: 994386931 KiB
… (Links 2–10 omitted)
Link 11: Data Tx: 990447250 KiB
Link 11: Data Rx: 993019367 KiB

after HPL run ------------------------
2021. 07. 19. (mon) 17:46:15 KST
GPU 0: A100-SXM-80GB (UUID: )
Link 0: Data Tx: 991870746 KiB
Link 0: Data Rx: 994428615 KiB
Link 1: Data Tx: 991828209 KiB
Link 1: Data Rx: 994386931 KiB
… (Links 2–10 omitted)
Link 11: Data Tx: 990447250 KiB
Link 11: Data Rx: 993019367 KiB

---- HPL run result in Docker ---------------------------------
2021-07-19 08:46:10.996
T/V N NB P Q Time Gflops
WRxxxxxx 5xxxxx 2xx x x2 9.96 1.306e+04
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.xxxxxxxx … PASSED

===============
To keep the evaluation short, the HPL run takes approx. 10 s using all 8 A100 GPUs in the DGX A100.

The “nvidia-smi nvlink -gt d -i 0” command was executed outside of the Docker container.
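
In case the Docker namespace matters, I believe the equivalent check from inside the container would be something like this (the container name is a placeholder):

docker exec <hpc-benchmarks-container> nvidia-smi nvlink -gt d -i 0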

For comparison, I also tried the “p2pBandwidthLatencyTest” from the CUDA samples.
After running “p2pBandwidthLatencyTest”, the NVLink throughput counters did change.
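
I built and ran the test roughly as follows (the samples path differs between CUDA versions, so this is just how it looked on my system):

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
make
./p2pBandwidthLatencyTest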

Before “p2pBandwidthLatencyTest” --------------------------------
GPU 1: A100-SXM-80GB (UUID: )
Link 0: Data Tx: 992685174 KiB
Link 0: Data Rx: 992327236 KiB
Link 1: Data Tx: 992642624 KiB
Link 1: Data Rx: 992284697 KiB
… (Links 2–10 omitted)
Link 11: Data Tx: 991270582 KiB
Link 11: Data Rx: 990903740 KiB

After “p2pBandwidthLatencyTest” --------------------------------
GPU 1: A100-SXM-80GB (UUID: )
Link 0: Data Tx: 993596626 KiB
Link 0: Data Rx: 993238688 KiB
Link 1: Data Tx: 993554076 KiB
Link 1: Data Rx: 993196150 KiB
… (Links 2–10 omitted)
Link 11: Data Tx: 992182035 KiB
Link 11: Data Rx: 991815192 KiB

The results of “p2pBandwidthLatencyTest” show that NVLink was used.
But I don’t understand why the NVLink throughput counters did not change during the HPL run.
(I would expect HPL to use NVLink for better performance.)

Here are my questions:
How can I check the NVLink throughput when using the NVIDIA HPL Docker container?
Is my approach correct for measuring NVLink performance?
Does the HPL (in the container) utilize NVLink or not?
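
If the nvidia-smi counters are not the right tool for this, I am also open to alternatives. For example, I understand DCGM can sample NVLink traffic per interval with something like the following (the field IDs 1011/1012 for NVLink TX/RX bytes are my reading of the DCGM documentation, so please correct me if they are wrong):

dcgmi dmon -e 1011,1012 -d 1000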