I also tried to use the NVML API to do this, but the function nvmlDeviceGetNvLinkUtilizationCounter takes a parameter unsigned int counter. How do I determine what value to pass for this parameter?
Or is there any other way I can get the throughput of NVLink?
Any help would be appreciated.
Here is my GPU info:
nvlink --capabilities -i 0:
GPU 0: NVIDIA A100-SXM4-80GB
Link 0, P2P is supported: true
Link 0, Access to system memory supported: true
Link 0, P2P atomics supported: true
Link 0, System memory atomics supported: true
Link 0, SLI is supported: false
Link 0, Link is supported: false
nvlink is a subcommand of nvidia-smi, so the command should be:
nvidia-smi nvlink ...
The argument 0 is not correct. According to the nvidia-smi command-line help, the valid arguments for -gt are d and r:
$ nvidia-smi nvlink --help
nvlink -- Display NvLink information.
Usage: nvidia-smi nvlink [options]
Options include:
[-h | --help]: Display help information
[-i | --id]: Enumeration index, PCI bus ID or UUID.
[-l | --link]: Limit a command to a specific link. Without this flag, all link information is displayed.
[-s | --status]: Display link state (active/inactive).
[-c | --capabilities]: Display link capabilities.
[-p | --pcibusid]: Display remote node PCI bus ID for a link.
[-R | --remotelinkinfo]: Display remote device PCI bus ID and NvLink ID for a link.
[-sc | --setcontrol]: Setting counter control is deprecated!
[-gc | --getcontrol]: Getting counter control is deprecated!
[-g | --getcounters]: Getting counters using option -g is deprecated.
Please use option -gt/--getthroughput instead.
[-r | --resetcounters]: Resetting counters is deprecated!
[-e | --errorcounters]: Display error counters for a link.
[-ec | --crcerrorcounters]: Display per-lane CRC error counters for a link.
[-re | --reseterrorcounters]: Reset all error counters to zero.
[-gt | --getthroughput]: Display link throughput counters for specified counter type
The arguments consist of character string representing the type of traffic counted:
d: Display tx and rx data payload in KiB
r: Display tx and rx data payload and protocol overhead in KiB if supported
When I run a command like that, I get this output:
$ nvidia-smi nvlink -gt d
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-ada9b876-62e3-2837-f48f-e451080db283)
Link 0: Raw Tx: 49596123 KiB
Link 0: Raw Rx: 49596040 KiB
Link 1: Raw Tx: 49596123 KiB
Link 1: Raw Rx: 49596040 KiB
Link 2: Raw Tx: 49596081 KiB
Link 2: Raw Rx: 49595840 KiB
Link 3: Raw Tx: 49596081 KiB
Link 3: Raw Rx: 49595840 KiB
Link 4: Raw Tx: 49596104 KiB
Link 4: Raw Rx: 49596252 KiB
Link 5: Raw Tx: 49596104 KiB
Link 5: Raw Rx: 49596252 KiB
Link 6: Raw Tx: 49596104 KiB
Link 6: Raw Rx: 49596252 KiB
Link 7: Raw Tx: 49596104 KiB
Link 7: Raw Rx: 49596252 KiB
Link 8: Raw Tx: 49596123 KiB
Link 8: Raw Rx: 49596040 KiB
Link 9: Raw Tx: 49596123 KiB
Link 9: Raw Rx: 49596040 KiB
Link 10: Raw Tx: 49596081 KiB
Link 10: Raw Rx: 49595840 KiB
Link 11: Raw Tx: 49596081 KiB
Link 11: Raw Rx: 49595840 KiB
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-0dc422dd-e02d-00ac-7aa3-666d573295e9)
Link 0: Raw Tx: 49596000 KiB
Link 0: Raw Rx: 49595840 KiB
Link 1: Raw Tx: 49596000 KiB
Link 1: Raw Rx: 49595840 KiB
Link 2: Raw Tx: 49596000 KiB
Link 2: Raw Rx: 49595840 KiB
Link 3: Raw Tx: 49596000 KiB
Link 3: Raw Rx: 49595840 KiB
Link 4: Raw Tx: 49596045 KiB
Link 4: Raw Rx: 49596128 KiB
Link 5: Raw Tx: 49596045 KiB
Link 5: Raw Rx: 49596128 KiB
Link 6: Raw Tx: 49596023 KiB
Link 6: Raw Rx: 49596253 KiB
Link 7: Raw Tx: 49596023 KiB
Link 7: Raw Rx: 49596253 KiB
Link 8: Raw Tx: 49596045 KiB
Link 8: Raw Rx: 49596128 KiB
Link 9: Raw Tx: 49596045 KiB
Link 9: Raw Rx: 49596128 KiB
Link 10: Raw Tx: 49596023 KiB
Link 10: Raw Rx: 49596253 KiB
Link 11: Raw Tx: 49596023 KiB
Link 11: Raw Rx: 49596253 KiB
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-e2844c6d-e4fb-d4cb-550a-d043c4c8864d)
Link 0: Raw Tx: 49596257 KiB
Link 0: Raw Rx: 49596028 KiB
Link 1: Raw Tx: 49596257 KiB
Link 1: Raw Rx: 49596028 KiB
Link 2: Raw Tx: 49596232 KiB
Link 2: Raw Rx: 49595841 KiB
Link 3: Raw Tx: 49596232 KiB
Link 3: Raw Rx: 49595841 KiB
Link 4: Raw Tx: 49596261 KiB
Link 4: Raw Rx: 49596113 KiB
Link 5: Raw Tx: 49596261 KiB
Link 5: Raw Rx: 49596113 KiB
Link 6: Raw Tx: 49596261 KiB
Link 6: Raw Rx: 49596113 KiB
Link 7: Raw Tx: 49596261 KiB
Link 7: Raw Rx: 49596113 KiB
Link 8: Raw Tx: 49596257 KiB
Link 8: Raw Rx: 49596028 KiB
Link 9: Raw Tx: 49596257 KiB
Link 9: Raw Rx: 49596028 KiB
Link 10: Raw Tx: 49596232 KiB
Link 10: Raw Rx: 49595841 KiB
Link 11: Raw Tx: 49596232 KiB
Link 11: Raw Rx: 49595841 KiB
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-cad33e39-970d-d3fe-9ab8-ad2b4bcbeb57)
Link 0: Raw Tx: 49595848 KiB
Link 0: Raw Rx: 49596008 KiB
Link 1: Raw Tx: 49595848 KiB
Link 1: Raw Rx: 49596008 KiB
Link 2: Raw Tx: 49595848 KiB
Link 2: Raw Rx: 49596008 KiB
Link 3: Raw Tx: 49595848 KiB
Link 3: Raw Rx: 49596008 KiB
Link 4: Raw Tx: 49595851 KiB
Link 4: Raw Rx: 49596093 KiB
Link 5: Raw Tx: 49595851 KiB
Link 5: Raw Rx: 49596093 KiB
Link 6: Raw Tx: 49595844 KiB
Link 6: Raw Rx: 49596235 KiB
Link 7: Raw Tx: 49595844 KiB
Link 7: Raw Rx: 49596235 KiB
Link 8: Raw Tx: 49595851 KiB
Link 8: Raw Rx: 49596093 KiB
Link 9: Raw Tx: 49595851 KiB
Link 9: Raw Rx: 49596093 KiB
Link 10: Raw Tx: 49595844 KiB
Link 10: Raw Rx: 49596235 KiB
Link 11: Raw Tx: 49595844 KiB
Link 11: Raw Rx: 49596235 KiB
Also note that the -gt option is only available with a “recent” driver; older drivers use the older -g syntax. Check the command-line help of the nvidia-smi installed on your system to see which one applies, or update your GPU driver.
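The values printed by -gt look like running totals in KiB rather than rates, so to get an actual throughput you would take two samples and divide the difference by the time between them. Here is a minimal Python sketch of that idea, assuming the counters are cumulative and the output has the same layout as shown above. The helper names (parse_counters, throughput_kib_per_s, sample) are my own and not part of any NVIDIA tool.

```python
import re
import subprocess
import time

# Match "GPU 0: ..." header lines and "Link 0: Raw Tx: 123 KiB" counter lines.
# "Data" is accepted as well, in case the label differs by counter type.
GPU_RE = re.compile(r"^\s*GPU (\d+):")
LINK_RE = re.compile(r"^\s*Link (\d+): (?:Raw|Data) (Tx|Rx): (\d+) KiB")


def parse_counters(text):
    """Return {(gpu, link, direction): KiB} parsed from -gt style output."""
    counters = {}
    gpu = None
    for line in text.splitlines():
        m = GPU_RE.match(line)
        if m:
            gpu = int(m.group(1))
            continue
        m = LINK_RE.match(line)
        if m and gpu is not None:
            counters[(gpu, int(m.group(1)), m.group(2))] = int(m.group(3))
    return counters


def throughput_kib_per_s(before, after, seconds):
    """Per-link rate in KiB/s from two counter snapshots taken `seconds` apart."""
    return {k: (after[k] - before[k]) / seconds for k in after if k in before}


def sample():
    """Run nvidia-smi once and parse its output (needs a driver that supports -gt)."""
    out = subprocess.run(
        ["nvidia-smi", "nvlink", "-gt", "d"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_counters(out)


if __name__ == "__main__":
    interval = 1.0  # seconds between the two samples
    first = sample()
    time.sleep(interval)
    second = sample()
    rates = throughput_kib_per_s(first, second, interval)
    for (gpu, link, direction), rate in sorted(rates.items()):
        print(f"GPU {gpu} Link {link} {direction}: {rate:.1f} KiB/s")
```

The sleep interval is only approximate, so for more accurate numbers you could timestamp each sample and divide by the measured elapsed time instead.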