If I profile with GPU metrics, I can see NVLINK TX and RX Bandwidth. However, I do not understand what “[Requests|Responses] User data” and “[Requests|Responses] Protocol data” represent in the figures below. Can someone kindly explain? None of the documentation here provides this information.
I understand Protocol Data to be protocol overhead, and in these figures it seems reasonable to assume that Responses User Data indicates the actual data transferred for both TX and RX. However, this is inconsistent with the figures below, where Requests User Data appears to denote the same quantity.
I don’t quite understand your point. Both [Requests|Responses] User Data indicate actual data transferred, and the expectation is that TX on GPU 0 should match RX on GPU 1, assuming there are two GPUs.
Could you elaborate on which tooltips belong to which GPUs?
From the figures above, the one-sided case reports bandwidth and protocol overhead under Responses[User Data|Protocol Data], while the send/recv case reports them under Requests[User Data|Protocol Data] and yet still records some non-negligible values for Responses[User Data|Protocol Data].
Questions
Can you explain precisely what Responses[User Data|Protocol Data] and Requests[User Data|Protocol Data] mean, specifically the reason for the inconsistency among the above metrics?
Definitively, which of the two indicates bandwidth and protocol overhead?
The top figures (Send/Recv):
77-78%: Transmission requested by GPU 1, data received from GPU 0
1%: Transmission requested by GPU 0, data sent to GPU 1
The bottom figures (One-sided Puts):
~75%: Transmission requested by GPU 0, data sent to GPU 1
No data sent from GPU 1 to GPU 0
If the tooltips show sample data at the same time point, the slight discrepancies in throughput (77% vs 78%) are likely due to a phase shift between the sampling windows on the different GPUs.
The percentages in the RX/TX Bandwidth rows sum to the total link utilization (up to 100%). Your recent figures indicate total NVLink utilization of 98%, 97%, 84.7%, and 84.8%, respectively.
NVLINK has separate links for transmit (nvltx) and receive (nvlrx).
NVLINK metrics distinguish between request and response, so a developer can determine if traffic is initiated by the observed GPU or from a remote GPU.
NVLINK transfers data in 16-byte flits. The counters are converted from flits to bytes by the metrics library, which multiplies by 16 bytes/flit. NVLINK can collapse protocol and data into a single packet. I believe in this case it will show as user data, not as protocol (but I could be wrong and I don’t have a setup to test at the moment).
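As a quick sketch of that conversion (the flit counts below are invented sample values, not real counter reads):

// Sketch of the flit -> byte conversion used by the metrics library.
// All flit counts here are hypothetical values for illustration only.
#include <cstdio>
#include <cstdint>

int main() {
    const uint64_t BYTES_PER_FLIT = 16;        // NVLINK flit size
    uint64_t user_flits  = 4096;               // hypothetical request user-data flits
    uint64_t proto_flits = 256;                // hypothetical request protocol flits

    // The __bytes metrics are reported as flits * 16.
    printf("request user data : %llu bytes\n",
           (unsigned long long)(user_flits * BYTES_PER_FLIT));
    printf("request protocol  : %llu bytes\n",
           (unsigned long long)(proto_flits * BYTES_PER_FLIT));
    return 0;
}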
In the tables below, 1 means 1 flit, so the final metric (__bytes) is multiplied by 16.
EXAMPLE 1 : GPU0 WRITES TO GPU1
-- gpu0 sends write command and data
1 gpu0.nvltx__bytes_packet_request_data_protocol 0 or >0 depending on complexity of write
2 gpu0.nvltx__bytes_packet_request_data_user >= 1 depending on the size of the transfer
-- gpu1 receives write command and data
3 gpu1.nvlrx__bytes_packet_request_data_protocol 0 or >0 depending on complexity of write -- matching 1
4 gpu1.nvlrx__bytes_packet_request_data_user >= 1 depending on the size of the transfer -- matching 2
-- gpu1 sends write acknowledge -- these can be coalesced if there is no error
5 gpu1.nvltx__bytes_packet_response_data_protocol 1
-- gpu0 receives write acknowledge
6 gpu0.nvlrx__bytes_packet_response_data_protocol 1
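For reference, a minimal CUDA sketch that produces the EXAMPLE 1 traffic pattern, assuming two NVLINK-connected GPUs with peer access available (buffer size and device indices are arbitrary, and whether the copy actually travels over NVLINK depends on your topology):

// Sketch: GPU 0 writes a buffer into GPU 1's memory (the EXAMPLE 1 pattern).
// With peer access over NVLINK, this should appear as request user data on
// gpu0.nvltx / gpu1.nvlrx plus a small response protocol component (write acks).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64 << 20;            // 64 MiB, arbitrary size
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    if (!can01) { printf("no peer access between GPU 0 and GPU 1\n"); return 0; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);         // let GPU 0 address GPU 1's memory
    void *src0 = nullptr;
    cudaMalloc(&src0, bytes);                 // source buffer on GPU 0
    cudaMemset(src0, 1, bytes);

    cudaSetDevice(1);
    void *dst1 = nullptr;
    cudaMalloc(&dst1, bytes);                 // destination buffer on GPU 1

    cudaSetDevice(0);
    cudaMemcpyPeer(dst1, 1, src0, 0, bytes);  // the GPU0 -> GPU1 write
    cudaDeviceSynchronize();

    cudaFree(src0);
    cudaSetDevice(1);
    cudaFree(dst1);
    return 0;
}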
EXAMPLE 2 : GPU0 READS FROM GPU1
-- gpu0 sends read command
1 gpu0.nvltx__bytes_packet_request_data_protocol >= 1 including address and length
-- gpu1 receives read command
2 gpu1.nvlrx__bytes_packet_request_data_protocol >= 1 -- matching 1
-- gpu1 sends read response
3 gpu1.nvltx__bytes_packet_response_data_protocol >= 0 -- may be collapsed into data response
4 gpu1.nvltx__bytes_packet_response_data_user >= 1, based upon size of requested data
-- gpu0 receives read data from gpu1
5 gpu0.nvlrx__bytes_packet_response_data_protocol >= 0 -- may be collapsed into data response -- matching 3
6 gpu0.nvlrx__bytes_packet_response_data_user >= 1, based upon size of requested data -- matching 4
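And a similar CUDA sketch for the EXAMPLE 2 pattern, where a kernel running on GPU 0 reads memory resident on GPU 1 (again assuming NVLINK peer access; sizes are arbitrary):

// Sketch: a kernel on GPU 0 reads data that lives in GPU 1's memory
// (the EXAMPLE 2 pattern): read requests leave GPU 0 as request protocol
// flits, and the data returns as response user data on gpu0.nvlrx.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void read_peer(const float *peer_src, float *local_dst, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) local_dst[i] = peer_src[i] + 1.0f;   // each load crosses NVLINK
}

int main() {
    const size_t n = 16 << 20;                      // element count, arbitrary
    int can01 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    if (!can01) { printf("no peer access between GPU 0 and GPU 1\n"); return 0; }

    cudaSetDevice(1);
    float *src1 = nullptr;
    cudaMalloc(&src1, n * sizeof(float));           // data lives on GPU 1
    cudaMemset(src1, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);               // let GPU 0 read GPU 1's memory
    float *dst0 = nullptr;
    cudaMalloc(&dst0, n * sizeof(float));           // result stays on GPU 0

    unsigned int blocks = (unsigned int)((n + 255) / 256);
    read_peer<<<blocks, 256>>>(src1, dst0, n);      // GPU 0 initiates the reads
    cudaDeviceSynchronize();

    cudaFree(dst0);
    cudaSetDevice(1);
    cudaFree(src1);
    return 0;
}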
Can you explain precisely what Responses[User Data|Protocol Data] and Requests[User Data|Protocol Data] mean, specifically the reason for the inconsistency among the above metrics?
Inconsistency can exist because (a) the sample periods are misaligned, or (b) the clock rate of each GPU's perfmon is slightly different.
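A toy illustration of (a), treating the per-slice values as utilization percentages (all numbers invented):

// Sketch: the same NVLINK traffic sampled over two slightly phase-shifted
// windows gives slightly different per-sample totals. All numbers are made up.
#include <cstdio>

int main() {
    // Utilization in each of twelve 1 ms slices; traffic stops near the end.
    const double traffic[12] = {78, 78, 78, 78, 78, 78, 78, 78, 78, 78, 70, 0};

    // GPU 0 averages slices [0..9]; GPU 1 averages slices [1..10]: same window
    // length, shifted by one slice relative to each other.
    double gpu0 = 0, gpu1 = 0;
    for (int i = 0; i < 10; ++i) gpu0 += traffic[i];
    for (int i = 1; i < 11; ++i) gpu1 += traffic[i];

    printf("GPU0 window average: %.1f%%  GPU1 window average: %.1f%%\n",
           gpu0 / 10.0, gpu1 / 10.0);   // 78.0% vs 77.2%
    return 0;
}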
Definitively, which of the two indicates bandwidth and protocol overhead?
For a GPU:
bytes received = nvlrx__bytes
in the UI, sum the NVLINK RX* percentages to get the % of theoretical receive bytes - these will not exceed 100%
bytes transmitted = nvltx__bytes
in the UI, sum the NVLINK TX* percentages to get the % of theoretical transmit bytes - these will not exceed 100%
The current NSYS UI only shows these values as a % of the maximum throughput during the sample period. If you know the theoretical maximum from the white paper, you can convert to throughput (bytes/sec).
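For example, a small sketch of that conversion (the 300 GB/s peak is an assumed value; substitute the figure from your GPU's white paper):

// Sketch: converting the NSYS UI percentage into bytes/second.
// PEAK_RX_BYTES_PER_SEC is an assumed figure, not from any specific GPU.
#include <cstdio>

int main() {
    const double PEAK_RX_BYTES_PER_SEC = 300e9;  // assumed 300 GB/s aggregate RX peak
    const double ui_percent = 77.0;              // e.g. the 77% RX Bandwidth sample

    double rx_bytes_per_sec = (ui_percent / 100.0) * PEAK_RX_BYTES_PER_SEC;
    printf("approx. RX throughput: %.1f GB/s\n", rx_bytes_per_sec / 1e9);
    return 0;
}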
There are additional public metrics, available through a custom configuration file, that can output values in bytes or bytes/second.
nvl{rx,tx}__bytes.sum - bytes in sample period
nvl{rx,tx}__bytes.sum.per_second - throughput in bytes/second
nvl{rx,tx}__bytes_data_{user,protocol}.sum.per_second - throughput in bytes/second broken down by user and protocol