I am working with PyTorch for distributed training on A100 GPUs. One odd thing I am observing: as I increase CUDA_DEVICE_MAX_CONNECTIONS, the overlap of communication and compute improves, but so does the device-to-host PCIe traffic. Reducing CUDA_DEVICE_MAX_CONNECTIONS to 1 eliminates that PCIe traffic (even on a single node), but also reduces the overlap of communication and compute.

Can someone help me understand what is going on here, and why this particular variable drives PCIe traffic in this way?

P.S. The topology: each node has 8x A100s connected by NVLink in a 4x2 configuration (two islands of 4 GPUs fully interlinked with NVLink, with the islands joined by a hub).


CUDA_DEVICE_MAX_CONNECTIONS is an environment variable that (roughly) defines the number of hardware queues that CUDA streams can utilize or map onto. When you have more streams than queues, streams are aliased onto the same queues.
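As a side note, a minimal sketch of how the variable is typically set (my own illustration, not from the original posts): the CUDA runtime reads it when the context is created, so it must be exported before the process makes its first CUDA call — in a PyTorch job, effectively before any GPU work happens.

```python
import os

# Sketch (assumption: this runs before any CUDA initialization in the
# process). The default value is 8; valid values are 1-32. Larger values
# give more hardware queues and less stream aliasing.
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "32"

# ... only after this point: import torch, initialize CUDA, launch work ...
```

Setting it after CUDA initialization has no effect, which is why launch scripts (rather than training code) usually export it.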

I’m not really sure how to interpret that statement. To a first-order approximation, no PCIe traffic would mean the GPUs cannot be utilized at all: in the current era, all CUDA activity begins with a transfer of data from host to device over PCIe. So I don’t suppose you mean no PCIe traffic globally, across the entire application execution.

Modifying the variable does not seem to me like something that would affect PCIe traffic. However, setting it to 1 may have unusual side effects on a system with both PCIe and NVLink connections; I have not experimented with it at that level. For example, if a hardware copy queue is associated with an NVLink connection and you reduce the number of HW queues down to just that one, I’m not sure what would happen.


That’s correct, PCIe traffic from host to device exists. The issue is that the device-to-host traffic is unusually high, and we cannot explain it. It only occurs when PyTorch overlaps computation and communication; it does not occur when we turn that overlap off or set CUDA_LAUNCH_BLOCKING=1.

What is also interesting: when we run all-reduce or all-gather at the NCCL level or through PyTorch distributed, the device-to-host traffic is tiny. However, in an actual training job the device-to-host traffic is unusually high, on the order of GB/s.

More info is documented here: Unexpected High PCIe traffic in Distributed Training since PT 2 · Issue #103254 · pytorch/pytorch · GitHub

I am trying to figure out whether it is something fundamental to our infrastructure or a GPU configuration issue.

I’m not really an expert on PyTorch or PyTorch distributed.

NVLink is used for device-to-device transfers in CUDA when peer access is enabled (via cudaDeviceEnablePeerAccess(), for example). Your system topology seems a little unusual to me. Normally I would expect 8x A100 with NVLink connectivity to be an SXM design, which would normally include an NVSwitch; in that case there are no “islands” as far as NVLink is concerned. The other possibility I can think of is PCIe A100s with NVLink bridges, but in that case I wasn’t aware of a 4-way bridge arrangement. Maybe you have some sets of pairwise connections, I’m not sure.

In any event, if the underlying code enables peer access and a direct NVLink connection exists, a transfer should flow over NVLink. Otherwise the transfer will flow via a device->host->device path, which would of course cause your observed device-to-host PCIe traffic to increase. (If no direct NVLink connection exists but peer access is properly enabled, the transfer flows device->device over PCIe; depending on how you are measuring PCIe traffic and on your exact topology, that may or may not show up as device-to-host traffic.) If you really do have two islands, then traffic between them would necessarily involve some device-to-host PCIe activity. And NCCL can, to some degree, identify NVLink paths and choose them for collective operations, avoiding unnecessary PCIe traffic even if it means multiple NVLink hops.

So there is potentially a lot of complexity here. To properly understand it, you need a very precise description of the topology, as well as a granular description of the transfers that are taking place or being requested.
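One quick way to get that precise topology description (my suggestion, not part of the original reply) is `nvidia-smi topo -m`, which prints the interconnect matrix: NV# entries indicate NVLink paths, while PIX/PXB/PHB/SYS indicate paths that traverse PCIe (and possibly the host bridge). A small wrapper, as a sketch:

```python
import shutil
import subprocess

def gpu_topology() -> str:
    # Dump the GPU interconnect matrix. "NV1"/"NV2"/... cells mean the
    # pair is connected by that many NVLinks; "PIX"/"PXB"/"PHB"/"SYS"
    # mean the path goes over PCIe at some level of the hierarchy.
    if shutil.which("nvidia-smi"):
        return subprocess.run(["nvidia-smi", "topo", "-m"],
                              capture_output=True, text=True).stdout
    return "nvidia-smi not found; run this on a GPU node"

print(gpu_topology())
```

On an NVSwitch-based SXM system, every GPU pair should show an NV# entry; any SYS/PHB entries between GPUs would point at the kind of “island” split described above.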


@Robert_Crovella What would be an easy way to verify the paths? Perhaps a simple async all-reduce plus a matrix multiply?

I checked with my infra team and the topology is indeed 8x A100 connected by an NVSwitch fabric. Please bear with me, as I work more at the software/application layer and am trying to get deeper into this unexplained traffic issue (the traffic exists only on our hardware, so I believe it is either an infra issue or a configuration issue).

Yes, it would make sense to start with a simple test case.
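A minimal probe along those lines (my sketch — the buffer sizes and names are arbitrary, and it assumes a NCCL backend launched via torchrun) overlaps an async all-reduce with a matmul loop, which can then be run under a PCIe counter or a profiler while varying CUDA_DEVICE_MAX_CONNECTIONS between runs:

```python
import os

# Assumption: set before any CUDA initialization so the runtime sees it;
# compare runs with "1" vs. "8" vs. "32".
os.environ.setdefault("CUDA_DEVICE_MAX_CONNECTIONS", "8")

try:
    import torch
    import torch.distributed as dist
except ImportError:  # lets the sketch be read/imported without torch installed
    torch = None

def overlap_probe(steps: int = 100) -> None:
    # Launch with: torchrun --nproc_per_node=8 probe.py
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    comm = torch.randn(1 << 26, device="cuda")   # ~256 MB fp32 comm buffer
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    for _ in range(steps):
        work = dist.all_reduce(comm, async_op=True)  # runs on NCCL's stream
        c = a @ b                                    # compute on the default stream
        work.wait()
    torch.cuda.synchronize()
    dist.destroy_process_group()

# Only run under a real distributed launch on a GPU node.
if torch is not None and torch.cuda.is_available() and "RANK" in os.environ:
    overlap_probe()
```

The idea is simply to reproduce the comm/compute overlap in isolation; if the device-to-host traffic appears here too, the training framework can be ruled out.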

Is there any way to manually control which queue a given stream maps to?

Not that I know of.