Slow memcpy performance in dual-CPU, 10 GPU system

I’ve been trying to diagnose some difficult performance problems in a dual-CPU, 10x RTX A4000 system. There seem to be multiple issues causing lower than expected performance (see my earlier topic: Multi-GPU contention inside CUDA). After a lot of debugging I’ve identified one of the underlying problems, which is related to memory copies (both H->D and D->H). In general, memory throughput is much lower than expected as the load increases. The workload is a TensorRT model that is driven by two threads per GPU (20 threads in total in this case).
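
For context, each worker thread essentially does the following per inference. This is a simplified sketch rather than the actual code: buffer names, sizes, and the runInference() call are placeholders for the real TensorRT enqueue logic.

#include <cuda_runtime.h>

// Simplified sketch of one worker thread (two of these run per GPU).
// h_input/h_output are currently ordinary (pageable) host buffers;
// d_input/d_output are device buffers bound to the TensorRT engine.
void runInference(cudaStream_t stream);  // placeholder for the actual enqueue call

void workerLoop(cudaStream_t stream,
                void* h_input, void* d_input, size_t inputBytes,
                void* h_output, void* d_output, size_t outputBytes /* ~21 MB */)
{
    for (;;) {
        // H->D: copy the next input to the GPU
        cudaMemcpyAsync(d_input, h_input, inputBytes, cudaMemcpyHostToDevice, stream);
        runInference(stream);
        // D->H: copy the ~21 MB model output back to the host
        // (these copies are the red parts in the traces below)
        cudaMemcpyAsync(h_output, d_output, outputBytes, cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    }
}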

Low concurrency, low load

This is part of a trace that runs on 1 GPU, with a very light CPU and GPU utilization:

[trace screenshot]

GPU utilization is between 30% and 70%, CPU less than 5%.

The red parts show D->H copies. The throughput varies a bit: the slowest is 2.4GiB/s, the fastest 6.5GiB/s. Though a bit flaky, these results are generally within expected parameters (I think).

High concurrency, medium load

This is the trace I get when running medium load, distributed over 10 GPUs (used MPS for this to rule out GPU context switching):

[trace screenshot]

Note that the slowest copy is now around 690MiB/s and the fastest around 4.7GiB/s (we’re already seeing decreased memory transfer speeds, even though the load per GPU is lower than in the previous example).

Notes:

  • GPU utilization is between 20% and 50%, hardly ever above 50%.
  • CPU utilization is around 25%.

High concurrency, high load

Now, when trying to achieve full load (again 10 GPUs, with MPS), I get really bad results, for example:

[trace screenshot]

Slowest copy throughput is 68MiB/s, the fastest is around 2GiB/s, but a lot of the transfers are closer to the slow end. The expected throughput is around 5GiB/s (as measured on many other non-dual-CPU, non-multi-GPU systems). In this scenario, most copies are in the 200MiB/s range, severely slowing down inference and preventing full utilization of the GPUs. The CPU seems overloaded in this case, even though the load is only around twice that of the previous scenario (?).

Notes:

  • GPU utilization is between 0% and 40%.
  • CPU utilization is around 90%-100%. (load avg. is 120/80)

System details

CPU:
  Info: 2x 20-core model: Intel Xeon Silver 4316 bits: 64 type: MT MCP SMP cache:
    L2: 2x 25 MiB (50 MiB)
  Speed (MHz): avg: 916 min/max: 800/3400 cores: 1: 801 2: 801 3: 800 4: 800 5: 800 6: 800
    7: 801 8: 801 9: 801 10: 801 11: 800 12: 801 13: 800 14: 801 15: 801 16: 800 17: 801 18: 801
    19: 800 20: 801 21: 801 22: 800 23: 801 24: 801 25: 800 26: 801 27: 801 28: 801 29: 2302
    30: 801 31: 801 32: 1890 33: 801 34: 801 35: 1085 36: 1403 37: 801 38: 1413 39: 1243 40: 801
    41: 800 42: 801 43: 801 44: 801 45: 801 46: 801 47: 801 48: 801 49: 801 50: 1648 51: 800
    52: 800 53: 800 54: 2301 55: 800 56: 800 57: 801 58: 801 59: 801 60: 801 61: 801 62: 801
    63: 801 64: 802 65: 800 66: 801 67: 801 68: 801 69: 1416 70: 801 71: 801 72: 1758 73: 800
    74: 800 75: 846 76: 801 77: 801 78: 1440 79: 897 80: 801
Graphics:
  Device-1: ASPEED Graphics Family driver: ast v: kernel
  Device-2: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-3: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-4: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-5: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-6: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-7: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-8: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-9: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-10: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Device-11: NVIDIA GA104GL [RTX A4000] driver: nvidia v: 525.60.11
  Display: server: No display server data found. Headless machine? tty: 238x51

Driver, CUDA, cuDNN, TRT versions

 * NVIDIA Driver with version 525.60
 * CUDA Toolkit with version 11.8 at /usr/local/cuda-11.8
 * cuDNN with version 8.7.0 at /usr
 * TensorRT with version: 8.5.2.2

System topology matters. You can’t have all GPUs copy in the same direction (e.g. D->H) and expect peak/maximum/“normal” throughput. There are inevitably bottlenecks (shared pipes) in such a system. Considerations of utilization don’t really matter here. What matters is the concurrency (or not) of transfers in the same direction and the system topology. PCIe link speed matters as well, as does whether or not the transfers involve pinned memory.

That makes sense, but even assuming a bottleneck, the transfer speeds I’m seeing are low. Say there is some shared bottleneck that imposes a hard limit on the total concurrent bandwidth. The worst single-GPU speed I’ve seen is around 2.4GiB/s, so even if 10 GPUs shared that worst-case bandwidth simultaneously, each should still get around 240MiB/s. Yet I’m seeing transfer speeds as low as 70MiB/s, and even 12MiB/s if I look through the entire trace.

Another one of the bad traces:

[trace screenshot]

Interestingly, it seems only about half the transfers are affected by this issue. Note that since I’m running only one model, all the red parts represent the same amount of data (around 21MB). Some are relatively fast, some are slow, and when they’re slow, they’re really slow.

It almost seems like a game of chance whether a cudaMemcpy is going to be fast or slow.

Small/short transfers can easily drop down to arbitrarily low levels of performance. The peak transfer speeds will only be seen for transfers that are megabytes in size or larger, and are to/from pinned memory.

I believe bandwidthTest runs a 32MB transfer by default. 21MB might be sliding down a bit on the efficiency curve.

Another factor that can affect apparent H<->D speeds in a 2-socket server is process placement/affinity. Transfers to a non-“affine” GPU may be additionally slower when travelling over the intersocket link. Concurrency over this link will also reduce perf.

Small/short transfers can easily drop down to arbitrarily low levels of performance. The peak transfer speeds will only be seen for transfers that are megabytes in size or larger, and are to/from pinned memory.

I believe bandwidthTest runs a 32MB transfer by default. 21MB might be sliding down a bit on the efficiency curve.

Could these factors, combined, cause the aggregate memory transfer speed (D->H or H->D) of 10 GPUs to drop below the bandwidth of a single GPU?

Another factor that can affect apparent H<->D speeds in a 2-socket server is process placement/affinity. Transfers to a non-“affine” GPU may be additionally slower when travelling over the intersocket link. Concurrency over this link will also reduce perf.

I assumed that MPS would handle affinity internally in the multi-GPU case. Is there any way to configure MPS to handle this?

Sure, I could concoct a case of very small transfers from multiple GPUs simultaneously that drops to arbitrarily low levels, much slower than 0.1 x peak

No, that is not part of MPS. MPS has no control over process placement.

Sure, I could concoct a case of very small transfers from multiple GPUs simultaneously that drops to arbitrarily low levels, much slower than 0.1 x peak

Okay, and does 21MB count as small? Because in all the traces I’ve posted above, the D->H transfers are 21MB (that is the model output size).

I tried to reproduce the issue by saturating all 10 GPUs using the bandwidth example, but I only saw transfer speeds drop by a factor of 0.5. When also stressing the CPU I got 0.1x in some cases, but still not as bad as in the traces. I’ll get back here with some data and a code example.

I don’t think 21MB is small. bandwidthTest gives you enough controls to characterize it if you wish. Linux also has process placement tools, so you could use something like numactl or taskset to additionally test the effect of affinity.

I’ve been trying to reproduce this issue and I think I’ve gotten somewhere. I’m using bandwidthTest.

Single GPU case

To start, I ran this command to get a baseline reading on a single GPU:

/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh

The results are:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A4000
 Range Mode

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			5125.6
   21000000			5187.4
   22000000			5186.1
   23000000			5184.9
   24000000			5177.7
   25000000			5188.1
   26000000			5182.6
   27000000			5191.7
   28000000			5182.7
   29000000			5192.6
   30000000			5191.4

Result = PASS

So around 5.2 GB/s, which is great.

Multi-GPU case

Then, I ran the test for each device simultaneously using this script:

for dev in $(seq 0 9); do
  /usr/local/cuda/extras/demo_suite/bandwidthTest --device=$dev --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh &
done
wait

It’s not perfect, but a quick glance at nvidia-smi shows that at least 8 GPUs were around 40% utilization at the same time, so there’s definitely some overlap, and the results show this as well:

(Just showing part of it because there is so much output:)

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			4280.2
   21000000			3949.6
   22000000			3902.3

Another one looks like this:

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			4448.5
   21000000			4516.0
   22000000			4628.8

Generally, I’m getting anywhere from 3.5GiB/s up to 5GiB/s.

Multi-GPU with high CPU load

I did a simple stress test with stress --cpu 160 -t 60 (all cores computing sqrt in a loop) and ran the multi-GPU test again, and got:

   20000000			2589.2
   21000000			2157.9
   22000000			2225.7
   23000000			1937.3

...

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			1967.4
   21000000			1592.4
   22000000			1618.9

The range is from 1.5GiB/s up to around 4GiB/s. Already we’re seeing a decrease in transfer speeds due to (unrelated) CPU load.

Multi-GPU with RAM load

And again with stress --vm 160 -t 60, which will run malloc/free in a loop on all cores:

Snippets from the results:

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			1037.3
   21000000			828.2
   22000000			842.9

...
   20000000			1618.3
   21000000			1476.4
   22000000			1139.5
   23000000			955.0

So host load seems to play a role here. I still did not get down to 12MiB/s transfer speeds, but I’m getting closer to what I’m seeing in the traces.

Do you have any idea how CPU load might play a role in this? Is it simply that the OS scheduler is not giving CUDA enough time to fetch data from the device?

Curiously, when tracing the bandwidthTest, nsys reports varying throughput (just like the traces from before) ranging from 200MiB/s to 5GiB/s on the high end. This seems very similar to what I’m seeing with TensorRT. bandwidthTest itself reports more uniform speeds, maybe because it averages a bunch of transfers?
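
One way to check that would be to time each copy individually instead of relying on bandwidthTest’s averaged numbers. Here is a minimal sketch I could use (pageable host buffer, size chosen to roughly match the 21MB output; everything else is illustrative):

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t bytes = 21 * 1000 * 1000;   // roughly the 21 MB model output
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    void* h_buf = malloc(bytes);             // pageable, like the transfers in the traces

    for (int i = 0; i < 50; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // blocking D->H copy
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        // Print per-copy throughput to expose the spread that an averaged number hides
        printf("copy %2d: %8.1f MiB/s\n", i, bytes / (1024.0 * 1024.0) / s);
    }

    cudaFree(d_buf);
    free(h_buf);
    return 0;
}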

This copy for example has a throughput of around 285MiB/s according to nsys:

[nsys trace screenshot]

A host-to-device or device-to-host transfer uses up bandwidth at both ends: it uses bandwidth on the CPU side and it uses bandwidth on the GPU side. Compared to PCIe bandwidth (e.g. ~12GB/s), the A4000’s device memory bandwidth of ~448 GB/s is probably immaterial.

But I don’t know the bandwidth available on your CPU side.

Speaking for myself, if I were embarking on this investigation, I would certainly get the system topology diagram. Also, for the current direction, the CPU model would be of interest, as well as perhaps memory config (whether all DIMM slots are populated, BIOS settings, etc.)

The transfer you are now showing is a pageable transfer. Those place “additional” stress on the host memory subsystem and may have other issues that impact measurement and performance. It’s recommended to use pinned buffers/transfers for best transfer throughput, and it’s possible that the pageable transfers are contributing to measurement variability and impacting performance.

I just reran the multi-GPU bandwidthTest with pinned instead of pageable memory and, interestingly enough, the copies are no longer affected by CPU load at all :). I’m going to try and see if I can get the same effect in TensorRT.

Use of the cudaHostAllocWriteCombined flag may be beneficial, particularly in the H->D direction.
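
For illustration, switching the host-side buffers from pageable to pinned could look roughly like this (buffer names and sizes are placeholders, not the actual application code; the write-combined flag is used only for the H->D input buffer, since write-combined memory is slow for the CPU to read back):

#include <cuda_runtime.h>

// Illustrative sizes; in the real application these come from the TensorRT bindings.
const size_t inputBytes  = 4  * 1000 * 1000;
const size_t outputBytes = 21 * 1000 * 1000;

void* h_input  = nullptr;
void* h_output = nullptr;

void allocateHostBuffers() {
    // Pinned + write-combined: efficient for the CPU to write and the GPU to read (H->D).
    cudaHostAlloc(&h_input, inputBytes, cudaHostAllocWriteCombined);
    // Plain pinned memory for the D->H output buffer, which the CPU needs to read.
    cudaHostAlloc(&h_output, outputBytes, cudaHostAllocDefault);
}

void freeHostBuffers() {
    cudaFreeHost(h_input);
    cudaFreeHost(h_output);
}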

What are the host system specifications? PAGEABLE transfers actually consist of a DMA copy across PCIe (using a pinned system memory buffer owned by the driver), followed by a system memory copy from the pinned buffer to the user process memory. For the kind of transfer sizes shown here, what you should see for a PCIe 3 interconnect on a x16 link is about 13 GB/sec for PINNED transfers (and about twice that for a PCIe 4 interconnect).

The speed of PAGEABLE transfers, which involve an additional copy from system memory to system memory, will depend on the performance of system memory, and the 5.2GB/sec shown here could indicate a bottleneck in system memory. I see more than 5.2 GB/sec bandwidth reported on my slightly older Skylake-based system. How many DDR4 channels per CPU socket are populated in this system, and what speed-grade of DDR4 is being used? A large high-end system (which is what appears to be used here) ideally would use 8 channels of DDR4-3200.

From perusing this thread (i.e. not reading it in detail), it seems the system may have multiple GPUs. In this case, make sure that each GPU is on a x16 PCIe link. Does the CPU provide enough PCIe lanes for this? If this is also a multi-CPU system, make sure to use processor and memory affinity control (e.g. with numactl) so that each GPU is communicating with the “near” CPU and the “near” system memory.

What are the host system specifications?

Xeon dual-CPU and 10 GPUs (see full specs in first post)

PAGEABLE transfers actually consist of a DMA copy across PCIe (using a pinned system memory buffer owned by the driver), followed by a system memory copy from the pinned buffer to the user process memory. For the kind of transfer sizes shown here, what you should see for a PCIe 3 interconnect on a x16 link is about 13 GB/sec for PINNED transfers (and about twice that for a PCIe 4 interconnect).

I’m seeing around 25GiB/s peak, but it does drop to 10MiB/s sometimes in multi-GPU scenarios.

The speed of PAGEABLE transfers, which involve an additional copy from system memory to system memory, will depend on the performance of system memory, and the 5.2GB/sec shown here could indicate a bottleneck in system memory.

That is actually the best case. I’m seeing as low as 12MiB/s in traces, or 200MiB/s in bandwidthTest, when the host system is under load. As said before though, that might be caused by the DMA copy (which is somehow affected by the host being busy).

I see more than 5.2 GB/sec bandwidth reported on my slightly older Skylake-based system. How many DDR4 channels per CPU socket are populated in this system, and what speed-grade of DDR4 is being used? A large high-end system (which is what appears to be used here) ideally would use 8 channels of DDR4-3200.

Yes, memory is DDR4-3200.

From perusing this thread (i.e. not reading it in detail), it seems the system may have multiple GPUs. In this case, make sure that each GPU is on a x16 PCIe link.

Yes, each one has its own x16 link.

Does the CPU provide enough PCIe lanes for this?

Dual-CPU with 64 lanes per CPU, and there’s a PLX switch that eventually connects every GPU with 16 lanes.

If this is also a multi-CPU system, make sure to use processor and memory affinity control (e.g. with numactl) so that each GPU is communicating with the “near” CPU and the “near” system memory.

This is a good one and I’ll be trying this out. Still figuring out exactly how to do this with MPS.
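
One approach I’m considering (untested sketch): since MPS does not control placement, bind each client process/thread to the socket nearest its GPU before initializing CUDA, e.g. with sched_setaffinity. The CPU ranges below are placeholders and would have to come from the real topology (nvidia-smi topo -m, or /sys/bus/pci/devices/<bdf>/local_cpulist):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // for sched_setaffinity / CPU_SET on glibc
#endif
#include <sched.h>
#include <cuda_runtime.h>

// Bind the calling thread to a contiguous CPU range before any CUDA call, so the
// CUDA context, driver staging buffers, and host-side copies stay on the near socket.
static int bindToCpuRange(int firstCpu, int lastCpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = firstCpu; cpu <= lastCpu; ++cpu)
        CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);  // 0 = calling thread
}

int main() {
    // Example: assume GPU 0 hangs off socket 0, whose cores are CPUs 0-19
    // (placeholder mapping -- verify against the actual topology).
    bindToCpuRange(0, 19);
    cudaSetDevice(0);            // context is created with the affinity already in place
    // ... create streams, load the TensorRT engine, run inference ...
    return 0;
}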

So per the thread-starting post, this system uses dual Intel® Xeon® Silver 4316 processors, each with 8 DDR4-3200 channels and 64 PCIe 4 lanes. Are all DDR4 channels populated?

64 PCIe lanes per CPU means there can be no more than four native PCIe 4 x16 links per socket. If the system presents more x16 links than that, the PLX chip would appear to multiplex them onto the physical lanes available.

It should not be so much the host CPU being busy as the system memory being busy (that is where the DMA transfers discussed here terminate). 8 channels of DDR4-3200 provide about 200 GB/sec of usable system memory bandwidth (run the STREAM benchmark to find out what is actually achieved).

Given the number of GPUs in the system, and assuming that they are all working at the same time and that not all GPU/host transfers use PINNED memory, and considering that the software running on the host system needs system memory bandwidth as well, the hypothesis of the system becoming bottlenecked on system memory throughput still seems plausible to me.

When configuring a system, keep in mind that PCIe is a full duplex interconnect. So a single PCIe 4 x16 link could saturate as much as 50 GB/sec (25 GB/sec each direction) of system memory bandwidth when doing nothing but data copies using PINNED system memory. If using PAGEABLE system memory, the bandwidth requirements would be even higher.
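
To put rough numbers on that (a back-of-the-envelope estimate using figures from this thread): a pageable D->H copy touches system memory at least twice, once for the DMA write into the driver’s pinned staging buffer and again for the copy out of that staging buffer into the user’s buffer. If all 10 GPUs sustained the single-GPU figure of ~5 GB/sec, that alone would be at least 10 x 5 x 2 = 100 GB/sec of system memory traffic, before counting any H->D transfers or the inference application’s own memory use, i.e. a large fraction of the per-socket system memory bandwidth discussed above.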

@gerwin This can be measured by running “nvidia-smi -q” while under load and looking at the “GPU Link Info” section of each GPU.

By the way, are you sure the system uses DDR4-3200 memory? Intel’s specifications for the Xeon Silver 4316 processor list DDR4-2667 (Intel Xeon Silver 4316 Processor 30M Cache 2.30 GHz Product Specifications), which would provide about 166 GB/sec of usable system memory bandwidth per socket.