Slow memcpy performance in dual-CPU, 10 GPU system

Small/short transfers can easily drop to arbitrarily low levels of performance. Peak transfer speeds will only be seen for transfers that are megabytes in size or larger, and that go to/from pinned memory.

I believe bandwidthTest runs a 32MB transfer by default. 21MB might be sliding down a bit on the efficiency curve.

Could these factors combined cause the combined memory transfer speeds (D->H or H->D) of 10 GPUs to drop below the bandwidth of a single GPU?

Another factor that can affect apparent H<->D speeds in a 2-socket server is process placement/affinity. Transfers to a non-“affine” GPU may be additionally slowed by travelling over the intersocket link, and concurrency over this link will also reduce performance.

I assumed that MPS would handle affinity internally in the multi-GPU case. Is there any way to configure MPS to handle this?

Sure, I could concoct a case of very small transfers from multiple GPUs simultaneously that drops to arbitrarily low levels, much slower than 0.1 x peak

No, that is not part of MPS. MPS has no control over process placement.

Sure, I could concoct a case of very small transfers from multiple GPUs simultaneously that drops to arbitrarily low levels, much slower than 0.1 x peak

Okay, and does 21MB count as small? Because in all the traces I’ve posted above, the D->H transfers are 21MB (that is the model output).

I tried to reproduce the issue by trying to saturate all 10 GPUs using the bandwidthTest example, but I was only able to see transfer speeds drop by a factor of 0.5. When stressing the CPU I got 0.1x in some cases, but not as bad as the traces. I’ll get back here with some data and a code example.

I don’t think 21MB is small. bandwidthTest gives you enough controls to characterize it if you wish. Linux also has process placement tools, so you could use something like numactl or taskset to additionally test the effect of affinity.
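
As a concrete starting point, here is a minimal sketch (mine, not code from this thread) that maps each CUDA device to the NUMA node it hangs off of, so you know which node to pass to numactl. The sysfs lookup assumes Linux, and the file/program name is hypothetical.

// gpu_numa.cu - print the NUMA node each CUDA device is attached to.
// Build with: nvcc -o gpu_numa gpu_numa.cu
#include <cstdio>
#include <cctype>
#include <fstream>
#include <string>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        char busId[32] = {0};
        cudaDeviceGetPCIBusId(busId, sizeof(busId), dev);   // e.g. "0000:4F:00.0"
        std::string id(busId);
        for (char &c : id) c = std::tolower(c);              // sysfs paths use lowercase hex
        std::ifstream f("/sys/bus/pci/devices/" + id + "/numa_node");
        int node = -1;
        f >> node;                                           // -1 means no NUMA info exposed
        std::printf("GPU %d (%s) -> NUMA node %d\n", dev, busId, node);
    }
    return 0;
}

With the node known, something like numactl --cpunodebind=<node> --membind=<node> ./bandwidthTest --device=<dev> ... pins both the process and its pageable allocations to the memory closest to that GPU.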

I’ve been trying to reproduce this issue and I think I’ve gotten somewhere. I’m using bandwidthTest.

Single GPU case

To start, I ran this command to get a baseline reading on a single GPU:

/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh

The results are:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA RTX A4000
 Range Mode

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			5125.6
   21000000			5187.4
   22000000			5186.1
   23000000			5184.9
   24000000			5177.7
   25000000			5188.1
   26000000			5182.6
   27000000			5191.7
   28000000			5182.7
   29000000			5192.6
   30000000			5191.4

Result = PASS

So around 5.2 GB/s, which is great.

Multi-GPU case

Then, I ran the test for each device simultaneously using this script:

/usr/local/cuda/extras/demo_suite/bandwidthTest --device=0 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=1 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=2 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=3 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=4 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=5 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=6 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=7 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=8 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh & \
/usr/local/cuda/extras/demo_suite/bandwidthTest --device=9 --memory=pageable --mode=range --start=20000000 --end=30000000 --increment=1000000 --dtoh

It’s not perfect, but a quick glance at nvidia-smi shows that at least 8 GPUs were around 40% utilization at the same time, so there’s definitely some overlap, and the results show this as well:

(Just showing part of it because there is so much output:)

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			4280.2
   21000000			3949.6
   22000000			3902.3

Another one looks like this:

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			4448.5
   21000000			4516.0
   22000000			4628.8

Generally, I’m getting anywhere from 3.5 GB/s up to 5 GB/s.

Multi-GPU with high CPU load

I did a simple stress test with stress --cpu 160 -t 60 (all cores will be doing sqrt on repeat), ran the multi-GPU test again, and got:

   20000000			2589.2
   21000000			2157.9
   22000000			2225.7
   23000000			1937.3

...

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			1967.4
   21000000			1592.4
   22000000			1618.9

The range is between 1.5 GB/s and around 4 GB/s. Already we’re seeing a decrease in transfer speeds due to (unrelated) CPU load.

Multi-GPU with RAM load

And again with stress --vm 160 -t 60, which will run malloc/free in a loop on all cores:

Snippets from the results:

   Transfer Size (Bytes)	Bandwidth(MB/s)
   20000000			1037.3
   21000000			828.2
   22000000			842.9

...
   20000000			1618.3
   21000000			1476.4
   22000000			1139.5
   23000000			955.0

So CPU usage seems to play a role here. I still did not get 12 MiB/s transfer speeds, but I’m getting closer to what I’m seeing in the traces.

Do you have any idea how CPU load might play a role in this? Is it simply that the OS scheduler is not giving CUDA enough time to fetch data from the device?

Curiously, when tracing bandwidthTest, nsys reports varying throughput (just like the traces from before), ranging from 200 MiB/s up to 5 GiB/s. This seems very similar to what I’m seeing with TensorRT. bandwidthTest itself reports more uniform speeds, maybe because it averages a bunch of transfers?

This copy for example has a throughput of around 285MiB/s according to nsys:

[Screenshot from 2023-01-17 19-27-04: nsys trace of the ~285 MiB/s copy]

A host-to-device or device-to-host transfer uses up bandwidth at both ends: it uses bandwidth on the CPU side and it uses bandwidth on the GPU side. Compared to PCIe bandwidth (e.g. ~12 GB/s), the A4000’s device memory bandwidth of ~448 GB/s is probably immaterial.

But I don’t know the bandwidth available on your CPU side.

Speaking for myself, if I were embarking on this investigation, I would certainly get the system topology diagram. Also, for the current direction, the CPU model would be of interest, as well as perhaps the memory config (whether all DIMM slots are populated, BIOS settings, etc.).

The transfer you are now showing is a pageable transfer. Those place “additional” stress on the host memory subsystem and may have other issues that impact measurement and performance. It’s recommended to use pinned buffers/transfers for best transfer throughput, and it’s possible that the pageable transfers are contributing to measurement variability and impacting performance.
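
For reference, a minimal sketch (my own illustration, not code from this thread) of what a pinned D->H copy looks like at the CUDA API level; the size is just illustrative:

#include <cuda_runtime.h>

int main() {
    const size_t bytes = 21u * 1000u * 1000u;   // ~21 MB, roughly the model output size
    void *d_buf = nullptr, *h_pinned = nullptr;

    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);           // page-locked (pinned) host allocation

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With a pinned destination this is a single DMA across PCIe; with pageable
    // memory the driver stages it through its own pinned buffer plus a host memcpy.
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}

bandwidthTest exercises the same path when run with --memory=pinned instead of --memory=pageable.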

I just reran the bandwidthTest multi-GPU with pinned instead of pageable and, interestingly enough, copies are no longer affected by CPU load at all :). I’m going to try and see if I can get the same effect in TensorRT.

Use of the cudaHostAllocWriteCombined flag may be beneficial, particularly in the H->D direction.
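
A minimal sketch of what that might look like (again my own illustration, not the thread’s code). Note that write-combined memory is fast for the device to read across PCIe but slow for the CPU to read back, so it only makes sense for buffers the host writes and the device consumes:

#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 21u * 1000u * 1000u;
    void *d_buf = nullptr, *h_wc = nullptr;

    cudaMalloc(&d_buf, bytes);
    cudaHostAlloc(&h_wc, bytes, cudaHostAllocWriteCombined);  // pinned + write-combined

    std::memset(h_wc, 0, bytes);                              // host writes the staging buffer...
    cudaMemcpy(d_buf, h_wc, bytes, cudaMemcpyHostToDevice);   // ...the device reads it over PCIe

    cudaFreeHost(h_wc);
    cudaFree(d_buf);
    return 0;
}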

What are the host system specifications? PAGEABLE transfers actually consist of a DMA copy across PCIe (using a pinned system memory buffer owned by the driver), followed by a system memory copy from the pinned buffer to the user process memory. For the kind of transfer sizes shown here, what you should see for a PCIe 3 interconnect on a x16 link is about 13 GB/sec for PINNED transfers (and about twice that for a PCIe 4 interconnect).

The speed of PAGEABLE transfers, which involve an additional copy from system memory to system memory, will depend on the performance of system memory, and the 5.2GB/sec shown here could indicate a bottleneck in system memory. I see more than 5.2 GB/sec bandwidth reported on my slightly older Skylake-based system. How many DDR4 channels per CPU socket are populated in this system, and what speed-grade of DDR4 is being used? A large high-end system (which is what appears to be used here) ideally would use 8 channels of DDR4-3200.

From perusing this thread (i.e. no reading in detail), it seems the system may have multiple GPUs. In this case, make sure that each GPU is on a x16 PCIe link. Does the CPU provide enough PCIe lanes for this? If this is also a multi-CPU system make sure to use processor and memory affinity control (e.g. with numactl) to make sure each GPU is communicating with the “near” CPU and the “near” system memory.

What are the host system specifications?

Xeon dual-CPU and 10 GPUs (see full specs in first post)

PAGEABLE transfers actually consist of a DMA copy across PCIe (using a pinned system memory buffer owned by the driver), followed by a system memory copy from the pinned buffer to the user process memory. For the kind of transfer sizes shown here, what you should see for a PCIe 3 interconnect on a x16 link is about 13 GB/sec for PINNED transfers (and about twice that for a PCIe 4 interconnect).

I’m seeing around 25 GiB/s peak, but it does sometimes drop to 10 MiB/s in multi-GPU scenarios.

The speed of PAGEABLE transfers, which involve an additional copy from system memory to system memory, will depend on the performance of system memory, and the 5.2GB/sec shown here could indicate a bottleneck in system memory.

That is actually the best case. I’m seeing as low as 12 MiB/s in traces, or 200 MiB/s in the bandwidthTest, when the host system is under load. As said before though, that might actually be caused by the DMA copy (which is somehow affected by the host being busy).

I see more than 5.2 GB/sec bandwidth reported on my slightly older Skylake-based system. How many DDR4 channels per CPU socket are populated in this system, and what speed-grade of DDR4 is being used? A large high-end system (which is what appears to be used here) ideally would use 8 channels of DDR4-3200.

Yes, the memory is DDR4-3200.

From perusing this thread (i.e. no reading in detail), it seems the system may have multiple GPUs. In this case, make sure that each GPU is on a x16 PCIe link.

Yes, each one has its own x16 link.

Does the CPU provide enough PCIe lanes for this?

Dual-CPU with 64 lanes per CPU, and there’s a PLX switch that eventually connects every GPU with 16 lanes.

If this is also a multi-CPU system make sure to use processor and memory affinity control (e.g. with numactl) to make sure each GPU is communicating with the “near” CPU and the “near” system memory.

This is a good one and I’ll be trying this out. Still figuring out exactly how to do this with MPS.

So per thread-starting post this system uses dual Intel® Xeon® Silver 4316 Processors, each with 8 DDR4-3200 channels and 64 PCIe 4 lanes. Are all DDR4 channels populated?

64 PCIe lanes per CPU means there can be no more than four native PCIe 4 x16 links per socket. If the system presents more x16 links than that, the PLX chip would appear to multiplex them onto the physical lanes available.

It should not be so much the host CPU being busy as the system memory being busy (that is where the DMA transfers discussed here terminate). 8 channels of DDR4-3200 provide about 200 GB/sec of usable system memory bandwidth (run the STREAM benchmark to find out what is actually achieved).
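
(For reference, the back-of-the-envelope arithmetic behind that ~200 GB/sec figure, before real-world efficiency losses:)

$$ 8\ \text{channels} \times 3200\ \text{MT/s} \times 8\ \text{B per transfer} \approx 204.8\ \text{GB/s peak} $$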

Given the number of GPUs in the system, and assuming that they are all working at the same time and that not all GPU/host transfers use PINNED memory, and considering that the software running on the host system needs system memory bandwidth as well, the hypothesis of the system becoming bottlenecked on system memory throughput still seems plausible to me.

When configuring a system, keep in mind that PCIe is a full duplex interconnect. So a single PCIe 4 x16 link could saturate as much as 50 GB/sec (25 GB/sec each direction) of system memory bandwidth when doing nothing but data copies using PINNED system memory. If using PAGEABLE system memory, the bandwidth requirements would be even higher.

@gerwin This can be measured by running “nvidia-smi -q” while under load and looking at the “GPU Link Info” section for each GPU.

By the way, are you sure the system uses DDR4-3200 memory? Intel’s specifications for the Xeon Silver 4316 processor list DDR4-2667 (Intel Xeon Silver 4316 Processor 30M Cache 2.30 GHz Product Specifications), which would provide about 166 GB/sec of usable system memory bandwidth per socket.

So per thread-starting post this system uses dual Intel® Xeon® Silver 4316 Processors, each with 8 DDR4-3200 channels and 64 PCIe 4 lanes. Are all DDR4 channels populated?

Yes, correct. I think all channels are populated. There are 64 RAM slots and 8 of them are filled according to the manual.

If the system presents more x16 links than that, the PLX chip would appear to multiplex them onto the physical lanes available.

This is indeed the case.

What is the actual link configuration used by the RTX A4000?

Timestamp                                 : Wed Jan 18 09:35:21 2023
Driver Version                            : 525.60.11
CUDA Version                              : 12.0

Attached GPUs                             : 10

GPU 00000000:4F:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:52:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:53:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
       
GPU 00000000:56:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:57:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:CE:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D1:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D2:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D5:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D6:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

By the way, are you sure the system uses DDR4-3200 memory? Intel’s specifications for the Xeon Silver 4316 processor list DDR4-2667 (Intel Xeon Silver 4316 Processor 30M Cache 2.30 GHz Product Specifications), which would provide about 166 GB/sec of usable system memory bandwidth per socket.

You’re right! I was looking at the reported “Speed” in dmidecode but apparently there’s a section “Configured Speed” and the reported speed there is 2666.

I see two points of resource contention in this system. The ten GPUs need more PCIe 4 lanes (160) than are provided by the two CPUs (128), so the PLX chip(s) must mux/demux. How efficient that is, I cannot say, as I don’t have experience with platforms using the PLX chip.

The other resource contention is access to system memory, which likely gets hammered by the GPUs when there is lot of data movement between GPUs and host (likely because the GPU memory is small at 16 GB), while at the same time the software running on the host also has memory bandwidth needs.

I would expect that even with optimal configuration of CPU and memory affinity and scheduling policy this resource contention will impact system performance negatively; how much of a problem that is I cannot quantify.

If I had configured this system, I would have limited it to eight GPUs to match the number of required PCIe lanes to the number provided by the CPUs, and would have chosen a CPU variant that (1) offers the maximum bandwidth available with DDR4 (the eight-channel DDR4-3200 setup already mentioned) and (2) provides significantly higher single-thread performance (with GPU-accelerated applications, the serial portion of code remaining on the CPU can become a bottleneck in practice). In general, I recommend CPUs with base frequencies around 3.5 GHz.

I fixed the code to use pinned buffers and now I’m seeing this:

[Screenshot: nsys trace showing cudaMallocHost/cudaFreeHost calls taking many milliseconds]

This shows it wasn’t GPU transfer speeds at all. Instead, cudaMallocHost and cudaFreeHost are just sitting there taking many milliseconds to complete. I’m guessing this was the underlying problem all along (but when using pageable memory this was hidden). As you can see, cudaMallocHost is locking on something and doing a bunch of ioctls. I’m not sure how to interpret the trace…

EDIT: Btw, I’m going to change the code to not allocate/deallocate for every invocation and will get back with the results of that.

If you search this forum you will find many questions (and answers) about the performance of cudaMallocHost. The gist of it is: the vast majority of the time in this function is spent in operating system API calls; the CUDA API is basically a thin wrapper around that. Also: this kind of OS activity is largely single-threaded, and because there is resource contention between processes it usually involves a big fat lock, or even multiple ones.

Since we cannot change the OS code, the practical knobs to turn are (1) use of a CPU with high single-thread performance, which to first order means a high base clock (see my 3.5 GHz recommendation), and (2) as a distant second, use of low-latency, high-throughput system memory, which boils down to deploying the fastest speed grade of DRAM available and as many channels of DDR4 (soon DDR5) as one can afford. Nowadays, eight-channel memory configurations aren’t ridiculously expensive any more.

For CUDA application-level programming, the basic rule is: since allocating and freeing memory is fairly expensive, it should be done infrequently. Reuse buffers as much as possible, use an app-level sub-allocator, etc.
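
A rough sketch of that pattern (my own illustration of the advice, not the actual TensorRT integration): allocate the pinned staging buffers and device buffers once, and keep only copies on the per-inference path.

#include <cuda_runtime.h>

// Allocated once at startup and reused for every inference.
struct IoBuffers {
    void *h_in = nullptr, *h_out = nullptr;   // pinned host staging buffers
    void *d_in = nullptr, *d_out = nullptr;   // device buffers
    cudaStream_t stream{};

    IoBuffers(size_t inBytes, size_t outBytes) {
        cudaMallocHost(&h_in, inBytes);       // expensive OS work happens here, once
        cudaMallocHost(&h_out, outBytes);
        cudaMalloc(&d_in, inBytes);
        cudaMalloc(&d_out, outBytes);
        cudaStreamCreate(&stream);
    }
    ~IoBuffers() {
        cudaStreamDestroy(stream);
        cudaFree(d_out);  cudaFree(d_in);
        cudaFreeHost(h_out);  cudaFreeHost(h_in);
    }
};

// Per-inference path: only copies and enqueued work, no allocations.
void runOnce(IoBuffers &io, size_t inBytes, size_t outBytes) {
    cudaMemcpyAsync(io.d_in, io.h_in, inBytes, cudaMemcpyHostToDevice, io.stream);
    // ... enqueue the inference work on io.stream here ...
    cudaMemcpyAsync(io.h_out, io.d_out, outBytes, cudaMemcpyDeviceToHost, io.stream);
    cudaStreamSynchronize(io.stream);
}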

It seems you’re right. Even when getting rid of allocation and deallocation, I still can’t saturate the GPU, but this time it is due to the memcpy I’m doing into the pinned host buffer. I guess I’m dealing with CPU/memory contention.