Slow memcpy performance in dual-CPU, 10 GPU system

So per the thread-starting post, this system uses dual Intel® Xeon® Silver 4316 processors, each with 8 DDR4-3200 channels and 64 PCIe 4 lanes. Are all DDR4 channels populated?

Yes, correct. I think all channels are populated. There are 64 RAM slots and 8 of them are filled according to the manual.

If the system presents more x16 links than the CPUs provide lanes for, the PLX chip would appear to multiplex them onto the physical lanes available.

This is indeed the case.

What is the actual link configuration used by the RTX A4000?

Timestamp                                 : Wed Jan 18 09:35:21 2023
Driver Version                            : 525.60.11
CUDA Version                              : 12.0

Attached GPUs                             : 10

GPU 00000000:4F:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:52:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:53:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x
       
GPU 00000000:56:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:57:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:CE:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D1:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D2:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D5:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

GPU 00000000:D6:00.0
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 16x

By the way, are you sure the system uses DDR4-3200 memory? Intel’s specifications for the Xeon Silver 4316 processor list DDR4-2667 (Intel Xeon Silver 4316 Processor 30M Cache 2.30 GHz Product Specifications), which would provide about 166 GB/sec of usable system memory bandwidth per socket.
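
For reference, the back-of-the-envelope arithmetic behind that figure (assuming the usual 8 bytes per channel per transfer):

8 channels × 2666 MT/s × 8 bytes/transfer ≈ 170 GB/sec theoretical peak per socket

Sustained (usable) bandwidth is some fraction of that theoretical peak.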

You’re right! I was looking at the “Speed” field reported by dmidecode, but apparently there is also a “Configured Speed” field, and the speed reported there is 2666.

I see two points of resource contention in this system. The ten GPUs need more PCIe 4 lanes (160) than are provided by the two CPUs (128), so the PLX chip(s) must mux/demux. How efficient that is, I cannot say, as I don’t have experience with platforms using the PLX chip.

The other point of contention is access to system memory, which likely gets hammered by the GPUs when there is a lot of data movement between GPUs and host (likely because the GPU memory is small at 16 GB), while at the same time the software running on the host has its own memory bandwidth needs.

I would expect that even with optimal configuration of CPU and memory affinity and scheduling policy this resource contention will impact system performance negatively; how much of a problem that is I cannot quantify.
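
One way to get a rough feel for how much these two bottlenecks cost in practice is to measure host-to-device bandwidth on all GPUs concurrently and compare the aggregate against what a single GPU achieves in isolation (for example by restricting a run with CUDA_VISIBLE_DEVICES). Below is a minimal sketch of such a measurement, not code from this thread; the 256 MiB transfer size and repetition count are arbitrary choices, and error checking is omitted for brevity.

```cpp
// Sketch: aggregate host-to-device bandwidth with all GPUs transferring at once.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB per transfer (arbitrary)
    const int reps = 10;
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    std::vector<void*> hbuf(ngpus), dbuf(ngpus);
    std::vector<cudaStream_t> stream(ngpus);
    std::vector<cudaEvent_t> start(ngpus), stop(ngpus);

    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaMallocHost(&hbuf[i], bytes);   // pinned staging buffer per GPU
        cudaMalloc(&dbuf[i], bytes);
        cudaStreamCreate(&stream[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
    }

    // Launch transfers on all GPUs at once so they compete for
    // PCIe switch uplinks and system memory bandwidth.
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaEventRecord(start[i], stream[i]);
        for (int r = 0; r < reps; ++r)
            cudaMemcpyAsync(dbuf[i], hbuf[i], bytes,
                            cudaMemcpyHostToDevice, stream[i]);
        cudaEventRecord(stop[i], stream[i]);
    }

    double total = 0.0;
    for (int i = 0; i < ngpus; ++i) {
        cudaSetDevice(i);
        cudaEventSynchronize(stop[i]);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[i], stop[i]);
        double gbps = (double)bytes * reps / (ms * 1e6);
        total += gbps;
        printf("GPU %d: %.1f GB/s\n", i, gbps);
    }
    printf("approximate aggregate: %.1f GB/s\n", total);
    return 0;
}
```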

If I had configured this system, I would have limited it to eight GPUs to match the number of required PCIe lanes to the number provided by the CPUs, and would have chosen a CPU variant that (1) offers the maximum bandwidth available with DDR4 (the eight-channel DDR4-3200 setup already mentioned) and (2) provides significantly higher single-thread performance (with GPU-accelerated applications, the serial portion of the code remaining on the CPU can become a bottleneck in practice). In general, I recommend CPUs with base frequencies around 3.5 GHz.

I fixed the code to use pinned buffers and now I’m seeing this:

[screenshot: profiler trace showing time spent in cudaMallocHost / cudaFreeHost]

This shows it wasn’t GPU transfer speeds at all. Instead, cudaMallocHost and cudaFreeHost are just sitting there, taking many milliseconds to complete. I’m guessing this was the underlying problem all along (it was just hidden when using pageable memory). As you can see, cudaMallocHost is waiting on a lock and issuing a bunch of ioctl calls. I’m not sure how to interpret the trace…

EDIT: Btw, I’m going to change the code to not allocate/deallocate for every invocation and will get back with the results of that.

If you search this forum you will find many questions (and answers) about the performance of cudaMallocHost. The gist of it is: the vast majority of the time in this function is spent in operating-system API calls; the CUDA API is basically a thin wrapper around those. Also, this kind of OS activity is largely single-threaded, and because there is resource contention between processes, it usually involves a big fat lock, or even multiple ones.
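
If you want to see where the time goes on your machine, a tiny timing harness like the sketch below (not a rigorous benchmark; the 256 MiB size is an arbitrary choice) typically shows cudaMallocHost and cudaFreeHost taking milliseconds, with the cost growing roughly in proportion to the allocation size.

```cpp
// Sketch: time cudaMallocHost / cudaFreeHost for one large pinned allocation.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB, arbitrary
    using clk = std::chrono::steady_clock;

    cudaFree(0);                         // create the CUDA context up front so
                                         // its cost isn't charged to cudaMallocHost

    void* p = nullptr;
    auto t0 = clk::now();
    cudaMallocHost(&p, bytes);           // pins pages via OS calls
    auto t1 = clk::now();
    cudaFreeHost(p);
    auto t2 = clk::now();

    auto ms = [](clk::time_point a, clk::time_point b) {
        return std::chrono::duration<double, std::milli>(b - a).count();
    };
    printf("cudaMallocHost: %.2f ms, cudaFreeHost: %.2f ms\n",
           ms(t0, t1), ms(t1, t2));
    return 0;
}
```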

Since we cannot change the OS code, the practical knobs to turn are (1) use of a CPU with high single-thread performance, which to first order means a high base clock (see my 3.5 GHz recommendation), and (2), a distant second, use of low-latency, high-throughput system memory, which boils down to deploying the fastest speed grade of DRAM available and as many channels of DDR4 (soon DDR5) as one can afford. Nowadays, eight-channel memory configurations aren’t ridiculously expensive any more.

For CUDA application-level programming, the basic rule is: since allocating and freeing memory is fairly expensive, it should be done infrequently. Reuse buffers as much as possible, use an app-level sub-allocator, etc.
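
As a concrete illustration of the reuse idea, here is a minimal sketch of a pinned-buffer pool (the class name and interface are made up for the example, not taken from any library): all cudaMallocHost / cudaFreeHost calls happen at startup and shutdown, and per-invocation code only acquires and releases buffers.

```cpp
// Sketch: trivial pool of pinned host buffers, allocated once and reused.
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

class PinnedPool {
public:
    PinnedPool(size_t bufBytes, int count) : bytes_(bufBytes) {
        for (int i = 0; i < count; ++i) {
            void* p = nullptr;
            cudaMallocHost(&p, bufBytes);    // expensive; done only at startup
            free_.push_back(p);
        }
    }
    ~PinnedPool() {
        for (void* p : free_) cudaFreeHost(p);   // done only at shutdown
    }
    void* acquire() {
        if (free_.empty()) return nullptr;   // pool exhausted; handle in caller
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(void* p) { free_.push_back(p); }
    size_t bufferBytes() const { return bytes_; }
private:
    size_t bytes_;
    std::vector<void*> free_;
};
```

A real implementation would add a mutex around acquire/release for multi-threaded producers, but the essential point is that the expensive OS-level pinning happens only once.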

It seems you’re right. Even after getting rid of the allocation and deallocation, I still can’t saturate the GPU, but this time it is due to the memcpy I’m doing into the pinned host buffer. I guess I’m dealing with CPU/memory contention.
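
One idea I’m considering (just a sketch; whether it helps depends on where the data comes from) is to skip that intermediate memcpy entirely: either have the producer write directly into the pinned buffer, or pin the producer’s existing, long-lived buffer once with cudaHostRegister and pass it straight to cudaMemcpyAsync. Registration is itself expensive, so it only pays off if the buffer lives a long time.

```cpp
// Sketch: pin an existing, long-lived host buffer once with cudaHostRegister
// so it can feed cudaMemcpyAsync directly, without a staging memcpy.
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64ull << 20;            // 64 MiB, arbitrary
    void* host = std::malloc(bytes);             // producer's existing buffer
    void* dev = nullptr;
    cudaMalloc(&dev, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Pin once, up front (this call is expensive, like cudaMallocHost).
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    // Per invocation: the producer fills `host` in place, then we copy
    // asynchronously; no extra memcpy into a separate pinned buffer.
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaHostUnregister(host);                    // once, at teardown
    cudaFree(dev);
    cudaStreamDestroy(stream);
    std::free(host);
    return 0;
}
```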