CUDA, NUMA Memory, and "NUMA" GPUs

I’ve been running some tests on my Tyan FT72-B7015, which is equipped with two hex-core X5650 processors and eight GTX-470 GPUs. While any processor core can communicate with any GPU, the GPUs themselves are partitioned amongst the processors. That is, four GPUs are “local” to the first processor, while the other four are “remote”. Accessing a remote GPU requires an extra hop over QPI between two Tylersburg controllers.

The system runs Ubuntu 10.04 and I am using the dev driver 260.24.

I’ve been trying to characterize the time it takes to transfer data between main memory and the GPUs. The platform has a NUMA memory model, so locality of memory with respect to both the CPU and GPU affects performance. I can use the Linux command “numactl” to bind a process to a particular CPU as well as a particular NUMA memory pool. With a single process sending 1GB of data on an idle system (repeated several times to arrive at averages), I find the following:

Local Memory and Local GPU: 228ms
Local Memory and Remote GPU: 221ms
Remote Memory and Local GPU: 347ms
Remote Memory and Remote GPU: 362ms
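
To make these timings easier to compare against PCIe and QPI figures, here is a minimal sketch in C that converts them to effective throughput. It assumes “1GB” means 2^30 bytes (use 1e9 instead if decimal gigabytes were meant); the helper names gib_per_sec and print_throughput_table are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: "1GB" in the measurements means 2^30 bytes. */
#define BYTES_PER_GIB (1024.0 * 1024.0 * 1024.0)

/* Effective throughput in GiB/s for `bytes` bytes moved in `millis` ms. */
double gib_per_sec(double bytes, double millis)
{
    assert(millis > 0.0);
    return (bytes / BYTES_PER_GIB) / (millis / 1000.0);
}

/* Print the four measured cases together with their derived throughput. */
void print_throughput_table(void)
{
    static const struct { const char *label; double millis; } runs[] = {
        { "Local Memory and Local GPU",   228.0 },
        { "Local Memory and Remote GPU",  221.0 },
        { "Remote Memory and Local GPU",  347.0 },
        { "Remote Memory and Remote GPU", 362.0 },
    };
    size_t i;

    for (i = 0; i < sizeof runs / sizeof runs[0]; ++i) {
        printf("%-30s %3.0f ms  %.2f GiB/s\n",
               runs[i].label, runs[i].millis,
               gib_per_sec(BYTES_PER_GIB, runs[i].millis));
    }
}
```

Calling print_throughput_table() from main() gives roughly 4.4, 4.5, 2.9, and 2.8 GiB/s for the four cases, so the remote-memory runs lose about 1.5 GiB/s relative to the local-memory runs.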

Let’s assume the discrepancy between local/local and local/remote is “in the noise” and focus on the remote/remote case. Why is the transfer time so much greater? Why is it not as speedy as local/local? I suppose the data is taking the following long path through the system:
Processor 1’s Memory Pool -> Processor 1 -> Processor 0 -> Tylersburg 0 -> Tylersburg 1 -> Remote GPU

A better path would have been:
Processor 1’s Memory Pool -> Processor 1 -> Tylersburg 1 -> Remote GPU

I thought DMA was supposed to take care of this kind of business. Am I making too many assumptions about the hardware architecture’s capabilities? If so, is it a limitation of the Nehalem architecture or the Fermi architecture? Or is this an issue that can be fixed in the CUDA driver by making it NUMA aware?

(Aside: I believe Linux interleaves memory across NUMA pools by default. Anyone with a communication-bound CUDA application on a NUMA platform with local and remote GPUs may benefit from using numactl.)

Well, my performance issues go away when I replace malloc() with cudaHostAlloc(…, …, cudaHostAllocDefault). What is the difference between C’s malloc() and cudaHostAlloc(…, …, cudaHostAllocDefault)? What is it about malloc() that makes performance so poor?

cudaHostAlloc() allocates pinned (page-locked) memory, which in turn eliminates a host-side copy between the application and driver memory spaces. All PCIe copies are DMA transactions, and as such they must use host memory that cannot be paged out to disk.

Regarding your motherboard, is it a single- or dual-IOH motherboard? If both CPU sockets are directly connected to the same IOH (single-IOH solution), all CPU-GPU pairs are “local.” However, if you have a motherboard with two IOH chips, and GPUs connected to each, then you have “local” and “remote” CPU-GPU pairs. In my experience, going through an extra QPI hop for the remote pairs costs 1-2 GB/s in throughput. While I expect latency to increase, I don’t understand why throughput falls - QPI links have more bandwidth than PCIe.

Assuming NUMA is turned on in system BIOS, you can use numactl to force memory allocation on a specific socket (or socket local to the thread that’s doing allocation) or to interleave the allocations between the sockets. Look at the --membind and --interleave options to numactl.

I see. I was getting “portable” mixed up with “pinned.” I didn’t realize that cudaHostAlloc() only allocated pinned memory. I’m surprised that this can make such a significant difference compared to plain old malloc(), since my system has 24GB of memory (I never allocated more than 8GB at any time in my experiments). I didn’t know that DMA-assisted transfers were not possible with pageable memory, even if all the data is resident in memory. But hardware is dumb and probably doesn’t know what the OS is doing, so I guess this makes sense.

I have a dual IOH system. There are six CPU cores in each of the two NUMA nodes. Each node has four local GPUs. To test the extra QPI hop, I set up four processes in each NUMA node, each sending 1GB of data to a different remote GPU. The extra QPI hop was then handling four memory streams in each direction. It appears throughput was affected. In this scenario, it took ~626ms for a memory transaction to complete. In a similar scenario, where data was sent to local GPUs instead of remote GPUs, the time was only ~350ms.
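
As a rough way to read those two numbers, the sketch below estimates the aggregate throughput per direction and the slowdown factor. It assumes the four transfers in each direction overlap completely, that each takes the reported time, and that “1GB” means 1 GiB; the function names are hypothetical.

```c
#include <assert.h>

/* Aggregate throughput in GiB/s when `streams` concurrent transfers of
 * `gib_each` GiB all finish in `millis` ms.
 * Assumption: the transfers overlap fully in time. */
double aggregate_gib_per_sec(int streams, double gib_each, double millis)
{
    assert(streams > 0 && millis > 0.0);
    return (streams * gib_each) / (millis / 1000.0);
}

/* How many times slower the remote case is than the local case. */
double slowdown(double remote_millis, double local_millis)
{
    assert(local_millis > 0.0);
    return remote_millis / local_millis;
}
```

With the figures above, aggregate_gib_per_sec(4, 1.0, 626.0) is about 6.4 GiB/s in each direction across the QPI hop, aggregate_gib_per_sec(4, 1.0, 350.0) is about 11.4 GiB/s per node for the local case, and slowdown(626.0, 350.0) is about 1.8.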

Yes, I’ve been playing with this tool. Very handy.

Are there any tricks to making transfers with malloc() faster, besides numactl? I’ve noticed that performance degrades once the transfer size exceeds the memory page size.

Finally, my tests show receiving (pinned) data takes 1.25x longer than sending data. Can anyone comment on why this is?