I’ve been running some tests on my Tyan FT72-B7015, which is equipped with two hex-core X5650 processors and eight GTX-470 GPUs. While any processor core can communicate with any GPU, the GPUs themselves are partitioned amongst the processors. That is, four GPUs are “local” to the first processor, while the other four are “remote”. Accessing a remote GPU requires an extra hop over QPI between two Tylersburg controllers.
The system runs Ubuntu 10.04, and I am using the 260.24 developer driver.
I’ve been trying to characterize the time it takes to transfer data between main memory and the GPUs. The platform has a NUMA memory model, so locality of memory with respect to both the CPU and the GPU affects performance. I can use the Linux command “numactl” to bind a process to a particular CPU and to a particular NUMA memory pool. With a single process sending 1GB of data (several times, to arrive at averages) on an otherwise idle system, I find the following (a simplified sketch of the timing loop follows the numbers):
Local Memory and Local GPU: 228ms
Local Memory and Remote GPU: 221ms
Remote Memory and Local GPU: 347ms
Remote Memory and Remote GPU: 362ms
Let’s assume the discrepancy between local/local and local/remote is “in the noise” and focus on the remote/remote case. Why is this case so much slower? Why is it not as speedy as local/local? I suspect the data is taking the following long path through the system:
Processor 1’s Memory Pool -> Processor 1 -> Processor 0 -> Tylersburg 0 -> Tylersburg 1 -> Remote GPU
A better path would have been:
Processor 1’s Memory Pool -> Processor 1 -> Tylersburg 1 -> Remote GPU
I thought DMA was supposed to take care of this kind of business. Am I making too many assumptions about the hardware architecture’s capabilities? If so, is it a limitation of the Nehalem architecture or the Fermi architecture? Or is this an issue that can be fixed in the CUDA driver by making it NUMA aware?
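To make “NUMA aware” concrete, here is the kind of thing I have in mind at the application level: placing each host buffer on the node closest to the GPU it feeds, e.g. with libnuma instead of binding the whole process with numactl. The gpu_to_node mapping below is only my guess at this box’s topology, and whether this placement would actually help is exactly what I’m unsure about; treat it as a sketch, not a recommendation.

#include <numa.h>          // link with -lnuma
#include <cstddef>
#include <cstdio>

// Hypothetical mapping for this box: GPUs 0-3 behind Tylersburg 0 (node 0),
// GPUs 4-7 behind Tylersburg 1 (node 1).
static int gpu_to_node(int dev) { return (dev < 4) ? 0 : 1; }

// Allocate a host buffer on the NUMA node closest to the given GPU,
// independent of which socket the calling thread happens to run on.
static char* alloc_buffer_near_gpu(int dev, size_t bytes)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return NULL;
    }
    return (char*)numa_alloc_onnode(bytes, gpu_to_node(dev));
}

// The buffer is used with cudaMemcpy as usual and released with
// numa_free(buf, bytes).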
(Aside: by default Linux allocates pages on the node of the CPU that first touches them, and the scheduler is free to migrate a process between sockets, so without explicit binding your buffers can end up scattered across both memory pools. Anyone with a communication-bound CUDA application on a NUMA platform with local and remote GPUs may benefit from using numactl.)