Performance of Pageable Memory on GTX 670 with CUDA on New Xeon System

I bought a system for a researcher to use for experimenting with CUDA under Linux. The system has an Intel W2600CR motherboard with dual Xeon E5-2620 @ 2 GHz, 64 GB of DDR3-1600 memory, and four ASUS GTX 670 cards with 2 GB of GDDR5 memory each. He is seeing a pageable-memory performance problem with CUDA. If I run CUDA-Z on this system, I get these results:

Memory Copy
Host Pinned to Device: 5881.56 MiB/s
Host Pageable to Device: 2615.68 MiB/s
Device to Host Pinned: 4585.86 MiB/s
Device to Host Pageable: 2089.42 MiB/s
Device to Device: 69.1433 GiB/s

If I take this same card and put it in an Intel DX58S0 motherboard with a single Core i7-950 @ 3 GHz and 4 GB of memory (from 2009!!), I get these numbers:

Memory Copy
Host Pinned to Device: 5898.45 MiB/s
Host Pageable to Device: 4253.14 MiB/s
Device to Host Pinned: 6053.43 MiB/s
Device to Host Pageable: 4022.74 MiB/s
Device to Device: 70.279 GiB/s

Why would the pageable performance differ so much between systems, while "pinned" performance is fine? Could this have anything to do with the 3 GHz processor in the (older) Core i7 versus the 2 GHz processors in the newer Xeon? The memory in the Core i7 system is actually slower. Both cards are running PCIe 2.0 in an x16 slot. The W2600CR does support PCIe 3.0, but its C600-A chipset is based on X79, which has known problems with PCIe 3.0 and the 670, so I've given up on PCIe 3.0.

I’ve had our vendor going back and forth with Intel, but absolutely nothing is coming of it, and I’m stuck. We can’t return the system. The last response from Intel (the "we don’t know why" response) was: "If the video card is not validated or compatible with the Intel® Server Board, the Bus communications will affected and it will create low performance." This doesn’t help at all.

Memory in both systems is optimized.

CUDA-Z and bandwidthTest --memory=pageable show similar results.

Likewise, if I take the GTX 580 that was in the Core i7 system and move it to the Xeon system, its performance is affected in the same way.

In addition, today I was dismayed to find that while the W2600CR motherboard has four x16 slots, only one of them is x16 electrical; the others are x8 electrical. For this testing I’m only using the x16 slot, but when I run the test in the other slots, the results are almost identical. Shouldn’t the real x16 slot give better performance in CUDA-Z? lspci shows the link running at 5 GT/s during the test. Will x8 give the researcher significantly poorer performance?

Please help! I’m really stuck.

My guess is that the low bandwidth is caused by the two-socket motherboard.

Looking at the functional architecture of the Intel W2600CR motherboard (page 20 of [url]http://download.intel.com/support/motherboards/server/sb/g34153003_s2600ip_w2600cr_tps_r110.pdf[/url]), could it be that when you use pageable memory, the host memory is allocated on CPU-1 while the GPU is connected to CPU-2? That would mean every copy goes from CPU-1’s memory to the GPU via CPU-2, resulting in low bandwidth. If you allocate pinned memory via cudaHostAlloc(), that call knows which GPU you are going to use (via cudaSetDevice(), defaulting to GPU 0) and can allocate host memory on the CPU that is directly connected to that GPU, avoiding the extra hop through the other socket. I’m not sure about this, though; I have never tested multi-socket systems in combination with CUDA, but this would be my guess.
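For reference, here is a minimal sketch (not the actual bandwidthTest or CUDA-Z source; the helper name h2d_MiBps and the iteration count are arbitrary) of what the two numbers are measuring: the same 32 MiB host-to-device copy, once from a malloc’d (pageable) buffer and once from a cudaHostAlloc’d (pinned) buffer.

// Sketch only: compare pageable vs. pinned host-to-device bandwidth.
// Build: nvcc h2d_bw.cu -o h2d_bw
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

static float h2d_MiBps(void *host, void *dev, size_t bytes, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (float)(bytes >> 20) * iters / (ms / 1000.0f);   // MiB per second
}

int main(void)
{
    const size_t bytes = 32u << 20;          // 32 MiB, the size bandwidthTest reports
    const int iters = 20;                    // arbitrary

    cudaSetDevice(0);                        // select the GPU before allocating

    void *dev = NULL;
    cudaMalloc(&dev, bytes);

    void *pageable = malloc(bytes);          // ordinary, pageable host memory
    memset(pageable, 0, bytes);              // touch the pages before timing
    void *pinned = NULL;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);   // page-locked host memory

    printf("pageable H2D: %.1f MiB/s\n", h2d_MiBps(pageable, dev, bytes, iters));
    printf("pinned   H2D: %.1f MiB/s\n", h2d_MiBps(pinned,   dev, bytes, iters));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}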

Excellent suggestion!

However, according to the picture, the single slot that is x16 electrical is connected to CPU-1. I took all the other cards out of the system, so when I run the test it automatically runs against the card attached to CPU-1. You can also use the taskset command under Linux to pin the process to a specific core. The E5-2620 is a 6-core CPU, so there are 12 logical cores per socket with Hyper-Threading. Let’s walk through running bandwidthTest on each core and see whether the results differ…

Starting with core 0 …

% taskset -c 0 ./bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX 670
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2951.4

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2768.9

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 151036.9

Interesting that -c 0 through -c 5 all produce very similar results. However, I did notice a small decrease with -c 6 through -c 11. (It’s not clear whether those are the "fake" logical CPUs that show up because of Hyper-Threading or the cores of the second physical CPU; a way to check via sysfs is sketched at the end of this post.)

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX 670
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2897.8

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2467.9 <---- this number goes down

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 151006.1

… no matter how many times I ran it, that second number was smaller…

Then with -c 12 through -c 17, the numbers are back up again …

Device 0: GeForce GTX 670
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2975.7

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2768.5

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 151051.7

and then lower again for -c 18 through -c 23

[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX 670
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2892.4

Device to Host Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2479.6

Device to Device Bandwidth, 1 Device(s)
PAGEABLE Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 151030.0

But still nowhere near the 4000+ MB/s on the old Core i7 system.

Can any of this be because of the speed of the processor?

Ideas?
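By the way, here is a small sketch (just a generic Linux sysfs reader, nothing specific to this board) of how one could check which physical socket each logical CPU number maps to, to tell Hyper-Threading siblings apart from the second socket:

/* Sketch: map each logical CPU to its physical socket and core by reading
 * Linux sysfs. Logical CPUs on the second socket report physical_package_id 1;
 * Hyper-Threading siblings share the same package_id and core_id.
 * Build: gcc cpu_map.c -o cpu_map
 */
#include <stdio.h>

static int read_id(int cpu, const char *what)
{
    char path[128];
    FILE *f;
    int id = -1;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, what);
    f = fopen(path, "r");
    if (!f)
        return -1;                     /* CPU (or attribute) not present */
    if (fscanf(f, "%d", &id) != 1)
        id = -1;
    fclose(f);
    return id;
}

int main(void)
{
    int cpu;
    for (cpu = 0; ; ++cpu) {
        int socket = read_id(cpu, "physical_package_id");
        int core   = read_id(cpu, "core_id");
        if (socket < 0)
            break;                     /* ran past the last logical CPU */
        printf("logical CPU %2d -> socket %d, core %d\n", cpu, socket, core);
    }
    return 0;
}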

Here’s an interesting result: remove the second CPU (and the 32 GB of memory attached to it), and the performance test yields the expected numbers. Even though we weren’t using the second CPU or its memory, just having it there reduced pageable memory performance. The question is: why!?

That’s a strong indication that the memory traffic to/from the GPU was being routed through the "far" socket. I am not an expert on this, but I recall that Linux has mechanisms for controlling memory affinity as well as CPU affinity. You will have to control both, to make sure the system memory used during the test is the memory attached to the "near" socket, in addition to making sure the GPU is driven from the "near" socket (which you already seem to be doing). Don’t despair; these are normal growing pains when moving to a NUMA system.

This website appears to provide a useful overview of how to control CPU and memory mappings in a NUMA environment:

[url]https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/main-cpu.html[/url]
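For quick testing, something like "numactl --cpunodebind=0 --membind=0 ./bandwidthTest --memory=pageable" should force both the process and its memory onto one node. In code, a sketch of the same idea with libnuma (assuming node 0 is the socket that owns the x16 slot, which you should verify with "numactl --hardware"; the file name numa_copy.cu is made up) would look like this:

/* Sketch (untested on this particular board) of controlling both CPU and memory
 * affinity around a CUDA transfer with libnuma.
 * Build: nvcc numa_copy.cu -lnuma -o numa_copy   (needs libnuma-dev / numactl-devel)
 */
#include <stdio.h>
#include <string.h>
#include <numa.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32u << 20;          /* 32 MiB, same size bandwidthTest uses */

    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports NUMA is not available\n");
        return 1;
    }

    /* Run this thread on the cores of node 0 (the socket assumed to own the GPU)... */
    numa_run_on_node(0);

    /* ...and take the host buffer from node 0's memory, so copies do not have to
     * cross the inter-socket QPI link. */
    void *host = numa_alloc_onnode(bytes, 0);
    if (host == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    memset(host, 0, bytes);                  /* touch the pages so they are placed now */

    /* Optionally page-lock the NUMA-placed buffer so it transfers like pinned memory. */
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);

    void *dev = NULL;
    cudaSetDevice(0);
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* time this as before */

    cudaFree(dev);
    cudaHostUnregister(host);
    numa_free(host, bytes);
    return 0;
}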