I bought a system for a researcher to use for CUDA experiments under Linux. The system has an Intel W2600CR motherboard with dual Xeon E5-2620 CPUs at 2 GHz, 64 GB of DDR3 1600 MHz memory, and four ASUS GTX 670 cards with 2 GB of GDDR5 memory each. He is seeing poor CUDA transfer performance with pageable memory. If I run CUDA-Z on this system, I get these results:
Memory Copy
Host Pinned to Device: 5881.56 MiB/s
Host Pageable to Device: 2615.68 MiB/s
Device to Host Pinned: 4585.86 MiB/s
Device to Host Pageable: 2089.42 MiB/s
Device to Device: 69.1433 GiB/s
If I take this same card and put it in an Intel DX58SO motherboard with one Core i7-950 at 3 GHz and 4 GB of memory (from 2009!!), I get these numbers:
Memory Copy
Host Pinned to Device: 5898.45 MiB/s
Host Pageable to Device: 4253.14 MiB/s
Device to Host Pinned: 6053.43 MiB/s
Device to Host Pageable: 4022.74 MiB/s
Device to Device: 70.279 GiB/s
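To put the two result sets side by side, here is a small Python sketch that expresses pageable throughput as a fraction of pinned throughput on each system. The figures are copied from the CUDA-Z output above; the dictionary keys and the function name `pageable_fraction` are just names made up for this illustration.

```python
# CUDA-Z host/device copy rates in MiB/s, copied from the two result sets above.
# "h2d" = host to device, "d2h" = device to host (labels chosen for this sketch).
results = {
    "xeon": {
        "h2d_pinned": 5881.56, "h2d_pageable": 2615.68,
        "d2h_pinned": 4585.86, "d2h_pageable": 2089.42,
    },
    "i7": {
        "h2d_pinned": 5898.45, "h2d_pageable": 4253.14,
        "d2h_pinned": 6053.43, "d2h_pageable": 4022.74,
    },
}


def pageable_fraction(system, direction):
    """Return pageable throughput divided by pinned throughput."""
    r = results[system]
    return r[f"{direction}_pageable"] / r[f"{direction}_pinned"]


for system in results:
    for direction in ("h2d", "d2h"):
        frac = pageable_fraction(system, direction)
        print(f"{system} {direction}: pageable is {frac:.0%} of pinned")
```

On the Xeon system pageable copies come out at a bit under half of the pinned rate in both directions, while on the Core i7 system they stay at roughly two thirds or better.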
Why would the pageable performance differ so much between systems? Why would "pinned" performance be okay? Could this have anything to do with the 3 GHz processor in the (older) Core i7 versus the 2 GHz processor in the newer Xeon? Memory in the Core i7 system is slower. Both cards are running PCIe 2.0 in an x16 slot. Actually, the W2600CR has PCIe 3.0 support, but its C600-A chipset is based on X79, and hence there are problems with PCIe 3.0 and the 670, so I’ve given up on PCIe 3.0.
I’ve had our vendor going back and forth with Intel, but absolutely nothing is coming of it, and I’m stuck. We can’t return the system. The last response from Intel (the "we don’t know why" response) was: "If the video card is not validated or compatible with the Intel® Server Board, the Bus communications will affected and it will create low performance." This doesn’t help at all.
Memory in both systems is optimized.
Test results with CUDA-Z and bandwidthTest --memory=pageable show similar results.
Likewise, if I take the GTX 580 that was in the Core i7 system and move it to the Xeon system, its performance drops in the same way.
In addition, today I very sadly found that while the W2600CR motherboard has four x16 slots, only one of them is x16 electrical; the others are x8 electrical. For this testing I’m only using the x16 slot, but when I run the test in each of the other slots, the results are almost identical. Shouldn’t I get better performance in the real x16 slot when running CUDA-Z? lspci does show the link running at 5 GT/s during the test. Will x8 give the researcher significantly poorer performance?
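For reference, here is a rough Python sketch of the theoretical one-direction link rates at 5 GT/s for x16 and x8. It assumes the 8b/10b encoding used by PCIe 2.0 (8 payload bits per 10 bits on the wire) and ignores packet and protocol overhead, so real transfers will always be lower. The function names are invented for this sketch.

```python
def pcie_gbytes_per_s(gt_per_s, lanes, encoding_efficiency=0.8):
    """Theoretical one-direction rate in GB/s (1 GB = 1e9 bytes).

    Each transfer carries one bit per lane; 8b/10b encoding leaves
    80% of those bits as payload. Protocol overhead is ignored.
    """
    return gt_per_s * encoding_efficiency * lanes / 8


def mib_s_to_gb_s(mib_per_s):
    """Convert MiB/s (as reported by CUDA-Z) to GB/s."""
    return mib_per_s * 2**20 / 1e9


x16_limit = pcie_gbytes_per_s(5.0, 16)
x8_limit = pcie_gbytes_per_s(5.0, 8)
measured = mib_s_to_gb_s(5881.56)  # Xeon host-pinned-to-device figure from above

print(f"x16 @ 5 GT/s: {x16_limit:.1f} GB/s")
print(f"x8  @ 5 GT/s: {x8_limit:.1f} GB/s")
print(f"measured pinned host-to-device: {measured:.2f} GB/s")
```

By this estimate an x8 link at 5 GT/s tops out at about 4 GB/s, while the pinned host-to-device figure I measured is about 6.2 GB/s. That is part of why getting nearly identical numbers in the other slots confuses me; the negotiated width reported in the `LnkSta` line of `lspci -vv` may be worth comparing across slots.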
Please help! I’m really stuck.