Abnormally Low Device To Host Memory Bandwidth

Hi,

I have run the bandwidthTest from the SDK, and also wrote my own code to measure memory bandwidth with pinned memory, on two systems with slightly different architectures.
The first system is a dual-socket server connected to two C1060s through one IOH (Tylersburg). The second one is the same but with two IOHs.
On the first system I obtained ~6 GB/s for both Host to Device and Device to Host transfers. On the second system, the Host to Device bandwidth is the same, but the Device to Host falls to only ~3.5 GB/s. I have no idea where that big difference might come from. Has anyone faced the same problem? Does anyone have any idea?
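For reference, here is a minimal sketch of the kind of pinned-memory timing loop I am talking about (buffer size and repeat count are arbitrary here, not necessarily the exact values I used):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 << 20;   /* 32 MB per transfer */
    const int    reps  = 100;

    void *h = NULL, *d = NULL;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);   /* pinned host buffer */
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   /* the slow direction */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device to Host: %.2f GB/s\n", (reps * (bytes / 1e9)) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}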

Thanks in advance for your help.

Best Regards

It’s because the PCIe topology of a dual X58 Tylersburg machine is such that one GPU is closer to one particular memory pool and vice-versa. If you want really good transfer performance, you need to use numactl to allocate memory on the “correct” CPU for whichever GPU you’re using.

We tried something: we removed the memory from one of the two CPUs and removed one Tesla card, which is effectively equivalent to removing one CPU and having only one Tylersburg. The results are exactly the same, so I am not sure this comes from the NUMA configuration.

I’m not sure that this is the cause either, but you still have a setup where there could be an extra hop. Let me try to explain with an ASCII diagram…

RAM 0 <-> CPU 0 <-> X58 0 <-> PCIe <-> GPU 0
            |         |
RAM 1 <-> CPU 1 <-> X58 1 <-> PCIe <-> GPU 1

That’s the general topology of a dual-Tylersburg system. In my tests, having to take the extra hop across to the other X58 could carry a significant bandwidth penalty. Might not be your problem, though, if it only shows up in DtoH.

Look at lspci -t, figure out which C1060 is closest to which CPU, and use numactl to force all allocations of that process to that CPU. If that doesn’t change anything, it’s probably a BIOS problem. There’s very, very little we can do to affect PCIe performance in the driver that isn’t immediately visible on all platforms, and as far as I’m aware nothing is seriously wrong right now.
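If you would rather control placement inside the program than wrap the whole process with numactl, something along these lines with libnuma should have a similar effect. The node/GPU pairing below (node 0 with GPU 0) is just an example, not necessarily right for your box; verify it against lspci -t and numactl --hardware, and link with -lnuma:

#include <numa.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int gpu  = 0;   /* example pairing: the GPU hanging off IOH/node 0 */
    const int node = 0;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    numa_run_on_node(node);    /* keep this thread on the near socket      */
    numa_set_preferred(node);  /* prefer allocations from that node's RAM  */

    cudaSetDevice(gpu);

    void  *h = NULL, *d = NULL;
    size_t bytes = 64 << 20;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);  /* pinned pages should now land on node 0 */
    cudaMalloc(&d, bytes);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* the direction that is slow for you */

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}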

We have run into the same problem. We are using a dual-IOH system (Supermicro) with a Tesla S1070 and a Quadro FX 3800. We paid attention to the NUMA placement during the tests (indeed, we did see a bandwidth difference when the memory was allocated on the distant NUMA node). All configurations gave low device-to-host bandwidth (~3.5 GB/s), but excellent host-to-device bandwidth.

When the Quadro FX is placed in a slot on the same motherboard that only carries x8 Gen2 signals, the device-to-host bandwidth is ~3.2 GB/s, dangerously close to the x16-slot result (raw x8 Gen2 tops out around 4 GB/s, so the x16 slot should be doing much better).

[All tests were done using CUDA 2.3, XP x64, Intel chipset driver 9.10.1014]

We have contacted Supermicro and are waiting for their reply now. Hope there will be a solution soon.

Thanks and best regards.