Abnormally Low Device To Host Memory Bandwidth

Hi,

I have run the bandwidthTest from the SDK, and also wrote my own code to measure memory bandwidth with pinned memory, on two systems with slightly different architectures.
The first system is a dual-socket server connected to two C1060s through one IOH (Tylersburg). The second one is the same but with two IOHs.
On the first system I obtained ~6 GB/s for both Host to Device and Device to Host transfers. On the second system, the Host to Device bandwidth is the same, but the Device to Host falls to only ~3.5 GB/s. I have no idea where that big difference might come from. Has anyone faced the same problem? Does anyone have any idea?
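For reference, here is a minimal sketch of the kind of pinned-memory timing loop I am talking about (buffer size and repeat count are arbitrary here, not necessarily the exact values I used):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t bytes = 32 << 20;   /* 32 MB per transfer */
    const int    reps  = 100;

    void *h = NULL, *d = NULL;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);   /* pinned host buffer */
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   /* the slow direction */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Device to Host: %.2f GB/s\n", (reps * (bytes / 1e9)) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}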

Thanks in advance for your help.

Best Regards

It’s because the PCIe topology of a dual X58 Tylersburg machine is such that one GPU is closer to one particular memory pool and vice-versa. If you want really good transfer performance, you need to use numactl to allocate memory on the “correct” CPU for whichever GPU you’re using.

We tried something: we removed the memory from one of the two CPUs and removed one Tesla card, which is effectively equivalent to removing one CPU and having only one Tylersburg. The results are exactly the same, so I am not sure this comes from the NUMA configuration.

I’m not sure that this is the cause either, but you still have a setup where there could be an extra hop. Let me try to explain with an ASCII diagram…

RAM 0 <-> CPU 0 <-> X58 0 <-> PCIe <-> GPU 0
            |         |
RAM 1 <-> CPU 1 <-> X58 1 <-> PCIe <-> GPU 1

That’s the general topology of a dual-Tylersburg system. In my tests, having to take the extra hop across to the other X58 could carry a significant bandwidth penalty. Might not be your problem, though, if it only shows up in DtoH.

Look at lspci -t, figure out which C1060 is closest to which CPU, and use numactl to force all allocations of that process to that CPU. If that doesn’t change anything, it’s probably a BIOS problem. There’s very, very little we can do to affect PCIe performance in the driver that isn’t immediately visible on all platforms, and as far as I’m aware nothing is seriously wrong right now.
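If you would rather control placement inside the program than wrap the whole process with numactl, something along these lines with libnuma should have a similar effect. The node/GPU pairing below (node 0 with GPU 0) is just an example, not necessarily right for your box; verify it against lspci -t and numactl --hardware, and link with -lnuma:

#include <numa.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int gpu  = 0;   /* example pairing: the GPU hanging off IOH/node 0 */
    const int node = 0;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    numa_run_on_node(node);    /* keep this thread on the near socket      */
    numa_set_preferred(node);  /* prefer allocations from that node's RAM  */

    cudaSetDevice(gpu);

    void  *h = NULL, *d = NULL;
    size_t bytes = 64 << 20;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);  /* pinned pages should now land on node 0 */
    cudaMalloc(&d, bytes);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* the direction that is slow for you */

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}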

We have run into the same problem. We are using a dual-IOH system (Supermicro) with a Tesla S1070 and a Quadro FX 3800. We paid attention to the NUMA placement during the tests (indeed, we did see a bandwidth difference when the memory was allocated on the distant NUMA node). All configurations gave low device-to-host bandwidth (~3.5 GB/s), but excellent host-to-device bandwidth.

When the Quadro FX is placed in a slot on the same motherboard that only carries x8 Gen2 signals, the device-to-host bandwidth is ~3.2 GB/s, dangerously close to the x16-slot result (raw x8 Gen2 tops out around 4 GB/s, so the x16 slot should be doing much better).

[All tests were done using CUDA 2.3, XP x64, Intel chipset driver 9.10.1014]

We have contacted Supermicro and are waiting for their reply now. Hope there will be a solution soon.

Thanks and best regards.