TESLA bandwidthTest results

I am working on a system with 4 x TESLA C1060. When I run the bandwidthTest (from the CUDA SDK) I get results like this (MB/s)

for two of the cards:

Host -> Device: 5300
Device -> Host: 4670
Device -> Device: 73400

for the other two:

Host -> Device: 4750
Device -> Host: 3150
Device -> Device: 73400

So I have a couple of questions. First, the Device -> Device figure looks a bit slow. The TESLA is advertised at 102 GB/s, and I get 90,000+ MB/s with a GTX260 on my home PC.
But I can’t find published bandwidthTest results for a TESLA. If you have one, I would be grateful if you could post your numbers for comparison.
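
For what it’s worth, my understanding is that the Device -> Device figure from bandwidthTest is essentially a timed cudaMemcpy with cudaMemcpyDeviceToDevice, with the traffic counted twice (each byte is read once and written once). A simplified sketch of that measurement (not the SDK source; the buffer size and iteration count here are arbitrary choices of mine):

```cpp
// Simplified device-to-device bandwidth measurement, roughly what I
// understand bandwidthTest to be timing (not the actual SDK code).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;   // 32 MB buffer (arbitrary)
    const int    iters = 100;        // arbitrary

    unsigned char *d_src = 0, *d_dst = 0;
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // count each byte twice: one read plus one write of device memory
    double gbps = 2.0 * bytes * iters / (ms / 1000.0) / 1e9;
    printf("Device -> Device: %.1f GB/s\n", gbps);

    cudaFree(d_src); cudaFree(d_dst);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```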

Second, why would two of the cards be slower than the other two for Host -> Device and Device -> Host transfers? Is that expected?

This is a Supermicro system with an X8DTG-QF motherboard: six physical PCIe 2.0 x16 slots (four with 16 lanes, two with 4 lanes). The OS is Linux (CentOS 5.3).

thanks in advance

Gareth Williams

I think that number is about right. The bandwidth test seems to hit about 75% of the theoretical peak memory bandwidth in device-to-device copies (your GTX260, for example, should be something around 120 GB/s theoretical). Most people report something around 75-80 GB/s for the C1060. It has a considerably lower memory clock than the consumer cards, which accounts for most of the difference.
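
For reference, the theoretical peak is just the effective memory data rate times the bus width. Plugging in the nominal C1060 figures (800 MHz GDDR3, so 1600 MT/s effective, on a 512-bit bus) gives the advertised number:

```cpp
// Back-of-the-envelope peak memory bandwidth from the nominal C1060 specs.
#include <cstdio>

int main()
{
    const double transfers_per_sec = 1600e6;   // 800 MHz GDDR3, double data rate
    const double bytes_per_transfer = 512 / 8; // 512-bit bus = 64 bytes
    printf("Peak: %.1f GB/s\n", transfers_per_sec * bytes_per_transfer / 1e9);
    // prints 102.4 GB/s, the advertised figure
    return 0;
}
```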

There are two reasons, I believe. The first is that this is a NUMA board, so you need to have set the processor affinity correctly, otherwise the memory transfers can potentially be coming from the other CPU's memory, which has higher latency because of the extra QPI hop. Even with that done, the second reason is less soluble. There seems to be some characteristic issue with these dual X58/5520 IO hub designs that gives rather asymmetrical bandwidth between PCI-e slots, and variation from slot to slot. Tyan have a board with a similar dual 5520 IO hub design, and it seems to have similar problems - see here for example.

Avidday - do you have a code sample, or can you explain the best/fastest way to work around the NUMA issues? Linux/Windows?

thanks

eyal

On Windows I have no idea. Under Linux, numactl can be used to control a process's CPU and memory affinity. If you look at the PCI-e device tree you should be able to work out which GPU is physically closest to a given CPU in the NUMA topology. The combination of numactl and CUDA device number selection should get you optimal settings, I think. I haven’t tried it on a dual Tylersburg system, as I don’t have access to any hardware. But something similar worked on an Opteron nForce 3600 machine I used to have access to.
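
Something along these lines should do the GPU-to-node mapping on Linux (an untested sketch from memory - it assumes a toolkit new enough to expose the PCI ID fields in cudaDeviceProp, and the usual sysfs layout):

```cpp
// Rough sketch only: map each CUDA device to its NUMA node via sysfs.
// Assumes cudaDeviceProp has the pciDomainID/pciBusID/pciDeviceID fields
// and that /sys/bus/pci/devices/... is laid out the usual way.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%04x:%02x:%02x.0/numa_node",
                 prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);

        int node = -1;                       // -1 means no NUMA info exposed
        if (FILE *f = fopen(path, "r")) {
            fscanf(f, "%d", &node);
            fclose(f);
        }
        printf("CUDA device %d (%s) -> NUMA node %d\n", dev, prop.name, node);
    }
    return 0;
}
```

Once you know which node each card hangs off, run the job with something like numactl --cpunodebind=N --membind=N and call cudaSetDevice() on a card attached to node N, so the pinned host buffers stay on the right side of the QPI link.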

Thanks avidday, that’s exactly what I needed to know.

There is only one Intel Xeon X5560 in this system (quad-core, and with Hyper-Threading it appears as 8 processors to the OS). So I guess it can’t be a NUMA issue and must be just the PCI-e problem? Is there a better choice of motherboard for running TESLAs?

cheers

Gareth

That is a surprise! I had always assumed that those dual-socket motherboards need two CPUs to POST. I am very intrigued as to how the ACPI configuration handles interrupt routing with two IOHs and only one CPU.

Anyway, I really don’t think you will do better than what you have. The consumer X58 boards with 4 PCI-e x16 slots only actually have 32 PCI-e lanes in total for GPUs (whereas your board has 64), and use an NVIDIA-made switching ASIC to multiplex pairs of GPUs onto a shared 16-lane link. The peak per-card bandwidth is OK, but because of the switch there is extra latency, and simultaneous transfer speeds are less than half of what a single card can achieve.
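
If you want to see that contention directly, a crude way (just a sketch, not tested on one of those boards - device numbers, buffer size and iteration count are placeholders) is to time pinned host-to-device copies on a pair of cards from two host threads at once and compare with the single-card figure:

```cpp
// Sketch: time pinned host->device copies on two GPUs simultaneously
// (one host thread per GPU) and compare with running just one of them.
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

static const size_t kBytes = 64 << 20;   // 64 MB pinned buffer (placeholder)
static const int    kIters = 50;         // placeholder

static void *copy_worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    void *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, kBytes);       // page-locked host memory
    cudaMalloc(&d_buf, kBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < kIters; ++i)
        cudaMemcpy(d_buf, h_buf, kBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device %d: %.0f MB/s\n",
           dev, kBytes / (1 << 20) * kIters / (ms / 1000.0));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

int main()
{
    int devs[2] = {0, 1};                 // pick a pair that shares one x16 link
    pthread_t threads[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], 0, copy_worker, &devs[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], 0);
    return 0;
}
```

On a switched pair the two simultaneous results should add up to noticeably less than twice the single-card number; on a board with dedicated x16 links per card they should stay much closer to it.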