TESLA bandwidthTest results

I am working on a system with 4 x TESLA C1060. When I run the bandwidthTest (from the CUDA SDK) I get results like this (MB/s)

for two of the cards:

Host -> Device: 5300
Device -> Host: 4670
Device -> Device: 73400

for the other two:

Host -> Device: 4750
Device -> Host: 3150
Device -> Device: 73400

So I have a couple of questions. First, the Device -> Device figure looks a bit slow. The TESLA is advertised at 102 GB/s, and I get 90,000+ MB/s with a GTX260 on my home PC.
But I can’t find published bandwidthTest results for a TESLA. If you have one, I would be grateful if you could post your numbers for comparison.
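
For what it’s worth, my understanding is that the Device -> Device figure from bandwidthTest is essentially a timed cudaMemcpy with cudaMemcpyDeviceToDevice, with the traffic counted twice (each byte is read once and written once). A simplified sketch of that measurement (not the SDK source; the buffer size and iteration count here are arbitrary choices of mine):

```cpp
// Simplified device-to-device bandwidth measurement, roughly what I
// understand bandwidthTest to be timing (not the actual SDK code).
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 << 20;   // 32 MB buffer (arbitrary)
    const int    iters = 100;        // arbitrary

    unsigned char *d_src = 0, *d_dst = 0;
    cudaMalloc((void **)&d_src, bytes);
    cudaMalloc((void **)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // count each byte twice: one read plus one write of device memory
    double gbps = 2.0 * bytes * iters / (ms / 1000.0) / 1e9;
    printf("Device -> Device: %.1f GB/s\n", gbps);

    cudaFree(d_src); cudaFree(d_dst);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return 0;
}
```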

Second, why would two of the cards be slower than the other two for Host -> Device and Device -> Host transfers? Is that expected?

This is a Supermicro system with an X8DTG-QF motherboard: six physical PCIe 2.0 x16 slots (four with 16 lanes, two with 4 lanes). The OS is Linux (CentOS 5.3).

thanks in advance

Gareth Williams

I think that number is about right. The bandwidth test seems to hit about 75% of the theoretical peak memory bandwidth in device-to-device copies (your GTX260, for example, should be something around 120 GB/s theoretical). Most people report something around 75-80 GB/s for the C1060. It has a considerably lower memory clock than the consumer cards, which accounts for most of the difference.
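
For reference, the theoretical peak is just the effective memory data rate times the bus width. Plugging in the nominal C1060 figures (800 MHz GDDR3, so 1600 MT/s effective, on a 512-bit bus) gives the advertised number:

```cpp
// Back-of-the-envelope peak memory bandwidth from the nominal C1060 specs.
#include <cstdio>

int main()
{
    const double transfers_per_sec = 1600e6;   // 800 MHz GDDR3, double data rate
    const double bytes_per_transfer = 512 / 8; // 512-bit bus = 64 bytes
    printf("Peak: %.1f GB/s\n", transfers_per_sec * bytes_per_transfer / 1e9);
    // prints 102.4 GB/s, the advertised figure
    return 0;
}
```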

There are two reasons, I believe. The first is that this is a NUMA board, so you need to have set the processor affinity correctly, otherwise the memory transfers can potentially be coming from the other CPU's memory, which has higher latency because of the extra QPI hop. Even with that done, the second reason is less soluble. There seems to be some characteristic issue with these dual X58/5520 IO hub designs that gives rather asymmetrical bandwidth between PCI-e slots, and variation from slot to slot. Tyan have a board with a similar dual 5520 IO hub design, and it seems to have similar problems - see here for example.

Avidday - do you have a code sample, or can you explain the best/fastest way to work around the NUMA issues? Linux/Windows?

thanks

eyal

On Windows I have no idea. Under Linux, numactl can be used to control a process's CPU and memory affinity. If you look at the PCI-e device tree you should be able to work out which GPU is physically closest to a given CPU in the NUMA topology. The combination of numactl and CUDA device number selection should get you optimal settings, I think. I haven’t tried it on a dual Tylersburg system, as I don’t have access to any hardware. But something similar worked on an Opteron nForce 3600 machine I used to have access to.
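
Something along these lines should do the GPU-to-node mapping on Linux (an untested sketch from memory - it assumes a toolkit new enough to expose the PCI ID fields in cudaDeviceProp, and the usual sysfs layout):

```cpp
// Rough sketch only: map each CUDA device to its NUMA node via sysfs.
// Assumes cudaDeviceProp has the pciDomainID/pciBusID/pciDeviceID fields
// and that /sys/bus/pci/devices/... is laid out the usual way.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int n = 0;
    cudaGetDeviceCount(&n);

    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%04x:%02x:%02x.0/numa_node",
                 prop.pciDomainID, prop.pciBusID, prop.pciDeviceID);

        int node = -1;                       // -1 means no NUMA info exposed
        if (FILE *f = fopen(path, "r")) {
            fscanf(f, "%d", &node);
            fclose(f);
        }
        printf("CUDA device %d (%s) -> NUMA node %d\n", dev, prop.name, node);
    }
    return 0;
}
```

Once you know which node each card hangs off, run the job with something like numactl --cpunodebind=N --membind=N and call cudaSetDevice() on a card attached to node N, so the pinned host buffers stay on the right side of the QPI link.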

Thanks avidday, that’s exactly what I needed to know.

There is only one Intel Xeon X5560 in this system (quad-core, and with Hyper-Threading it appears as 8 processors to the OS). So I guess it can’t be a NUMA issue and must be just the PCI-e problem? Is there a better choice of motherboard for running TESLAs?

cheers

Gareth

That is a surprise! I had always assumed that those dual-socket motherboards need two CPUs to POST. I am very intrigued as to how the ACPI configuration handles interrupt routing with two IOHs and only one CPU.

Anyway, I really don’t think you will do better than what you have. The consumer X58 boards with 4 PCI-e x16 slots only actually have 32 PCI-e lanes in total for GPUs (whereas your board has 64), and use an NVIDIA-made switching ASIC to multiplex pairs of GPUs onto a shared 16-lane link. The peak per-card bandwidth is OK, but because of the switch there is extra latency, and simultaneous transfer speeds are less than half of what a single card can achieve.
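
If you want to see that contention directly, a crude way (just a sketch, not tested on one of those boards - device numbers, buffer size and iteration count are placeholders) is to time pinned host-to-device copies on a pair of cards from two host threads at once and compare with the single-card figure:

```cpp
// Sketch: time pinned host->device copies on two GPUs simultaneously
// (one host thread per GPU) and compare with running just one of them.
#include <cstdio>
#include <pthread.h>
#include <cuda_runtime.h>

static const size_t kBytes = 64 << 20;   // 64 MB pinned buffer (placeholder)
static const int    kIters = 50;         // placeholder

static void *copy_worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    void *h_buf = 0, *d_buf = 0;
    cudaMallocHost(&h_buf, kBytes);       // page-locked host memory
    cudaMalloc(&d_buf, kBytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < kIters; ++i)
        cudaMemcpy(d_buf, h_buf, kBytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("device %d: %.0f MB/s\n",
           dev, kBytes / (1 << 20) * kIters / (ms / 1000.0));

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

int main()
{
    int devs[2] = {0, 1};                 // pick a pair that shares one x16 link
    pthread_t threads[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&threads[i], 0, copy_worker, &devs[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(threads[i], 0);
    return 0;
}
```

On a switched pair the two simultaneous results should add up to noticeably less than twice the single-card number; on a board with dedicated x16 links per card they should stay much closer to it.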