bandwidthTest in CUDA 4.0 on C2070

When I run the bandwidthTest example in CUDA 4.0 on a C2070,
I get a host-to-device bandwidth of 3436 MB/s.

Is the above the max possible bandwidth?

According to one of the PDF documents that came with CUDA 4.0,
when using pinned memory one can expect bandwidths in excess of 5 GB/s.
So, how does one get the max bandwidth possible from host to device in CUDA 4.0 on a C2070?

You did not say whether you passed any command-line options to the bandwidthTest example. It does not use pinned memory by default.

With pinned memory, I get host to device transfer bandwidth of about 5700 MB/s for transfer sizes of 4MB or more.
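For anyone who wants to see the pageable vs. pinned difference outside the SDK sample, here is a minimal sketch. It is not the bandwidthTest source; the 32 MB buffer, the 10 repetitions, and the helper names (CHECK, mbPerSec, timeH2D) are my own choices for illustration. As far as I know, the SDK sample itself switches modes with its --memory=pinned option.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Convert bytes moved and elapsed milliseconds to MB/s (1 MB = 2^20 bytes).
static double mbPerSec(size_t bytes, float ms) {
    return (bytes / (1024.0 * 1024.0)) / (ms / 1000.0);
}

// Time `reps` host-to-device copies of `bytes` bytes and return MB/s.
static double timeH2D(void* dst, const void* src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return mbPerSec(bytes * reps, ms);
}

int main() {
    const size_t bytes = 32u << 20;  // 32 MB per copy (assumed size)
    const int reps = 10;             // assumed repetition count

    void* dev = 0;
    CHECK(cudaMalloc(&dev, bytes));

    // Pageable host memory: ordinary malloc.
    void* pageable = malloc(bytes);
    if (!pageable) { fprintf(stderr, "malloc failed\n"); return 1; }
    memset(pageable, 0, bytes);

    // Pinned (page-locked) host memory: allocated through the CUDA runtime.
    void* pinned = 0;
    CHECK(cudaMallocHost(&pinned, bytes));
    memset(pinned, 0, bytes);

    printf("pageable H2D: %.0f MB/s\n", timeH2D(dev, pageable, bytes, reps));
    printf("pinned   H2D: %.0f MB/s\n", timeH2D(dev, pinned, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

The numbers it prints will depend on the board, chipset, and transfer size, so treat them as a rough cross-check against bandwidthTest rather than a replacement for it.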

So, is that about the max that can be reached?

Is there any way to do better than that?

That is more or less the maximum. You can achieve slightly better values by switching to a different mainboard.

Which motherboards allow the max bandwidths?

Also, if there are several C2070 cards connected to the same motherboard with x16 slots, is it possible to maintain 5700 MB/s bandwidth to all of them at the same time? If so, how?

Motherboards based on the Intel X58 chipset often get 6-6.5 GB/sec when RAM is installed in a triple channel configuration.

The limit on maximum bandwidth between the CPU and multiple CUDA devices is set by the link between the CPU and the motherboard chipset. On the Intel X58 motherboards, this is QPI, and for AMD systems, it is HyperTransport. Both standards generally deliver something like 12 GB/sec in each direction, so you can max out a single QPI or HT link with two CUDA devices. To get maximum bandwidth to more than two devices, you need multiple links, which means a dual-CPU-socket motherboard with PCI-Express slots attached to each CPU. Dual X58 server motherboards exist, but I have no experience with them. There were reports of odd bandwidth behavior when they first came out, but those issues may have been fixed with BIOS updates.
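One way to check whether several cards can sustain full bandwidth at the same time is to start pinned host-to-device copies to every device at once and measure the aggregate rate. The following is only a sketch under my own assumptions: one host thread driving all devices (possible from CUDA 4.0 on), a 32 MB buffer per device, 10 repetitions, and wall-clock timing. It also ignores which CPU socket each host buffer lives on, which can matter on dual-socket boards.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Convert bytes moved and elapsed seconds to MB/s (1 MB = 2^20 bytes).
static double mbPerSec(size_t bytes, double seconds) {
    return (bytes / (1024.0 * 1024.0)) / seconds;
}

int main() {
    int n = 0;
    CHECK(cudaGetDeviceCount(&n));
    if (n == 0) { fprintf(stderr, "no CUDA devices found\n"); return 1; }

    const size_t bytes = 32u << 20;  // 32 MB per device per copy (assumed)
    const int reps = 10;             // assumed repetition count

    std::vector<void*> host(n), dev(n);
    std::vector<cudaStream_t> stream(n);
    for (int d = 0; d < n; ++d) {
        CHECK(cudaSetDevice(d));
        // Portable pinned memory is page-locked for all CUDA contexts.
        CHECK(cudaHostAlloc(&host[d], bytes, cudaHostAllocPortable));
        CHECK(cudaMalloc(&dev[d], bytes));
        CHECK(cudaStreamCreate(&stream[d]));
    }

    // Queue asynchronous copies on every device, then wait for all of them.
    std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        for (int d = 0; d < n; ++d) {
            CHECK(cudaSetDevice(d));
            CHECK(cudaMemcpyAsync(dev[d], host[d], bytes,
                                  cudaMemcpyHostToDevice, stream[d]));
        }
    }
    for (int d = 0; d < n; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamSynchronize(stream[d]));
    }
    std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("%d device(s), aggregate H2D: %.0f MB/s\n",
           n, mbPerSec(bytes * reps * n, seconds));

    for (int d = 0; d < n; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaStreamDestroy(stream[d]));
        CHECK(cudaFree(dev[d]));
        CHECK(cudaFreeHost(host[d]));
    }
    return 0;
}
```

If the aggregate figure stops scaling when more devices are added, that would be consistent with the shared CPU-to-chipset link being the bottleneck, though a sketch like this cannot tell you which link is saturated.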

I have access to 3 systems, each with a dual-CPU Xeon motherboard:

  1. Tyan S7015 motherboard that can accept 8 Tesla C2070 cards and has the following chipset:

Chipset IOH / ICH: Intel (2) 5520 / ICH10R

Super I/O: Winbond W83627

PCI-E Switch: PLX PEX8647

  2. Tyan S7025AGM2NR that can accept 4 Tesla C2070 cards and has

Chipset IOH / ICH: Intel (2) 5520 / ICH10R

Super I/O: Winbond W83627DHG

  3. Supermicro X8DTG-QF that can accept 4 Tesla C2070 cards and has

Chipset: Intel® 5520 (Tylersburg) chipset

ICH10R + 2x IOH-36D

So, what is the max bandwidth that I can expect to get from the two CPUs to the Tesla C2070 cards on each motherboard?

Is it 2 x 5.7 = 11.4 GBytes/s per CPU, i.e., 2 x 11.4 = 22.8 GBytes/s when both CPUs are used?