When I execute the example bandwidthTest in CUDA 4.0 on a C2070,
I get a host-to-device bandwidth of 3436 MB/s.
Is that the maximum possible bandwidth?
According to one of the pdf documents that came with CUDA 4.0,
when using pinned memory one can expect bandwidths in excess of 5 GB/s.
So, how does one get the maximum possible host-to-device bandwidth in CUDA 4.0 on a C2070?
Motherboards based on the Intel X58 chipset often get 6-6.5 GB/sec when RAM is installed in a triple channel configuration.
The limit on maximum bandwidth between the CPU and multiple CUDA devices is set by the link between the CPU and the motherboard chipset. On Intel X58 motherboards, this is QPI, and on AMD systems, it is HyperTransport. Both standards generally deliver something like 12 GB/sec in each direction, so two CUDA devices at roughly 6 GB/sec each are enough to max out a single QPI or HT link. To get maximum bandwidth to more than two devices, you need multiple links, which means a dual CPU socket motherboard with PCI-Express slots attached to each CPU. Dual X58 server motherboards exist, but I have no experience with them. There were reports of odd bandwidth behavior when they first came out, but those issues may have been fixed with BIOS updates.
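For the single-device case in the question, it is worth first checking whether the 3436 MB/s figure came from pageable or pinned host memory. As far as I know, the SDK's bandwidthTest accepts a --memory=pinned switch for this. Below is a minimal sketch of the same comparison using only the runtime API. The 32 MB transfer size, the repetition count, and the helper name h2d_mb_per_s are my own choices for illustration, not anything taken from the SDK sample, and the numbers you get will depend on your chipset and memory configuration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical helper: time `reps` host-to-device copies with CUDA events
// and return the average rate in MB/s (1 MB = 2^20 bytes).
static double h2d_mb_per_s(void *dst, const void *src, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy so one-time setup cost is not measured.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    double megabytes = (double)bytes * reps / (1024.0 * 1024.0);
    return megabytes / (ms / 1000.0);
}

int main(void)
{
    const size_t bytes = 32u << 20;  // 32 MB per copy (assumed size)
    const int reps = 10;             // assumed repetition count

    void *d_buf = NULL;
    CHECK(cudaMalloc(&d_buf, bytes));

    // Pageable host memory: ordinary malloc.
    void *h_pageable = malloc(bytes);
    if (h_pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    memset(h_pageable, 0, bytes);

    // Pinned (page-locked) host memory: allocated through the CUDA runtime.
    void *h_pinned = NULL;
    CHECK(cudaMallocHost(&h_pinned, bytes));
    memset(h_pinned, 0, bytes);

    printf("pageable H2D: %.1f MB/s\n",
           h2d_mb_per_s(d_buf, h_pageable, bytes, reps));
    printf("pinned   H2D: %.1f MB/s\n",
           h2d_mb_per_s(d_buf, h_pinned, bytes, reps));

    CHECK(cudaFreeHost(h_pinned));
    free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Build it with something like nvcc -o h2d_bw h2d_bw.cu (the file name is arbitrary). If the pinned figure is still well below what the chipset should deliver, the things to look at are on the hardware side, such as which slot the card is in and how the RAM is populated.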