Recommended Hardware Configuration for Higher Transfer Rates

Which hardware configuration offers the highest memory transfer rate between CPU and GPU? The CUDA 2.1 FAQ (Q18) gives some pointers, but it is still a bit sketchy.

Which GPU/Chipset/Memory/CPU combination is recommended these days? Please post your findings.

Search the forums for Core i7 tests. They currently blow everything else out of the water. One HOOMD user on such a system is reporting 8.68 GiB/s host <-> device transfer rates. Note that with the integrated memory controller in Core i7, the faster the CPU clock, the faster your host <-> device transfer rates will be.

Mr. Anderson, as far as I remember, the PCI-E 2.0 x16 link speed cannot exceed the theoretical 8 GB/s limit (http://en.wikipedia.org/wiki/PCI_Express). Did you overclock your system? Also, the PCI-E 2 slots are connected to the motherboard’s chipset, X58 for Core i7 (http://en.wikipedia.org/wiki/File:X58_Block_Diagram.png), so if your host’s memory bandwidth is fast enough you can get close to the peak speed even with previous-gen CPUs and chipsets, let’s say with X38. Am I missing something here?
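For reference, the 8 GB/s figure follows directly from the PCIe 2.0 signalling rate. Below is a minimal Python sketch of that arithmetic, assuming the published 5 GT/s per lane and 8b/10b encoding; the function and constant names are made up for illustration, and protocol overhead (which lowers real throughput further) is ignored.

```python
# Theoretical PCIe 2.0 bandwidth per direction, before protocol overhead.
GT_PER_S_PER_LANE = 5.0e9      # PCIe 2.0 raw signalling rate: 5 GT/s per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line code: 8 data bits per 10 bits sent


def pcie2_bytes_per_second(lanes):
    """Return the theoretical one-way payload rate of a PCIe 2.0 link in bytes/s."""
    data_bits_per_second = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return data_bits_per_second / 8


x16 = pcie2_bytes_per_second(16)
print(f"x16: {x16 / 1e9:.2f} GB/s = {x16 / 2**30:.2f} GiB/s per direction")
```

Note the GB versus GiB distinction: 8 GB/s is only about 7.45 GiB/s, so a reading of 8.68 GiB/s cannot come from a single x16 link in one direction.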

I don’t know the specifics of the system. As I said, this was a benchmark posted by a user. It was also using PCIe H<->D transfers between two GPUs, so the theoretical peak would actually be 16 GiB/s. I had never even seen > 6 GiB/s before in that scenario, so the number stuck in my head.

As for the idea that previous generation CPUs “should be” capable of getting close to peak, it is a very hit or miss situation. Talking about bandwidthTest --memory=pinned results here (single GPU) so we are all on the same page: depending on the chipset, systems usually clock in at 3 - 4 GiB/s on PCIe gen2 hardware. Prior to Core i7, there were only a select few chipsets/configurations that could attain 6 GiB/s.

Some threads that come up on a Google search and list bandwidth in connection with a specific chipset:
http://forums.nvidia.com/index.php?showtopic=86536
http://forums.nvidia.com/index.php?showtopic=83220
http://forums.nvidia.com/index.php?showtop…mp;#entry463901
http://forums.nvidia.com/index.php?showtopic=82115
http://forums.nvidia.com/index.php?showtop…mp;#entry525435

I’m the user in question. The system is a Core i7 920 overclocked to 965 speeds, with the memory kept at its rated 1333 MHz.

For a single GTX280 in this system:

[codebox]hpc-user@gpu-hpc:~$ bandwidthTest --device=0 --memory=pinned --mode=quick
Running on…
  device 0:GeForce GTX 280
Host to Device Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5704.0

Device to Host Bandwidth for Pinned memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5618.9
[/codebox]

More interesting, for three GTX280s at once:

[codebox]hpc-user@gpu-hpc:~$ bandwidthTest --device=all --memory=pinned --mode=quick
!!!Cumulative Bandwidth to be computed from all the devices !!!
Running on…
  device 0:GeForce GTX 280
  device 1:GeForce GTX 280
  device 2:GeForce GTX 280
Quick Mode
Host to Device Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 17115.1

Quick Mode
Device to Host Bandwidth for Pinned memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 16859.3
[/codebox]

I believe these numbers (~17GB/s) represent the bandwidth limits of the triple-channel DDR3 system RAM.
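As a rough point of comparison for the memory-bandwidth explanation, the theoretical peak of triple-channel DDR3-1333 can be worked out as follows. This is only a sketch: it assumes a 64-bit (8-byte) data bus per channel and 1333 MT/s, the names are illustrative, and sustained bandwidth on a real system is lower than this peak.

```python
# Theoretical peak bandwidth of a multi-channel DDR3 configuration.
TRANSFERS_PER_SECOND = 1333e6   # DDR3-1333: about 1333 million transfers/s
BYTES_PER_TRANSFER = 8          # ASSUMPTION: standard 64-bit channel width
CHANNELS = 3                    # triple-channel, as on the X58/Core i7 system


def ddr_peak_bytes_per_second(transfers_per_s, bytes_per_transfer, channels):
    """Peak rate = transfers/s x bytes per transfer x number of channels."""
    return transfers_per_s * bytes_per_transfer * channels


peak = ddr_peak_bytes_per_second(TRANSFERS_PER_SECOND, BYTES_PER_TRANSFER, CHANNELS)
print(f"Triple-channel DDR3-1333 peak: {peak / 1e9:.1f} GB/s")
```

The ~17 GB/s reading sits well below this theoretical ceiling, so the comparison neither confirms nor rules out system RAM as the limiting factor.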

For a single GPU, an X38 or AMD 790FX (better) system can come close to the bandwidths reported above, but for multiple GPUs, nothing comes close to the X58/Core i7 combination from Intel.

Details of the numbers MrAnderson42 referred to can be found at:

http://groups.google.ca/group/hoomd-users/…d07a29946215a93

Hi,

Thanks for clarifying. I also tested on a stock Core i7 920 system and got ~5.6 GB/s transfer speed to a single GTX 285. If you have multiple GPUs and come close to utilizing all of your chipset’s PCI-e 2 lanes, then the system’s memory bandwidth is certainly a concern.

P.S. Overclocking the 920 to about 3.2 GHz sounds very interesting. How stable is it for CUDA under Linux (if you use it)? Is there anything in particular to watch out for, or can I simply change the frequency in the BIOS and be happy? The case I am using is well ventilated (Armor+), so overheating should not be a big problem.

The only concern for overclocking the i7 920 was to keep the memory speed within the rated limits of the sticks. No voltage bumps or other fanciness involved.

The system is completely stable with the 920 @ 3.2 GHz for anything thrown at it. The most strenuous CPU test was the HPL benchmark run under OpenMPI on all four cores for 36 hours without issue.

The OS is Ubuntu 8.04.2 and CUDA also runs without problems. We have run MisterAnderson42’s HOOMD 0.8.1 validation routine in endurance mode on the three GPUs in this system for upwards of three days.

For more taxing GPU runs, the biggest issue is cooling for the GPUs. The CPU+motherboard generate far less heat than a single GTX280.

Thanks :thumbup: , I shall try this sometime soon :)

ldpaniak, FYI, the --device=all measurement in bandwidthTest is not a true concurrent bandwidth test (actually it’s not a concurrent bandwidth test in any way whatsoever). If you want a meaningful concurrent bandwidth test (hint: it will be a lot lower than what you saw), use my concurrent bandwidth test. Linux only because I am a lazy person. You can also use my dgemmSweep app (search the forums, it’s here somewhere) for validation of power/cooling over a long period of time.

(just going off the top of my head, the absolute theoretical maximum concurrent bandwidth for a single X58 chipset is 13.5 GB/s, or approximately 375 MB/s per PCIe lane)
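The numbers already posted in this thread can be checked against both points with a few lines of Python. This is a sketch using only figures quoted above; the variable names are illustrative. The --device=all results come out almost exactly three times the single-GPU results, which is consistent with per-device measurements being added up rather than measured concurrently (it does not by itself show how bandwidthTest works internally).

```python
# Figures copied from the bandwidthTest runs earlier in the thread (MB/s).
single_gpu = {"HtoD": 5704.0, "DtoH": 5618.9}     # --device=0
all_gpus = {"HtoD": 17115.1, "DtoH": 16859.3}     # --device=all
NUM_GPUS = 3

ratios = {}
for direction, rate in single_gpu.items():
    scaled = NUM_GPUS * rate
    ratios[direction] = all_gpus[direction] / scaled
    print(f"{direction}: 3 x single = {scaled:.1f} MB/s, "
          f"reported = {all_gpus[direction]:.1f} MB/s, "
          f"ratio = {ratios[direction]:.4f}")

# Lane count implied by the quoted chipset ceiling and per-lane rate.
CHIPSET_LIMIT_MB_S = 13500.0   # 13.5 GB/s, as quoted
PER_LANE_MB_S = 375.0          # as quoted
implied_lanes = CHIPSET_LIMIT_MB_S / PER_LANE_MB_S
print(f"Implied PCIe lanes: {implied_lanes:.0f}")
```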

Thanks for the tip.

For completeness:

[codebox]hpc-user@gpu-hpc:~$ ./concBandwidthTest 0 1 2
Device 0 took 2249.793457 ms
Device 1 took 2091.219482 ms
Device 2 took 1128.868896 ms
Average HtoD bandwidth in MB/s: 11574.512451

Device 0 took 2172.550049 ms
Device 1 took 2428.679443 ms
Device 2 took 1404.319092 ms
Average DtoH bandwidth in MB/s: 10138.392578
[/codebox]

Much more reasonable for DDR3-1333 and under the 13.5GB/s limit.
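To see how the reported totals relate to the per-device timings, here is a small Python sketch. It assumes each device moved 6400 MB in total; that value is inferred from the numbers (it is the amount that makes the sums match) and is not stated anywhere in the thread. Under that assumption the "Average" line is the sum of the three per-device rates.

```python
# Timings copied from the concBandwidthTest output above (milliseconds).
htod_ms = [2249.793457, 2091.219482, 1128.868896]
dtoh_ms = [2172.550049, 2428.679443, 1404.319092]

# ASSUMPTION: megabytes moved per device; inferred from the reported totals,
# not documented in the thread.
MB_PER_DEVICE = 6400.0


def per_device_mb_per_s(times_ms, mb_per_device=MB_PER_DEVICE):
    """Convert each device's elapsed time into a transfer rate in MB/s."""
    return [mb_per_device / (t / 1000.0) for t in times_ms]


def aggregate_mb_per_s(times_ms, mb_per_device=MB_PER_DEVICE):
    """Sum the per-device rates to get the concurrent aggregate rate."""
    return sum(per_device_mb_per_s(times_ms, mb_per_device))


for label, times in (("HtoD", htod_ms), ("DtoH", dtoh_ms)):
    rates = per_device_mb_per_s(times)
    print(label, [round(r, 1) for r in rates], "total:", round(sum(rates), 1))
```

With that assumption, device 2 runs at roughly 5.7 GB/s host to device while devices 0 and 1 each get roughly 2.8 to 3.1 GB/s, so the three cards do not share the available bandwidth evenly.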

For stress testing systems, we have found that running simultaneous instances of HOOMD lj_liquid_bmark (one for each GPU), each with the maximum particle count that fits in the onboard RAM (~670000 particles for the GTX280), produces the greatest power draw at the wall. Such runs very quickly get the GPUs to full temperature and test case airflow.
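One way to start one instance per GPU is a small launcher script. The Python sketch below is only an illustration: it pins each child process to a GPU through the CUDA_VISIBLE_DEVICES environment variable (available in later CUDA releases), and the demo command is a stand-in that just prints the variable. The actual HOOMD benchmark invocation is not shown in this thread, so substitute your own command.

```python
import os
import subprocess
import sys


def launch_per_gpu(command, gpu_ids):
    """Start one copy of `command` per GPU, each seeing only its own device."""
    procs = []
    for gpu in gpu_ids:
        # Copy the current environment and restrict the child to a single GPU.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(
            subprocess.Popen(command, env=env, stdout=subprocess.PIPE, text=True)
        )
    return procs


def wait_all(procs):
    """Wait for every process; return (stdout, return code) pairs in launch order."""
    return [(p.communicate()[0], p.returncode) for p in procs]


# Stand-in command (hypothetical placeholder for the real benchmark):
# it only reports which GPU index the child process was given.
demo_command = [sys.executable, "-c",
                "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]

results = wait_all(launch_per_gpu(demo_command, [0, 1, 2]))
for output, returncode in results:
    print(f"GPU {output.strip()} finished with exit code {returncode}")
```

All three processes are started before any of them is waited on, so they run simultaneously, which is what drives all GPUs to full load at the same time.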