We recently purchased some Quadro 4000 GPUs at our company, and while running some initial tests we found a rather disturbing result. The bandwidthTest program that ships with the CUDA Toolkit reports a device-to-device bandwidth of approximately 45 GB/s, while the peak bandwidth in the specification is 89.6 GB/s, roughly twice what we measure. Since the applications we need to run on these GPUs are mainly limited by the rate at which global memory is accessed, this has raised some concerns among us.
My question is: is this an expected result? If not, does anyone have experience optimizing the device-to-device transfer rate to bring it closer to the peak bandwidth value?
We are running RHEL 5.5 with version 3.2 of the CUDA Toolkit for Linux.
Thanks in advance for your comments.