Low Device to Device Bandwidth

I have a new i7 machine with Vista 64bit.

For some reason, I am getting poor performance in my bandwidth-limited application.

So I ran bandwidthTest.exe straight from the SDK, without any modifications.

With a Quadro FX 5800, I am getting 56GB/s (the reported figure is 102GB/s).

I then replaced that card with a GTX 285 and got 84GB/s (the reported figure is 159GB/s).

Does anyone have a clue what can hurt device-to-device bandwidth?



Can anyone post the results they get from running bandwidthTest.exe on one of the following devices: Quadro FX 5800 or GTX 285?


% …/bin/linux/release/bandwidthTest

Running on…

  device 0:GeForce GTX 285

Quick Mode

Host to Device Bandwidth for Pageable memory


Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2838.4

Quick Mode

Device to Host Bandwidth for Pageable memory


Transfer Size (Bytes) Bandwidth(MB/s)

33554432 2975.3

Quick Mode

Device to Device Bandwidth


Transfer Size (Bytes) Bandwidth(MB/s)

33554432 127582.8

&&&& Test PASSED

Press ENTER to exit…


So, you get 127GB/sec, which is still not the 159GB/sec that Nvidia reports, but it is far better than what I get on the same GTX 285 device (84GB/sec).

What OS are you using?
And what driver version and CUDA version?

Does anyone have a clue why I would have this problem?

It is not a malfunction of the device, since I tested two different devices.

What is the output of deviceQuery?
Have you connected both PCI-e power connectors?

SUSE 11.1, CUDA 2.2 beta, the driver is the one that goes with CUDA 2.2 beta (I don’t have the version number to hand right now).

Russell, what OS are you using?

What is SUSE?

I am using CUDA 2.1 on Vista 64bit with the latest driver from NVIDIA’s site for each of the cards (Quadro and GeForce).

Both connectors are connected to the PSU.

SUSE = Linux

OpenSUSE is Linux. I’m using version 11.1 (x86-64).

You could try using one of the many Linux distributions supported by CUDA. That would eliminate the Redmond factor.

I have tested the bandwidth of my device using my own copy kernel (and also using cudaMemcpy).

I was surprised to find out that on the GTX 285, I’m getting 125GB/sec.

So, I guess the problem is in the CUDA SDK test application (bandwidthTest.exe).

So, I’m in better shape than I thought :-)

Does anyone know why Russell and I get 125GB/sec rather than 159GB/sec (which is what Nvidia reports)?

That’s in fact quite normal. The number nVidia reports is the maximum theoretical bandwidth (bus width x frequency). That’s if the bus is 100% active for every clock cycle while the transfer takes place.

Unfortunately, theory and life rarely match up. Factor in memory latency and any time the GPU spends receiving commands, and you arrive at the real-life result you are getting.

I’m aware of that and in fact I do not expect to get exactly the theoretical bandwidth.

However, the test I made uses the memcpy command, which I expect to be extremely optimized. A kernel that implements such a copy operation is extremely short and hence doesn’t execute many instructions.

Moreover, I ran the test on a very large array (150MB). In such a case, the GPU is fully occupied and the latency should be completely hidden. I would like to know what exactly causes the slowdown and why.

The device performs context switches between blocks so it can hide the latency. Is it possible that this is what is causing the delay?