Very low device-to-device bandwidth with the bandwidthTest example from the SDK

I noticed that the FAQ gives NVIDIA’s figures for the bandwidths we should get when running the bandwidthTest example from the SDK, which made me realize I had never tried it. Here is the example given by NVIDIA:

And here is what I got:

I’m quite worried about the device-to-device bandwidth, which is very low. Any ideas? I have a Dell Precision 690 with a dual-core Xeon 3.0 GHz and a GeForce 8800 GTX, running Kubuntu Edgy (32-bit). I ran the test with CUDA 0.8.

The bandwidth figures in the FAQ were measured with the CUDA 0.9 release.

Fine, I will try with 0.9 soon. Thank you. But the FAQ post with these figures is dated the 22nd of May, and version 0.9 was released to developers on the 5th of June. I know it was written by someone from NVIDIA, but it wasn’t obvious that the figures came from 0.9.

Thank you anyway. My algorithm is highly dependent on global memory bandwidth, so I’m very happy to learn there was such an improvement.
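For anyone who wants to cross-check the device-to-device figure independently of bandwidthTest, here is a minimal sketch of one way to measure it: time a batch of cudaMemcpy calls between two device buffers with CUDA events. The buffer size, the iteration count and the helper name copyBandwidthMBs are my own choices for illustration, not taken from the SDK sample. I am also assuming the convention that a device-to-device copy counts each byte twice (one read plus one write), so the result may not be directly comparable if your reference figures count it differently.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: convert a timed run into MB/s.
// Assumption: each device-to-device copy reads and writes every byte,
// so the traffic is counted twice.
static double copyBandwidthMBs(size_t bytes, int iterations, float elapsedMs)
{
    double trafficMB = 2.0 * (double)bytes * iterations / (1024.0 * 1024.0);
    return trafficMB / (elapsedMs / 1000.0);
}

// Abort main() with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err));                         \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const size_t bytes = 32 * 1024 * 1024;  // illustrative buffer size
    const int iterations = 10;              // illustrative repeat count

    unsigned char *src = NULL;
    unsigned char *dst = NULL;
    CHECK(cudaMalloc((void **)&src, bytes));
    CHECK(cudaMalloc((void **)&dst, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed copy so that one-time setup cost is not measured.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < iterations; ++i) {
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float elapsedMs = 0.0f;
    CHECK(cudaEventElapsedTime(&elapsedMs, start, stop));

    printf("Device-to-device bandwidth: %.1f MB/s\n",
           copyBandwidthMBs(bytes, iterations, elapsedMs));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

If this number and the one reported by bandwidthTest on the same toolkit version roughly agree, the low figure is more likely down to the release than to the machine.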