Bandwidth Device to Device - FAQ vs. reality: why is it slower?


Just read the new CUDA FAQ and was very surprised by the bandwidth test results.

Example measured numbers for a Core 2 Duo processor, ASUS P5N32-SLI motherboard with 1 GB memory, and a GeForce 8800 GTX are:

                    Pageable      Page-locked

Host - Device       1.7 GB/sec    3.1 GB/sec
Device - Host       1.7 GB/sec    3.1 GB/sec
Device - Device     70.7 GB/sec   70.7 GB/sec

On my office system with an 8800 GTX, the bandwidthTest --dtod result is 9.4 GB/sec.

(2x Xeon 3.6 GHz, Intel E7525 chipset, 8 GB, WinXP Pro).

That's a big difference from the FAQ results.

At home on an 8800 GTS/640, C2D E6400, ASUS P5LD2-C (i945P), 2 GB:

~3.5 GB/sec on Linux (an unsupported Ubuntu 6.10)

~7 GB/sec on WinXP Home

So maybe Mark Harris used a newer CUDA toolkit/SDK with big improvements?

How about the release candidate of CUDA 1.0?

Yes, these results were run on the CUDA 1.0 release (coming soon!).

Pardon my ignorance Simon, but aren't those device-to-device figures actually device-to-shared? The fundamental limit for a device-to-device copy has to be 1/2 the total memory bus bandwidth of 86 GB/sec on the 8800 GTX. Or are these quoted figures not the throughput of a copy from one device memory address range to another device memory address range?
Thanks, Eric
PS: Any up-to-date idea on how soon?
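The 86 GB/sec figure can be sanity-checked from the usual 8800 GTX memory specs; the specs below (384-bit bus, 900 MHz GDDR3 effective double data rate) are my own assumption, not numbers from this thread:

```python
# Back-of-envelope check of the ~86 GB/sec bus bandwidth quoted above.
# Assumed GeForce 8800 GTX memory specs (not stated in the thread):
bus_width_bits = 384       # memory bus width
mem_clock_hz = 900e6       # GDDR3 memory clock
transfers_per_clock = 2    # double data rate

peak_bytes_per_sec = (bus_width_bits / 8) * mem_clock_hz * transfers_per_clock
print(peak_bytes_per_sec / 1e9)  # ~86.4 GB/sec
```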

I think the maximum figure for D->D should actually be the memory bandwidth, as I assume the copy is done by the memory controller and does not need to pass data through the multiprocessors.


The only way the memory controller could do it faster is if memory were banked and it could write one bank while reading another, which would not be a general D->D copy. It is unlikely to work like this anyway, since the operation is not important enough to warrant dedicated hardware. It is not safe to assume anything on the G80!
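The two views above can be reconciled with a little arithmetic. A device-to-device copy reads each byte once and writes it once, so the bus carries twice the bytes of the copied region; if the reported number counts that read-plus-write traffic (which is my assumption about how the SDK bandwidthTest computes its --dtod figure, not something confirmed in this thread), it approaches the full bus bandwidth even though the copy itself moves data at half that rate:

```python
# Sketch of the D->D bandwidth accounting, assuming the reported figure
# counts read + write traffic (an assumption about bandwidthTest --dtod).
bus_bandwidth = 86.4e9     # theoretical peak bus bandwidth, bytes/sec
bytes_copied = 64 * 2**20  # example copy size: 64 MiB

# Best case: the bus must carry bytes_copied reads plus bytes_copied writes.
best_time = 2 * bytes_copied / bus_bandwidth

copy_throughput = bytes_copied / best_time   # rate at which data actually moves
reported = 2 * bytes_copied / best_time      # read+write accounting

print(copy_throughput / 1e9)  # 43.2 -> half the bus bandwidth, Eric's limit
print(reported / 1e9)         # 86.4 -> the quoted 70.7 GB/sec sits below this
```

On this reading, both posters are right: the copy can never move data faster than half the bus bandwidth, while a figure that counts bytes read plus bytes written can legitimately approach the full bus bandwidth.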