cudaMemcpy: Device to Device VERY SLOW

Hello,
I needed to use cudaMemcpy to copy from one device array to another… and it was incredibly slow, as shown in the Visual Profiler…
I checked this using the “bandwidthTest” SDK project:
Running on…
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1885.7

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 1949.6

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 0.4

&&&& Test PASSED

Press ENTER to exit…

If I understand correctly, I would be better off copying Device to Host and then Host to Device, instead of Device to Device?
Hu…

That is extremely abnormal behavior. I don’t have any idea what is wrong, but a GTX 260 should have much higher device to device bandwidth. Something more like 80 to 100 GB/sec is normal.
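To cross-check the SDK numbers independently, a device-to-device copy can be timed with CUDA events. What follows is only a minimal sketch, not the SDK's code: the `CHECK` macro and the helper `d2dBandwidthMBps` are names made up for this post, and the 32 MB size and 10 iterations are arbitrary choices. It counts the bytes twice (one read plus one write per copy), which I believe is how bandwidthTest reports its D2D figure, so treat that factor as an assumption.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: bandwidth in MB/s (1 MB = 2^20 bytes).
// Assumption: a D2D copy both reads and writes device memory,
// so the bytes are counted twice.
static double d2dBandwidthMBps(size_t bytes, int iterations, float elapsedMs)
{
    double movedMB = 2.0 * (double)bytes * iterations / (double)(1 << 20);
    return movedMB / (elapsedMs / 1000.0);
}

int main()
{
    const size_t bytes = 32u << 20;   // 33554432, same size as the quick-mode test
    const int iterations = 10;        // arbitrary repeat count

    unsigned char *src = NULL, *dst = NULL;
    CHECK(cudaMalloc((void **)&src, bytes));
    CHECK(cudaMalloc((void **)&dst, bytes));
    CHECK(cudaMemset(src, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Warm-up copy so one-time initialization is not included in the timing.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < iterations; ++i)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));   // wait until all copies have finished

    float elapsedMs = 0.0f;
    CHECK(cudaEventElapsedTime(&elapsedMs, start, stop));
    printf("D2D bandwidth: %.1f MB/s\n",
           d2dBandwidthMBps(bytes, iterations, elapsedMs));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(src));
    CHECK(cudaFree(dst));
    return 0;
}
```

If this also reports a fraction of a MB/s, the problem is in the build or the driver setup rather than in the SDK sample itself.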

I have a GTX 260 too, and it performs as the spec says it should.

Your D2D bandwidth is very strange.

What driver version are you running, and what memory frequency does the NVIDIA control panel report?

Here it is… Screenshot of the nVidia control panel.

It’s very strange, indeed… :/ I hope it’s not hardware related, since I put a lot of money into this card (I’m a student…).

Btw, my CUDA Toolkit and SDK version is 2.3 (downloaded it last week).

Any advice from nVidia staff would be welcome :)

Anyone ? I know this forum is not a helpdesk… but I’d like to continue writing my cellular automata based RNG… This copy operation is biasing all my optimization results… :(

Okay…
The bandwidthTest I ran was built in Debug mode (but not EMU mode!)… I tried again in Release mode… Here is the result:
Running on…
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2550.2

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2303.3

Quick Mode
Device to Device Bandwidth
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 92093.3

&&&& Test PASSED

Press ENTER to exit…

The question is… is this normal behaviour? Device to Device bandwidth should be the same in Debug and Release mode, as long as it’s not Emulation mode. But I don’t know… maybe it’s normal. Since I can’t test on other hardware, I can’t say.

I compiled the bandwidthTest source code from the SDK in Debug mode on a GTX 295, and the D2D bandwidth is the same as in Release mode.

In fact, I checked the CUDA configuration of the .cu file in my VS2008 solution… bandwidthTest.cu was being compiled in Emulation mode, even though the project was set to device mode.
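For anyone who runs into the same thing: a per-file setting like this can be caught at run time with a small guard. This is just a sketch, assuming a toolkit of this era where nvcc defines the `__DEVICE_EMULATION__` macro when a file is compiled with `-deviceemu`. The function name `builtForEmulation` is made up for illustration.

```cuda
#include <cstdio>

// Hypothetical helper: reports whether THIS .cu file was compiled for
// device emulation. Assumes nvcc defines __DEVICE_EMULATION__ under
// -deviceemu; on a normal device build the macro is absent.
static bool builtForEmulation()
{
#ifdef __DEVICE_EMULATION__
    return true;
#else
    return false;
#endif
}

int main()
{
    if (builtForEmulation())
        printf("WARNING: compiled in emulation mode, timings are meaningless.\n");
    else
        printf("Compiled for the real device.\n");
    return 0;
}
```

Since the check is done by the preprocessor, it reflects the settings of the individual .cu file rather than the project-wide configuration, which is exactly where the mismatch was hiding here.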

I feel kind of stupid :/

Anyway thank you everyone for your support :)