Hello,
I needed to use cudaMemcpy to copy from one device array to another… and it was incredibly slow, as shown in the Visual Profiler…
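Roughly what I'm doing, reduced to a sketch (the names and the 32 MB size are just placeholders, not my actual code):

#include <cuda_runtime.h>

int main()
{
    const size_t N = 32 * 1024 * 1024;              // 32 MB in bytes, same size as the bandwidthTest transfer
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, N);
    cudaMalloc((void**)&d_dst, N);

    // The device-to-device copy that shows up as slow in the profiler
    cudaMemcpy(d_dst, d_src, N, cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();                        // D2D copies can return before completing

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}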
I checked this using the "bandwidthTest" SDK project:
Running on…
device 0: GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1885.7

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1949.6

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                0.4

&&&& Test PASSED
Press ENTER to exit…
If I understand correctly, would I be better off copying Device to Host and then Host to Device, instead of Device to Device?
Huh…
That is extremely abnormal behavior. I have no idea what is wrong, but a GTX 260 should have much higher device-to-device bandwidth; something more like 80 to 100 GB/s is normal.
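If you want a second opinion besides bandwidthTest, here is a rough sketch of timing the copy yourself with CUDA events (the 2x factor mirrors how bandwidthTest counts a read plus a write for device-to-device; the size is arbitrary):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;          // 32 MB, matching the test above
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, bytes);
    cudaMalloc((void**)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                     // wait for the copy to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per byte, hence the factor of 2
    double gbps = 2.0 * bytes / (ms / 1000.0) / 1.0e9;
    printf("Device to device: %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}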
Anyone? I know this forum is not a helpdesk… but I'd like to get back to writing my cellular-automaton-based RNG, and this copy operation is biasing all my optimization results… :(
Okay…
The bandwidthTest I ran was built in Debug mode (but not EMU mode!)… I tried Release mode instead; here is the result:
Running on…
device 0: GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                2550.2

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                2303.3

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                92093.3

&&&& Test PASSED
Press ENTER to exit…
The question is: is this normal behavior? Device to Device bandwidth should be the same in Debug and Release mode, as long as it's not Emulation mode. But I don't know… maybe it's normal. Since I can't test on other hardware, I can't say.
In fact, I checked the CUDA configuration of the .cu file in my VS2008 solution… bandwidthTest.cu was being compiled in Emulation mode, even though the project itself was set to device mode.
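For anyone who hits the same thing: one way to make the build mode unmissable is to check the __DEVICE_EMULATION__ macro, which nvcc defines when compiling with -deviceemu. A trivial sanity-check sketch:

#include <cstdio>

int main()
{
#ifdef __DEVICE_EMULATION__
    // nvcc defines this macro only when building with -deviceemu
    printf("Built in device emulation mode -- timings will be meaningless\n");
#else
    printf("Built for the device\n");
#endif
    return 0;
}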