Hello,
I needed to use cudaMemcpy to copy from one device array to another… and it was incredibly slow, as shown in the Visual Profiler…
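Roughly what I'm doing, reduced to a sketch (the names and the 32 MB size are just placeholders, not my actual code):

#include <cuda_runtime.h>

int main()
{
    const size_t N = 32 * 1024 * 1024;              // 32 MB in bytes, same size as the bandwidthTest transfer
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, N);
    cudaMalloc((void**)&d_dst, N);

    // The device-to-device copy that shows up as slow in the profiler
    cudaMemcpy(d_dst, d_src, N, cudaMemcpyDeviceToDevice);
    cudaThreadSynchronize();                        // D2D copies can return before completing

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}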
I checked this using the "bandwidthTest" SDK project:
Running on…
device 0: GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1885.7

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                1949.6

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                0.4

&&&& Test PASSED
Press ENTER to exit…
If I understand correctly, would I be better off copying Device to Host and then Host to Device, instead of Device to Device?
Huh…
That is extremely abnormal behavior. I have no idea what is wrong, but a GTX 260 should have much higher device-to-device bandwidth; something more like 80 to 100 GB/s is normal.
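If you want a second opinion besides bandwidthTest, here is a rough sketch of timing the copy yourself with CUDA events (the 2x factor mirrors how bandwidthTest counts a read plus a write for device-to-device; the size is arbitrary):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;          // 32 MB, matching the test above
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, bytes);
    cudaMalloc((void**)&d_dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, d_src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                     // wait for the copy to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One read plus one write per byte, hence the factor of 2
    double gbps = 2.0 * bytes / (ms / 1000.0) / 1.0e9;
    printf("Device to device: %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}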
Anyone? I know this forum is not a helpdesk… but I'd like to get back to writing my cellular-automaton-based RNG, and this copy operation is biasing all my optimization results… :(
Okay…
The bandwidthTest I ran was built in Debug mode (but not EMU mode!)… I tried Release mode instead; here is the result:
Running on…
device 0: GeForce GTX 260

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                2550.2

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                2303.3

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth (MB/s)
33554432                92093.3

&&&& Test PASSED
Press ENTER to exit…
The question is: is this normal behavior? Device to Device bandwidth should be the same in Debug and Release mode, as long as it's not Emulation mode. But I don't know… maybe it's normal. Since I can't test on other hardware, I can't say.
In fact, I checked the CUDA configuration of the .cu file in my VS2008 solution… bandwidthTest.cu was being compiled in Emulation mode, even though the project itself was set to device mode.
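For anyone who hits the same thing: one way to make the build mode unmissable is to check the __DEVICE_EMULATION__ macro, which nvcc defines when compiling with -deviceemu. A trivial sanity-check sketch:

#include <cstdio>

int main()
{
#ifdef __DEVICE_EMULATION__
    // nvcc defines this macro only when building with -deviceemu
    printf("Built in device emulation mode -- timings will be meaningless\n");
#else
    printf("Built for the device\n");
#endif
    return 0;
}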