Bandwidth is too low, so cudaMemcpy() takes too long

Like others in threads on this site about memory copies between host and device, I have this issue too. My application spends too much time in cudaMemcpy() calls, which I cannot avoid. To find out what the cost is, I ran the CUDA 5.0 SDK sample project bandwidthTest and got the bandwidth shown below. It looks very slow on my new GTX 660 Ti card, not even 1 GB/s in either direction. I expected it to be faster. Can anyone explain why?


Device 0: GeForce GTX 660 Ti
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 738.1

Device to Host Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 809.6

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 109379.9

I don’t know how fast (in theory) it should go, but you could check 2 things:

  1. Try finding the peak bandwidth for the 660 Ti. Since it’s not a high-performance card, this may simply be all it can do.
  2. How fast is your interface bus? And your RAM? Remember that when you do a memcpy, the data goes RAM chip -> RAM bus -> PCI bus -> card chip, and vice versa (in other words, the result doesn’t depend only on the card itself).

That bandwidth is abnormally slow for any card. If you have a PCI-Express 2.0 motherboard, the host-to-device and device-to-host bandwidth for pinned memory should be in the range of 4-6 GB/sec.

Yes, that is what I expected: at least 1 GB/s for this card, which is not a bad card. I saw in other threads that a GTX 4xx card can reach 5 GB/s. I just tested pageable memory copies, and they are only a little slower. Something is definitely wrong, and I need to find out what.

This card supports PCI Express 3.0. How do I find the motherboard’s PCI Express version?

I don’t know of a generic way to check the PCI Express version of the motherboard. If your computer shows the model number of your motherboard at boot, you might be able to look up the specifications that way.

Thanks. I checked the BIOS, but it doesn’t show the exact PCIe version. I then found the computer/motherboard spec on the web (it is a Dell Optiplex 980) and opened the case to make sure the GTX 660 Ti is seated in the right PCIe x16 slot. From what I can tell, both the graphics card and the motherboard support at least PCI Express 2.0 (3.0 needs a special configuration, as noted for the graphics card, which I did not do). What else can I do about this? Could the BIOS be involved, and does it need to be upgraded?

What operating system is this? It sounds very peculiar that pageable memory is fine but pinned memory is slow…
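To compare pinned and pageable transfers directly, outside of bandwidthTest, a small timing program can help. The following is only a minimal sketch, assuming a single GPU and a 32 MiB buffer like the one in the first post; the helper name `measure_h2d_mbps` and the `CHECK` macro are made up for illustration, and 1 MB is taken as 2^20 bytes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Hypothetical helper: time `reps` host-to-device copies of `bytes` bytes
// from `host_src` and return the average rate in MB/s (1 MB = 2^20 bytes).
double measure_h2d_mbps(const void* host_src, size_t bytes, int reps) {
    void* dev = nullptr;
    cudaEvent_t start, stop;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed copy so that one-time setup cost is not measured.
    CHECK(cudaMemcpy(dev, host_src, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(dev, host_src, bytes, cudaMemcpyHostToDevice));
    }
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(dev));

    const double megabytes = double(bytes) * reps / double(1 << 20);
    return megabytes / (ms / 1000.0);
}

int main() {
    const size_t bytes = size_t(32) << 20;  // 33554432 bytes

    // Pageable host memory from the normal C allocator.
    void* pageable = std::malloc(bytes);
    if (pageable == nullptr) return EXIT_FAILURE;
    std::memset(pageable, 0, bytes);

    // Pinned (page-locked) host memory from the CUDA runtime.
    void* pinned = nullptr;
    CHECK(cudaMallocHost(&pinned, bytes));
    std::memset(pinned, 0, bytes);

    std::printf("pageable H2D: %.1f MB/s\n", measure_h2d_mbps(pageable, bytes, 10));
    std::printf("pinned   H2D: %.1f MB/s\n", measure_h2d_mbps(pinned, bytes, 10));

    CHECK(cudaFreeHost(pinned));
    std::free(pageable);
    return 0;
}
```

On a healthy link the pinned figure would normally be the higher of the two; if both are capped at a similar low value, the bus itself is the more likely suspect.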

You will be able to check the connectivity of the card with GPUZ.

Note that the power saving features of the card/mobo (don’t know which dictates this behavior) will affect what is displayed in the “Bus interface” label.
To make sure the card is awake, you can run a CUDA application while GPUZ is running.

Yes, I ran the GPUZ utility and the Bus Interface readout is “PCI-E 1.1x16@x4.11”. Does this mean the card is currently running at version 1.1 instead of 2.0? The card’s spec says it supports PCIe 2.0 and 3.0.
Even with this Bus Interface, GPUZ shows a bandwidth of “144.2 GB/s”. Is this the real bandwidth? I also ran its Render Test, and afterwards the Bus Interface reading had not changed. Could you explain?

I double-checked the card and did not find anywhere to disable power saving features, if there are any. By the way, I updated the BIOS and it didn’t fix the problem.


The 144 GB/s figure is the card’s own memory bandwidth, i.e. for device-to-device transfers. Your first post indicates that you get 109 GB/s device-to-device, which sounds about right.

As for GPUZ, it seems to be saying that you are running at PCIe 1.1 with 4 lanes (x4). What you should be seeing is PCI-E 2.0 x16 @ x16 2.0.

One gotcha with the render test button in GPUZ is that it runs on the graphics card attached to the display (as far as I can tell from my local setup), so if you are running CUDA on a card without a monitor attached, you will need to manually start a CUDA application on that device to wake it up.

Other than that, I didn’t code GPUZ so I can’t vouch that it’s telling the truth, but that combined with your low host-to-device test result seems to indicate that you are indeed running at x4.
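The arithmetic behind that conclusion can be sketched in a few lines. This is a rough illustration only: it uses the published per-lane signalling rates and line encodings for PCIe generations 1 to 3, ignores packet and protocol overhead (so real transfers land below these numbers), and the function name `pcie_peak_mbps` is made up for this example. It is host-only code, so it also builds as plain C++.

```cuda
#include <cstdio>

// Theoretical one-direction PCIe data rate in MB/s (1 MB = 10^6 bytes),
// before packet/protocol overhead. `gen` is the PCIe generation (1, 2 or 3),
// `lanes` is the negotiated link width (e.g. 4 for x4, 16 for x16).
double pcie_peak_mbps(int gen, int lanes) {
    double gtps;      // raw transfer rate per lane in GT/s
    double encoding;  // fraction of transferred bits that carry data
    switch (gen) {
        case 1: gtps = 2.5; encoding = 8.0 / 10.0;    break;  // 8b/10b
        case 2: gtps = 5.0; encoding = 8.0 / 10.0;    break;  // 8b/10b
        case 3: gtps = 8.0; encoding = 128.0 / 130.0; break;  // 128b/130b
        default: return 0.0;  // other generations are not modeled here
    }
    // GT/s -> Mbit/s, keep only data bits, convert bits to bytes, scale by lanes.
    return gtps * 1000.0 * encoding / 8.0 * lanes;
}

int main() {
    std::printf("PCIe 1.1 x4 : %.0f MB/s\n", pcie_peak_mbps(1, 4));
    std::printf("PCIe 2.0 x16: %.0f MB/s\n", pcie_peak_mbps(2, 16));
    std::printf("PCIe 3.0 x16: %.0f MB/s\n", pcie_peak_mbps(3, 16));
    return 0;
}
```

A generation 1 link at x4 tops out at about 1000 MB/s in theory, so measured values of 738 to 810 MB/s are what such a link would plausibly deliver, while a 2.0 x16 link has a theoretical ceiling of 8000 MB/s.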

Yes, I confirmed that my GPU card is sitting in a PCIe slot that is “wired as x4”, the only slot on the motherboard that is physically big enough for the card! That means the bandwidth is not the full x16 speed that PCIe v2.0 is supposed to give. I also confirmed this with the card vendor. I need to use a different slot or computer. Thanks!

I suspect this is not possible, but I’ll ask anyway: is there a way to allocate or declare device memory that stays on the device between kernel calls, so that I don’t need to copy memory from host to device each time I execute the kernel?

Even with PCI Express 2.0 or 3.0, transferring data from CPU to GPU with cudaMemcpy()-type functions still takes a long time.

Not really sure what you’re saying here, but the memory that you allocate with cudaMalloc has the lifetime of the CUDA context. So as long as your context isn’t destroyed, the memory and its contents are available to all kernel calls in the context.

I understand that device memory allocated by cudaMalloc() has that lifetime. But each time I call the kernel function, say kernelexecute(), I need to use cudaMemcpy() to copy to and from host memory. Since kernelexecute() is called very frequently, this costs too much time on the PCIe bus. I wonder if there is another way to hold the latest data in device memory or in kernel scope from the last kernelexecute() call so that it can be reused next time… I guess this is not how it works…

That’s what I’m saying. What has been written to global memory by kernel k1 is available to kernel k2 (or to a subsequent call to k1). No need to involve the host in that at all, unless you need the host to modify your data between the calls to k1 and k2: k1 -> copy to host -> host does something on data -> copy to device -> k2.

However, if only k1 and k2 work on the data, you can do:
host -> send initial data to device -> k1 -> k2 -> … -> k1 -> k2 -> copy results to host.
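The sequence above could look roughly like the following. This is a minimal sketch, not the original poster’s code: the kernels `k1` and `k2` are trivial placeholders, and `run_pipeline` and `CHECK` are names invented for this example. The point is only that the buffer from cudaMalloc() is copied in once, reused by every launch, and copied out once.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            std::fprintf(stderr, "%s failed: %s\n", #call,         \
                         cudaGetErrorString(err_));                \
            std::exit(EXIT_FAILURE);                               \
        }                                                          \
    } while (0)

// Placeholder kernels: each one updates the data in global memory in place.
__global__ void k1(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void k2(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

// Copy `host` to the device once, run k1 -> k2 `iters` times on the
// device-resident buffer, then copy the result back once.
// Assumes `host` is not empty.
void run_pipeline(std::vector<float>& host, int iters) {
    const int n = static_cast<int>(host.size());
    const size_t bytes = host.size() * sizeof(float);

    float* dev = nullptr;
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iters; ++it) {
        // No host copies here: k2 sees what k1 wrote, and the next k1
        // sees what k2 wrote.
        k1<<<blocks, threads>>>(dev, n);
        k2<<<blocks, threads>>>(dev, n);
    }
    CHECK(cudaGetLastError());

    // cudaMemcpy on the default stream waits for the kernels to finish.
    CHECK(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev));
}

int main() {
    std::vector<float> host(1 << 20, 0.0f);
    run_pipeline(host, 10);
    std::printf("host[0] = %f\n", host[0]);
    return 0;
}
```

With this structure the PCIe bus is crossed twice in total rather than twice per kernel call, so the cost of the link matters far less as the number of iterations grows.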