Bandwidth is too slow so cudaMemcpy() takes too long.

CCuda · November 27, 2012, 2:50pm

As I found in other threads on this site about memory copies between host and device, I have this issue too. My code spends too much time in cudaMemcpy() calls that it cannot avoid. To see what was costing so much, I ran the CUDA 5.0 SDK sample project bandwidthTest and got the bandwidth figures below. They look very slow for my new GTX 660 Ti card: not even 1 GB/s in either direction. I expected much faster. Can anyone explain why?

Chester

Device 0: GeForce GTX 660Ti
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 738.1

Device to Host Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 809.6

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 109379.9

zarnick · November 27, 2012, 5:16pm

I don’t know how fast (in theory) it should go, but you could check 2 things:

Try finding the peak transfer rate for the 660 Ti. It’s not a high-performance card, so this may simply be all it can do.
How fast is your interface bus? And your RAM? Remember that when you do a memcpy, the data goes RAM chip->RAM bus->PCI bus->card chip and vice versa (in other words, you can’t rely only on the card itself).

seibert · November 27, 2012, 5:27pm

That bandwidth is abnormally slow for any card. If you have a PCI-Express 2.0 motherboard, the host-to-device and device-to-host bandwidth for pinned memory should be in the range of 4-6 GB/sec.

CCuda · November 27, 2012, 8:03pm

Yes, that is what I expected: at least 1 GB/s for this card, which is not a bad card. I have read in other threads that a GTX 4xx card can reach 5 GB/s. I just tested pageable memory copies, and they are only a little slower. Something is definitely wrong, and I need to find out what.

CCuda · November 27, 2012, 8:07pm

This card supports PCI Express 3.0. How do I find the motherboard’s PCIe version?

seibert · November 28, 2012, 1:58am

I don’t know of a generic way to check the PCI Express version of the motherboard. If your computer shows the model number of your motherboard at boot, you might be able to look up the specifications that way.

CCuda · November 28, 2012, 1:14pm

Thanks. I checked the BIOS, but it doesn’t show the exact PCIe version. I then found the computer/motherboard spec on the web (it is a Dell Optiplex 980) and opened the case to make sure the GTX 660 Ti sits in the right PCIe x16 slot. From what I can tell, both the graphics card and the motherboard support PCI Express 2.0 at least (3.0 needs special configuration, as the graphics card documentation mentions, which I did not do). What else can I do about this? Could the BIOS be involved and need an upgrade?

seibert · November 28, 2012, 3:22pm

What operating system is this? It sounds very peculiar that pageable memory is fine but pinned memory is slow…

Ailleur · November 28, 2012, 3:56pm

You will be able to check the connectivity of the card with GPUZ
http://www.techpowerup.com/gpuz/

Note that the power saving features of the card/mobo (don’t know which dictates this behavior) will affect what is displayed in the “Bus interface” label.
To make sure the card is awake, you can run a cuda application while gpuz is running.

CCuda · November 28, 2012, 5:53pm

Yes, I ran the GPUZ utility and the Bus Interface readout is “PCI-E 1.1x16@x4.11”. Does that mean it is currently running at version 1.1 instead of 2.0? The card’s spec says it supports PCIe 2.0 and 3.0.
Even with this Bus Interface, GPUZ shows a bandwidth of “144.2 GB/s”. Is this the real bandwidth? I also ran its Render Test, and the Bus Interface reading didn’t change afterwards. Could you explain?

I double-checked the card and did not find any way to disable power-saving features, if it has any. By the way, I updated the BIOS and that didn’t fix the problem.

Thanks,
Chester

Ailleur · November 28, 2012, 8:56pm

The 144 GB/s figure is for device-to-device transfers. Your first post indicates that you get 109 GB/s device-to-device, which sounds about right.

As for GPUZ, it seems to be saying that you are running PCIe 1.1 at x4. What you should be seeing is PCI-E 2.0 x16 @ x16 2.0.

One gotcha for the render test button in GPUZ is that it runs on the graphics card attached to the display (as far as I can tell from my local setup), so if you are running cuda on a card without a monitor attached, you will need to manually start a cuda application on that device to wake it up.

Other than that, I didn’t write GPUZ so I can’t vouch that it’s telling the truth, but that combined with your slow host-to-device test seems to indicate that you are indeed running at x4.

CCuda · December 3, 2012, 1:35pm

Yes, I confirmed that my GPU card was sitting in a “wired as x4” PCIe slot, the only slot on the motherboard that is physically large enough for the card! That means the bandwidth is not the full x16 speed that PCIe 2.0 is supposed to give. I also confirmed this with the card vendor. I need to use a different slot or computer. Thanks!

CCuda · December 5, 2012, 7:06pm

I suspect this is not possible, but I’ll ask anyway: is there a way to allocate or declare device memory that stays on the device, so I don’t need to copy memory from host to device each time I execute the kernel?

Even with PCI Express 2.0 or 3.0, transferring data from CPU to GPU with cudaMemcpy()-type functions still takes a long time.

Ailleur · December 5, 2012, 7:13pm

Not really sure what you’re saying here, but memory that you allocate with cudaMalloc has the lifetime of the CUDA context. So as long as your context isn’t destroyed, the memory and its contents are available to all kernel calls in the context.

CCuda · December 12, 2012, 7:40pm

I understand that device memory allocated by cudaMalloc() has that lifetime. But each time I call the kernel function, say kernelexecute(), I need to use cudaMemcpy() to copy from and to host memory. Because kernelexecute() is called so frequently, this costs too much time on the PCIe bus. I wonder if there is another way to hold the latest data in device memory or in kernel scope from the last kernelexecute() call so it can be reused next time… I guess this is not how it works…

Ailleur · December 12, 2012, 7:50pm

That’s what I’m saying. What has been written to global memory by kernel k1 is available to kernel k2 (or a subsequent call to k1). No need to involve the host in that at all, unless between the calls to k1 and k2 you need the host to modify your data: k1 → copy to host → host does something on data → copy to device → k2.

However, if only k1 and k2 work on the data, you can do:
host → send initial data to device → k1 → k2 → … → k1 → k2 → copy results to host.
