CUDA memcpy needs over 12 ms for 16 MB

Hello @ all

I have an array with 4 million elements. Every element is 4 bytes, so the array is 16 MB in total. I copy the array into GPU memory with cudaMemcpy(), and this takes over 12 ms. Then I run an FFT and copy the results back to PC memory.
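For reference, the pipeline described above (copy 16 MB of 4-byte elements to the GPU, run an FFT, copy back) looks roughly like this with the CUDA runtime and cuFFT. This is a sketch, assuming the elements are floats and a 1D real-to-complex transform; buffer names are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdlib.h>

#define N (4 * 1024 * 1024)   /* 4M floats * 4 bytes = 16 MB */

int main(void)
{
    float *h_in = (float *)malloc(N * sizeof(float));  /* pageable host buffer */
    /* ... fill h_in with signal data here ... */

    float *d_in;
    cufftComplex *d_out;
    cudaMalloc((void **)&d_in,  N * sizeof(float));
    cudaMalloc((void **)&d_out, (N / 2 + 1) * sizeof(cufftComplex));

    /* this host->device copy is the ~12 ms transfer in question */
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_R2C, 1);   /* 1D real-to-complex FFT */
    cufftExecR2C(plan, d_in, d_out);

    /* copy the result back to host memory */
    cufftComplex *h_out =
        (cufftComplex *)malloc((N / 2 + 1) * sizeof(cufftComplex));
    cudaMemcpy(h_out, d_out, (N / 2 + 1) * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```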

My PC has DDR2-533 memory. How can I accelerate the copy process?

16 MB in 12 ms is 1.3 GB/sec. That is a typical speed for moving data over the PCI-Express bus with pageable memory. Are you using cudaMallocHost() to allocate the host buffer? That will probably double the copy rate.

Also, does your motherboard support PCI-Express 2.0 or just 1.0? PCI-Express 2.0 also nearly doubles the transfer rate again if you are using cudaMallocHost(). (Assuming you have a GPU later than the 8800 GTX/GTS series.)

I have a 9800 GTX+. The GPU has PCIe 2.0, but my mainboard has only PCIe 1.0, with a PCIe x16 slot. I think a 16-lane PCIe 1.0 link has 40 GB/s bandwidth, since every lane has 2.5 GB/s.

No, I create the array without cudaMallocHost(). I use only the malloc() function, and cudaMalloc() for the GPU.

Is it better to use another memcpy function?

Like many transmission rates, that’s measured in gigabits per second, not gigabytes.

Each PCIe 1.0 lane has 2 gigabit/s of usable bandwidth (2.5 Gbit/s raw, minus 8b/10b encoding overhead), which is 250 megabytes/second.

16 lanes give a maximum bandwidth of 4 gigabytes/second. In practice you can see anywhere from 1–3 GB/s depending on your motherboard and OS.

I use Windows XP 32-bit, CUDA 2.1 Beta, and NVIDIA driver 180.60.

I have tested my bandwidth with NVIDIA's bandwidthTest.exe. My result is 1400 MB/s → 11.2 Gbit/s.

Some other users have a GTX 260 GPU; its bandwidth is 90 GB/s. Can I get more bandwidth with a 64-bit operating system?

The 90 GB/s is the device bandwidth… how fast you can move data from the GPU’s memory into the GPU core.

What you’re interested in is the host->device bandwidth, which is limited by PCIe.

There’s an example project called bandwidthTest in the CUDA SDK; it measures the bandwidths for you.

For example, my laptop gives the following results:

A 64 bit OS will not boost your bandwidth, it’s mostly limited by PCIe.

Using page-locked memory on the host prevents most host-side memory speed issues.

For PCIe 2.0 x16 with a modern G200 card, the host’s own RAM speed can sometimes be a bottleneck even for page-locked RAM, but that’s rare.

Use cudaMallocHost() to create the host buffer. Then cudaMemcpy() will probably run twice as fast.
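A minimal sketch of the comparison, assuming the 16 MB buffer from the original post; the cudaEvent timing scaffolding is illustrative:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define BYTES (16 * 1024 * 1024)   /* 16 MB, as in the original post */

int main(void)
{
    void *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, BYTES);

    h_pageable = malloc(BYTES);        /* pageable: driver must stage the copy */
    cudaMallocHost(&h_pinned, BYTES);  /* page-locked: card can DMA directly */

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time the pageable copy -- expect roughly the ~12 ms seen above */
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_pageable, BYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms\n", ms);

    /* time the page-locked copy -- typically about twice as fast */
    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_pinned, BYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms\n", ms);

    cudaFreeHost(h_pinned);
    free(h_pageable);
    cudaFree(d_buf);
    return 0;
}
```

Note that page-locked memory is a scarce resource: allocating too much of it with cudaMallocHost() can degrade overall system performance, so pin only the buffers you actually transfer.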

I will test it with cudaMallocHost().

Why is it faster with cudaMallocHost()?

Because you get a page-locked region with cudaMallocHost(), so the card can DMA the data over immediately, instead of the driver first doing a CPU-side memcpy into a page-locked staging buffer and then DMAing from there.

OK, I will test it on Monday.