CUDA memcpy needs over 12 ms for 16 MB

Hello all,

I have an array with 4 million elements. Each element is 4 bytes, so the array is 16 MB in size. I copy the array to GPU memory with cudaMemcpy(), and this takes over 12 ms. Then I run an FFT and copy the results back to PC memory.

My PC has DDR2-533 memory. How can I accelerate the copy process?

16 MB in 12 ms is 1.3 GB/sec. That is a typical speed for moving data over the PCI-Express bus with pageable memory. Are you using cudaMallocHost() to allocate the host buffer? That will probably double the copy rate.
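A minimal sketch of that comparison, assuming a 16 MB float buffer like yours (the sizes are illustrative and error checking is omitted for brevity):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 4 * 1024 * 1024;      /* 4M elements * 4 bytes = 16 MB */
    const size_t bytes = N * sizeof(float);

    float *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = (float *)malloc(bytes);   /* ordinary pageable memory */
    cudaMallocHost(&h_pinned, bytes);      /* page-locked (pinned) memory */
    memset(h_pageable, 0, bytes);          /* touch pages so they are mapped */

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time a host->device copy from the pageable buffer */
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f ms (%.2f GB/s)\n", ms, bytes / ms / 1e6);

    /* time a host->device copy from the pinned buffer */
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f ms (%.2f GB/s)\n", ms, bytes / ms / 1e6);

    cudaFree(d_buf);
    free(h_pageable);
    cudaFreeHost(h_pinned);
    return 0;
}
```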

Also, does your motherboard support PCI-Express 2.0 or just 1.0? PCI-Express 2.0 nearly doubles the transfer rate again if you are using cudaMallocHost(). (Assuming you have a GPU newer than the 8800 GTX/GTS series.)

I have a 9800 GTX+. The GPU supports PCIe 2.0, but my mainboard has only PCIe 1.0, with a PCIe x16 slot. I think a 16-lane PCIe 1.0 link has 40 GB/s of bandwidth; every lane has 2.5 GB/s of bandwidth.

No, I create the array without cudaMallocHost(); I use only plain malloc() on the host and cudaMalloc() for the GPU.

Is it better to use another memcpy function?

Like many transmission rates, that’s measured in gigabits per second, not gigabytes.

Each PCIe 1.0 lane carries 2.5 gigabit/s raw, which after 8b/10b encoding leaves 2 gigabit/s of payload, i.e. 250 megabytes/second.

16 lanes give a maximum bandwidth of 4 gigabytes/second. In practice you can see anywhere from 1 to 3 GB/s depending on your motherboard and OS.

I use Windows XP 32-bit, CUDA 2.1 Beta, and NVIDIA driver 180.60.

I have tested my bandwidth with NVIDIA's bandwidthTest.exe. My result is 1400 MB/s, i.e. about 11.2 Gbit/s.

Some other users have a GTX 260 GPU; its bandwidth is 90 GB/s. Can I get more bandwidth with a 64-bit operating system?

The 90 GB/s is the device bandwidth… how fast you can move data from the GPU’s memory into the GPU core.

What you’re interested in is the host->device bandwidth, which is limited by PCIe.

There’s an example project called bandwidthTest in the CUDA SDK; it measures the bandwidths for you.
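For instance, you can compare pageable and pinned host->device rates directly with options like these (flag names as in the SDK versions I’ve used; an older SDK may differ):

```
bandwidthTest --memory=pageable --htod
bandwidthTest --memory=pinned --htod
```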

For example, my laptop gives the following results:

A 64-bit OS will not boost your bandwidth; it’s mostly limited by PCIe.

Using page-locked memory on the host prevents most host-side memory speed issues.

For PCIe 2.0 x16 with a modern G200 card, sometimes the host’s own RAM speed can be a bottleneck even for page-locked RAM, but that’s rare.

Use cudaMallocHost() to create the host buffer. Then cudaMemcpy() will probably run twice as fast.
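A sketch of how that change fits the FFT workflow you described, assuming 4M single-precision samples and a 1D real-to-complex CUFFT plan (the plan type is my assumption, not something you stated). Compile with nvcc and link -lcufft; error checking omitted:

```c
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int N = 4 * 1024 * 1024;   /* 4M floats = 16 MB */

    float *h_in;
    cufftComplex *h_out;
    cudaMallocHost(&h_in, N * sizeof(float));   /* pinned, not malloc() */
    cudaMallocHost(&h_out, (N / 2 + 1) * sizeof(cufftComplex));

    float *d_in;
    cufftComplex *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, (N / 2 + 1) * sizeof(cufftComplex));

    /* ... fill h_in with your samples ... */

    /* a pinned source lets this copy run at full PCIe DMA speed */
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_R2C, 1);
    cufftExecR2C(plan, d_in, d_out);

    cudaMemcpy(h_out, d_out, (N / 2 + 1) * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```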

I will test it with cudaMallocHost().

Why is it faster with cudaMallocHost()?

Because you get a page-locked region with cudaMallocHost(), the card can DMA the data immediately, instead of the driver first doing a CPU-side memcpy into an internal page-locked staging buffer and then DMAing from there.
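To make that concrete, here is a runnable sketch of what the driver effectively has to do for a pageable source. The chunk size is made up for illustration; the driver’s real staging strategy is internal:

```c
#include <string.h>
#include <cuda_runtime.h>

/* Simulate the pageable host->device path: CPU memcpy into a pinned
 * staging buffer, then DMA from the staging buffer to the device.
 * With a cudaMallocHost() source the CPU memcpy step disappears. */
void pageable_copy_htod(void *d_dst, const void *h_src_pageable, size_t bytes)
{
    const size_t CHUNK = 1 << 20;   /* hypothetical 1 MB staging size */
    char *staging;
    cudaMallocHost(&staging, CHUNK);

    for (size_t off = 0; off < bytes; off += CHUNK) {
        size_t n = bytes - off < CHUNK ? bytes - off : CHUNK;
        memcpy(staging, (const char *)h_src_pageable + off, n);   /* CPU copy */
        cudaMemcpy((char *)d_dst + off, staging, n,
                   cudaMemcpyHostToDevice);                       /* DMA */
    }
    cudaFreeHost(staging);
}
```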

OK, I will test it on Monday.