I have an array with 4 million elements. Each element is 4 bytes, so the array is 16 MB. I copy the array to GPU memory with the cudaMemcpy function. This takes over 12 ms. Then I run an FFT and copy the results back to PC memory.
My PC has DDR2 533 MHz memory. How can I speed up the copy?
16 MB in 12 ms is 1.3 GB/sec. That is a typical speed for moving data over the PCI-Express bus with pageable memory. Are you using cudaMallocHost() to allocate the host buffer? That will probably double the copy rate.
Also, does your motherboard support PCI-Express 2.0 or just 1.0? PCI-Express 2.0 also nearly doubles the transfer rate again if you are using cudaMallocHost(). (Assuming you have a GPU later than the 8800 GTX/GTS series.)
I have a 9800 GTX+. The GPU supports PCIe 2.0, but my mainboard only supports PCIe 1.0. The slot on my mainboard is PCIe x16. I think a 16-lane PCIe 1.0 link has a raw rate of 40 Gbit/s, since every lane runs at 2.5 Gbit/s.
No, I create the array without cudaMallocHost(). I only use the malloc function for the buffer that gets copied to the GPU.
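For comparison with the measured 1.3 GB/sec, here is a rough worked estimate of the PCIe 1.0 x16 ceiling. It assumes the standard PCIe 1.0 figures of 2.5 GT/s per lane and 8b/10b encoding, neither of which is spelled out in the thread:

```latex
\begin{align*}
\text{per lane:}\quad & 2.5\ \text{GT/s} \times \tfrac{8}{10} = 2\ \text{Gbit/s} = 250\ \text{MB/s per direction} \\
\text{x16 link:}\quad & 16 \times 250\ \text{MB/s} = 4\ \text{GB/s per direction}
\end{align*}
```

Real transfers stay below that 4 GB/s because of protocol overhead, so 1.3 GB/sec from pageable memory leaves room for improvement even on a PCIe 1.0 board.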
cudaMallocHost() is faster because it gives you a page-locked region, so the card can start a DMA transfer from it immediately. With ordinary malloc memory, the driver first has to do a CPU-side memcpy into a page-locked buffer and only then DMA from there.
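A minimal sketch of how the two allocation styles could be compared on your own machine. The float element type, the helper names (time_h2d, gb_per_s) and the CHECK macro are assumptions made for illustration and are not taken from the original code. The actual numbers will depend on the board and driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (illustrative helper).
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Convert a byte count and a duration in milliseconds to GB/s (1 GB = 1e9 bytes).
static double gb_per_s(size_t bytes, float ms) {
    return (double)bytes / ((double)ms * 1.0e6);
}

// Time one host-to-device cudaMemcpy with CUDA events; returns milliseconds.
static float time_h2d(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    float ms = 0.0f;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t n = 4 * 1000 * 1000;        // 4 million elements, as in the question
    const size_t bytes = n * sizeof(float);  // 4 bytes each -> 16 MB (float is assumed)

    float *d_buf = NULL;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Pageable host buffer: the driver stages it through a page-locked buffer.
    float *h_pageable = (float *)malloc(bytes);
    if (h_pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }

    // Page-locked (pinned) host buffer: the GPU can DMA from it directly.
    float *h_pinned = NULL;
    CHECK(cudaMallocHost((void **)&h_pinned, bytes));

    for (size_t i = 0; i < n; ++i) {
        h_pageable[i] = (float)i;
        h_pinned[i] = (float)i;
    }

    // Warm-up copy so one-time initialization cost is not measured.
    time_h2d(d_buf, h_pageable, bytes);

    float ms_pageable = time_h2d(d_buf, h_pageable, bytes);
    float ms_pinned = time_h2d(d_buf, h_pinned, bytes);

    printf("pageable: %.2f ms (%.2f GB/s)\n", ms_pageable, gb_per_s(bytes, ms_pageable));
    printf("pinned:   %.2f ms (%.2f GB/s)\n", ms_pinned, gb_per_s(bytes, ms_pinned));

    CHECK(cudaFreeHost(h_pinned));  // pinned memory must be released with cudaFreeHost
    free(h_pageable);
    CHECK(cudaFree(d_buf));
    return 0;
}
```

Built with something like nvcc -o copytest copytest.cu, this prints one line per buffer type. Page-locked memory is a limited resource, so it is probably best to pin only the buffers that actually get transferred.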