Has anyone successfully transferred data from the CPU to the GPU and back at sustained speeds greater than 500 MB/s? We are comparing pageable memory against page-locked (pinned) memory. We are doing an in-place real-to-complex FFT, and are getting median and mean speeds of around 370 - 410 MB/s for pageable memory and around 340 - 350 MB/s for pinned memory. We can send data in chunks from 1 MB to 712 MB.

Does anyone have working code examples or suggestions? We know pinned memory is supposed to be faster, but we cannot get it to run faster. We are not trying to use streams, because we do not want to write our own FFT.

One interesting thing we noticed is that the first trial in both tests is always faster than the rest. Does anyone know why this might be happening?

We are using cudaHostAlloc with cudaMemcpyAsync/cudaMemcpy for the pinned tests and malloc with cudaMemcpy for the pageable test. We see no significant difference (about ±5 MB/s) between cudaMemcpyAsync and cudaMemcpy.
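For reference, here is a minimal sketch of the kind of comparison described above, with the FFT left out so that only the transfers are timed. It is not our actual test code: the 64 MB buffer size, the trial count, the untimed warm-up copy, and the helper name roundTripMBps are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: times `trials` round trips (host -> device -> host)
// and returns the average rate in MB/s, counting bytes in both directions.
static double roundTripMBps(void* host, void* dev, size_t bytes, int trials)
{
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Untimed warm-up copy (an assumption of this sketch) so that
    // one-time setup costs do not land inside the measurement.
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < trials; ++i) {
        CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    double megabytes = 2.0 * trials * (double)bytes / (1024.0 * 1024.0);
    return megabytes / (ms / 1000.0);
}

int main()
{
    const size_t bytes = (size_t)64 << 20;  // 64 MB per transfer (assumed size)
    const int trials = 10;                  // assumed trial count

    void* dev = NULL;
    CHECK(cudaMalloc(&dev, bytes));

    // Pageable host buffer from malloc; memset touches every page.
    void* pageable = malloc(bytes);
    if (pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    memset(pageable, 0, bytes);

    // Page-locked (pinned) host buffer from cudaHostAlloc.
    void* pinned = NULL;
    CHECK(cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault));
    memset(pinned, 0, bytes);

    printf("pageable: %.1f MB/s\n", roundTripMBps(pageable, dev, bytes, trials));
    printf("pinned:   %.1f MB/s\n", roundTripMBps(pinned, dev, bytes, trials));

    CHECK(cudaFreeHost(pinned));
    free(pageable);
    CHECK(cudaFree(dev));
    return 0;
}
```

Something like this should build with nvcc (for example `nvcc -o bwtest bwtest.cu`, where the file name is arbitrary). Timing the copies separately from the FFT would at least show whether the gap between pageable and pinned comes from the transfers themselves or from somewhere else in the test.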