FFT Pinned and Paged Speed Testing Issue

Has anyone successfully transferred data from the CPU to the GPU and back at sustained speeds greater than 500 MBps? We are comparing paged (pageable) memory against page-locked (pinned) memory. We are doing an in-place real-to-complex FFT, and our median and mean speeds are around 370 - 410 MBps for paged and around 340 - 350 MBps for pinned. We can send data in chunks from 1 MB to 712 MB.

Does anyone have working code examples or suggestions? We know pinned is supposed to be faster, but we cannot get it to run faster. We are not trying to use streams, because we do not want to write our own FFT.

One interesting thing is that the first trial in both tests is always faster than the rest. Does anyone know why this might be happening?

We are using cudaHostAlloc and cudaMemcpyAsync/cudaMemcpy for the pinned tests, and malloc/cudaMemcpy for the paged test. We are not seeing a big difference (±5 MBps) between cudaMemcpyAsync and cudaMemcpy.

Trial #1 pinned speed: 472.574 MBps
Trial #2 pinned speed: 342.232 MBps
Trial #3 pinned speed: 350.664 MBps
Trial #4 pinned speed: 321.622 MBps
Trial #5 pinned speed: 350.829 MBps
Trial #6 pinned speed: 306.194 MBps
Trial #7 pinned speed: 385.707 MBps
Trial #8 pinned speed: 287.192 MBps
Trial #9 pinned speed: 337.747 MBps
Trial #10 pinned speed: 315.603 MBps

Trial #1 paged speed: 431.757 MBps
Trial #2 paged speed: 299.233 MBps
Trial #3 paged speed: 490.605 MBps
Trial #4 paged speed: 386.506 MBps
Trial #5 paged speed: 274.591 MBps
Trial #6 paged speed: 458.583 MBps
Trial #7 paged speed: 449.042 MBps
Trial #8 paged speed: 343.503 MBps
Trial #9 paged speed: 289.712 MBps
Trial #10 paged speed: 365.547 MBps

System Specs

OS: Linux Ubuntu 10.04
CPU: Xeon E5530 2.4GHz, 16 CPUs
GPU: NVIDIA GeForce GTX 480
12 GB Memory

What speed does the bandwidth test from the SDK achieve?
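
If the SDK sample is not handy, a rough stand-alone check in the same spirit might look like the sketch below. It leaves the FFT out entirely and only times host-to-device-and-back copies with CUDA events, once from a malloc buffer and once from a cudaHostAlloc buffer, so the transfer rate can be compared with the FFT numbers above. The 64 MiB buffer size, the repetition count, the helper names (to_mbps, round_trip_mbps) and the convention 1 MB = 10^6 bytes are all assumptions made for illustration; none of them come from the original post or from the SDK.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Hypothetical helper: bytes moved in `ms` milliseconds -> MB/s,
// assuming 1 MB = 1e6 bytes.
static double to_mbps(size_t bytes, float ms) {
    return (bytes / 1.0e6) / (ms / 1.0e3);
}

// Hypothetical helper: time `reps` host->device->host round trips of
// `bytes` bytes and return the combined transfer rate in MB/s.
static double round_trip_mbps(void *host, void *dev, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed copy so one-time setup cost is not counted.
    CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < reps; ++i) {
        CHECK(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice));
        CHECK(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    // Each repetition moves the buffer twice (up and back).
    return to_mbps(2 * bytes * (size_t)reps, ms);
}

int main() {
    const size_t bytes = (size_t)64 << 20;  // 64 MiB, arbitrary test size
    const int reps = 10;                    // arbitrary repetition count

    void *dev = NULL;
    void *pinned = NULL;
    void *paged = malloc(bytes);            // ordinary pageable host memory
    if (paged == NULL) {
        fprintf(stderr, "malloc failed\n");
        return EXIT_FAILURE;
    }
    CHECK(cudaMalloc(&dev, bytes));
    CHECK(cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault));  // page-locked

    // Touch both buffers so their pages are mapped before timing starts.
    memset(paged, 0, bytes);
    memset(pinned, 0, bytes);

    printf("paged  round trip: %.1f MBps\n",
           round_trip_mbps(paged, dev, bytes, reps));
    printf("pinned round trip: %.1f MBps\n",
           round_trip_mbps(pinned, dev, bytes, reps));

    CHECK(cudaFreeHost(pinned));
    CHECK(cudaFree(dev));
    free(paged);
    return 0;
}
```

Something like `nvcc bandwidth_sketch.cu -o bandwidth_sketch` should build it (the file name is arbitrary). If the copy-only rates come out far above the FFT trial figures, the transfer itself is probably not what limits the pinned test.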
