speed problem while transferring data from the GPU

I have CUDA code that I run on several platforms (Vista, XP, Linux), each with a different GPU.

I have a loop that iterates 10 times, and in each iteration I transfer the same amount of data back and forth between CPU and GPU (25 MB from CPU -> GPU and 25 MB from GPU -> CPU). Since I transfer the same amount of data each time, I expect the host->device and device->host copy times to be the same.

Host memory is pinned (always allocated with cudaMallocHost).
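For reference, the transfer loop looks roughly like this (a minimal sketch, not the original code; buffer names and the kernel placeholder are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 25 * 1024 * 1024;   // 25 MB per direction
    float *h_buf, *d_buf;

    // Pinned (page-locked) host memory, as described above
    cudaMallocHost((void**)&h_buf, bytes);
    cudaMalloc((void**)&d_buf, bytes);

    for (int i = 0; i < 10; ++i) {
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice); // 25 MB up
        // ... kernel launch goes here ...
        cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost); // 25 MB down
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```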

The Vista platform has a GeForce 8400M GS, the XP platform has a GeForce 8800 GTX, and the Linux platform has a GeForce GTX 280.

I noticed that, for all iterations, the CUDA profiler returns almost the same timings for the memcpyHtoD and memcpyDtoH operations on the XP and Linux platforms, as expected.

But on the Vista platform the profiler returns unstable results for memcpyDtoH: the memcpyDtoH speed (downloading data from the GPU) is always slower than the memcpyHtoD speed (uploading data to the GPU), and the slowdown factor varies between 2x and 5x across iterations. The memcpyHtoD results, on the other hand, are stable: the profiler reports almost the same speed for each iteration.

To double-check, I used CUDA events to time these operations in my code and got the same results as above: no problems on XP & Linux, and unstable memcpyDtoH results on Vista.
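For context, timing a copy with CUDA events looks roughly like this (a sketch under the same assumed `h_buf`/`d_buf`/`bytes` names as above, not the original code):

```cuda
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);            // wait until the copy has completed
cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
printf("memcpyDtoH: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```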

Do you think this problem stems from using a different GPU or a different operating system? What can I do to fix it?


This problem might result from batched kernel execution, where the kernel only starts once the memcpyDtoH is issued. Try inserting a cudaStreamQuery(0) at the point where you want the kernel to launch.
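Concretely, the suggestion is to flush the batched commands right after the kernel launch, something like this (a sketch; the kernel name and launch configuration are placeholders):

```cuda
myKernel<<<grid, block>>>(d_buf);  // launch is queued; it may not start yet
cudaStreamQuery(0);                // non-blocking query on the default stream,
                                   // which nudges the driver to submit the batch
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
```

That way the kernel's execution time is no longer folded into the memcpyDtoH measurement.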

I have added cudaStreamSynchronize(0) before and after each of the following operations:

  • copying data from CPU -> GPU
  • running the kernel
  • copying data from GPU -> CPU

Right after each cudaStreamSynchronize(0), I added a cudaStreamQuery(0) and made sure that it returns cudaSuccess.
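The instrumented loop body now looks roughly like this (a sketch of the steps described above, with placeholder kernel and buffer names):

```cuda
cudaStreamSynchronize(0);
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
cudaStreamSynchronize(0);
assert(cudaStreamQuery(0) == cudaSuccess);  // stream must be idle here

myKernel<<<grid, block>>>(d_buf);           // run the kernel
cudaStreamSynchronize(0);
assert(cudaStreamQuery(0) == cudaSuccess);

cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);  // GPU -> CPU
cudaStreamSynchronize(0);
assert(cudaStreamQuery(0) == cudaSuccess);
```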

But the memcpyDtoH results are still just as unstable…

What else can I do to fix this problem?