cudaMemcpy bandwidth is influenced by idle operations

I’m measuring the bandwidth of device-to-host transfers using the bandwidthTest benchmark from the NVIDIA/cuda-samples repository on GitHub (subdirectory 1_Utilities/bandwidthTest). The test performs MEMCOPY_ITERATIONS iterations of a cudaMemcpy to even out fluctuations in the transfer time. However, I am interested in the time each individual iteration takes. For this purpose, I have modified the section of testDeviceToHostTransfer that copies pinned memory:

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
      //checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize,
      //                                cudaMemcpyDeviceToHost, 0));
      cudaDeviceSynchronize();   // make sure no earlier work skews the measurement
      sdkStartTimer(&timer);
      checkCudaErrors(cudaMemcpy(h_odata, d_idata, memSize,
                                 cudaMemcpyDeviceToHost));
      cudaDeviceSynchronize();   // ensure the copy has fully completed
      sdkStopTimer(&timer);
      this_t = sdkGetTimerValue(&timer);
      printf("this_t: %lf\n", this_t);
      elapsedTimeInMs += this_t;
      sdkResetTimer(&timer);
      //sleep(1.0);              // stand-in for a long host-side calculation
    }

I have switched from the asynchronous to the blocking memcpy because I think it makes the question easier to follow. There are synchronization barriers before and after the memcpy to isolate it from any other pending CUDA operations. When I execute this I get

this_t: 3.023000
this_t: 2.205000
this_t: 1.363000
this_t: 1.363000
this_t: 1.369000
this_t: 1.363000
this_t: 1.363000
[...]

So the first iteration takes significantly more time than the others. This makes sense, as I assume some initialization is done on the first call of cudaMemcpy. I’m not sure why the second call also takes longer, but all iterations after that are consistently fast.
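(As an aside, one common way to keep such one-time initialization out of the measurement is an untimed warm-up copy before the loop. A minimal sketch, reusing h_odata, d_idata and memSize from the sample above:

    // Untimed warm-up copy: absorbs the one-time driver/initialization cost
    // so it does not show up in the first timed iteration.
    checkCudaErrors(cudaMemcpy(h_odata, d_idata, memSize,
                               cudaMemcpyDeviceToHost));
    cudaDeviceSynchronize();

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
      // ... timed copies as above ...
    }

This does not explain the sleep effect below, though.)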

The sleep command in the last line was added to imitate an application of mine in which a rather long calculation is performed after each memcpy. I wanted to estimate the time spent in data transfers from the bandwidth obtained with this benchmark, but that estimate turns out to be completely off, for the reason described below.
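For context, the estimate I had in mind follows the same arithmetic the benchmark uses to report bandwidth, roughly like this (a sketch; the exact expression in bandwidthTest may differ slightly):

    // Rough bandwidth estimate from the accumulated per-iteration times:
    // total bytes moved divided by total time, reported in GB/s.
    double bandwidthInGBs =
        ((double)memSize * MEMCOPY_ITERATIONS) / 1e9 /  // total GB transferred
        (elapsedTimeInMs / 1000.0);                     // total seconds
    printf("bandwidth: %f GB/s\n", bandwidthInGBs);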

If I enable the sleep call in each iteration, I would not expect the measurements to change. However, the result is entirely different from the first measurement:

this_t: 2.991000
this_t: 7.045000
this_t: 7.032000
this_t: 9.080000
this_t: 8.062000
this_t: 7.281000
this_t: 7.072000
this_t: 8.321000
this_t: 7.408000
this_t: 6.978000
this_t: 8.367000
this_t: 7.837000

Not only does each iteration after the first take more time than the first one did, they also take significantly longer than without the sleep. I am not able to explain this observation; can anyone shed some light on it?

There are other reports of this type of behavior. That is, putting a thread to sleep seems to incur some sort of additional CUDA overhead (re-initialization?) on the next CUDA operation after you wake it up. I don’t have any detailed explanation.
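One way to probe that hypothesis (my suggestion, not part of the original benchmark) is to replace the sleep with a busy-wait of the same duration, so the thread never yields the CPU while the same amount of wall-clock time still passes. A minimal sketch:

    #include <chrono>

    // Busy-wait for roughly one second instead of sleep(1.0): the thread stays
    // scheduled on the CPU, so any effect tied purely to the thread being
    // descheduled should disappear, while the elapsed time stays the same.
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - start < std::chrono::seconds(1)) {
      // spin
    }

If the per-iteration times go back to the ~1.4 ms seen without the sleep, that points at the thread being put to sleep (rather than time simply passing) as the cause.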

Thanks for your quick answer.

What’s discussed in the other post is roughly what I was also suspecting. I had hoped that there might be some way to not put the thread to sleep, but I guess that’s not the case?

I’m not sure what you mean. You are putting the thread to sleep here:

    sleep(1.0);

That is not a function provided by CUDA, and if you don’t want to put the thread to sleep, don’t make that function call.