I’m measuring the bandwidth of device-to-host transfers using the bandwidthTest benchmark contained in GitHub - NVIDIA/cuda-samples: Samples for CUDA Developers which demonstrates features in CUDA Toolkit (subdirectory 1_Utilities/bandwidthTest). The test performs `MEMCOPY_ITERATIONS`

iterations of a `cudaMemcpy`

to even out fluctuations in the transfer time. However, I am interested in the times each individual iteration takes. For this purpose, I have modified the section of `testDeviceToHostTransfer`

which copies pinned memory:

```
for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
//checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize,
// cudaMemcpyDeviceToHost, 0));
cudaDeviceSynchronize();
sdkStartTimer(&timer);
checkCudaErrors(cudaMemcpy(h_odata, d_idata, memSize,
cudaMemcpyDeviceToHost));
cudaDeviceSynchronize();
sdkStopTimer(&timer);
this_t = sdkGetTimerValue(&timer);
printf ("this_t: %lf\n", this_t);
elapsedTimeInMs += this_t;
sdkResetTimer(&timer);
//sleep(1.0);
}
```

I have switched from asynchronous to blocking memcpy because I think it is easier for this question. There are synchronization barriers before and after the memcpy to isolate any potential other CUDA operations. When I execute this I get

```
this_t: 3.023000
this_t: 2.205000
this_t: 1.363000
this_t: 1.363000
this_t: 1.369000
this_t: 1.363000
this_t: 1.363000
[...]
```

So the first iteration takes up significantly more time than the other ones. This makes sense, as I guess that some initialization is done for the first call of cudaMemcpy. I’m not sure why the second call also takes longer, but all iterations after that are consistently faster than that.

The sleep command in the last line is added to imitate an application of mine where after each memcpy a rather long calculation is performed. I wanted to estimate the time spent in data transfers from the bandwidth obtained from this benchmark. It turns out to be completely off due to the reason below.

If I add the `sleep`

command for each iteration, I would not expect the measurements to change. However, the observation is entirely different from the first measurement:

```
this_t: 2.991000
this_t: 7.045000
this_t: 7.032000
this_t: 9.080000
this_t: 8.062000
this_t: 7.281000
this_t: 7.072000
this_t: 8.321000
this_t: 7.408000
this_t: 6.978000
this_t: 8.367000
this_t: 7.837000
```

Not only does each iteration after the first one take more time than the first one, they also take significantly longer than without the sleep. I am not able to explain this observation, does anyone know further?