cudaMemcpy bandwidth is influenced by idle operations

I’m measuring the bandwidth of device-to-host transfers using the bandwidthTest benchmark from the NVIDIA/cuda-samples repository on GitHub (subdirectory 1_Utilities/bandwidthTest). The test performs MEMCOPY_ITERATIONS iterations of a cudaMemcpy to even out fluctuations in the transfer time. However, I am interested in the time each individual iteration takes. For this purpose, I have modified the section of testDeviceToHostTransfer that copies pinned memory:

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
      //checkCudaErrors(cudaMemcpyAsync(h_odata, d_idata, memSize,
      //                                cudaMemcpyDeviceToHost, 0));
      cudaDeviceSynchronize();   // make sure no earlier work skews the measurement
      sdkStartTimer(&timer);
      checkCudaErrors(cudaMemcpy(h_odata, d_idata, memSize,
                                 cudaMemcpyDeviceToHost));
      cudaDeviceSynchronize();   // ensure the copy has fully completed
      sdkStopTimer(&timer);
      this_t = sdkGetTimerValue(&timer);
      printf("this_t: %lf\n", this_t);
      elapsedTimeInMs += this_t;
      sdkResetTimer(&timer);
      //sleep(1.0);              // stand-in for a long host-side calculation
    }

I have switched from the asynchronous to the blocking memcpy because I think it makes the question easier to follow. There are synchronization barriers before and after the memcpy to isolate it from any other pending CUDA operations. When I execute this I get

this_t: 3.023000
this_t: 2.205000
this_t: 1.363000
this_t: 1.363000
this_t: 1.369000
this_t: 1.363000
this_t: 1.363000
[...]

So the first iteration takes significantly more time than the others. This makes sense, as I assume some initialization is done on the first call of cudaMemcpy. I’m not sure why the second call also takes longer, but all iterations after that are consistently fast.
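(As an aside, one common way to keep such one-time initialization out of the measurement is an untimed warm-up copy before the loop. A minimal sketch, reusing h_odata, d_idata and memSize from the sample above:

    // Untimed warm-up copy: absorbs the one-time driver/initialization cost
    // so it does not show up in the first timed iteration.
    checkCudaErrors(cudaMemcpy(h_odata, d_idata, memSize,
                               cudaMemcpyDeviceToHost));
    cudaDeviceSynchronize();

    for (unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++) {
      // ... timed copies as above ...
    }

This does not explain the sleep effect below, though.)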

The sleep command in the last line was added to imitate an application of mine in which a rather long calculation is performed after each memcpy. I wanted to estimate the time spent in data transfers from the bandwidth obtained with this benchmark, but that estimate turns out to be completely off, for the reason described below.
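For context, the estimate I had in mind follows the same arithmetic the benchmark uses to report bandwidth, roughly like this (a sketch; the exact expression in bandwidthTest may differ slightly):

    // Rough bandwidth estimate from the accumulated per-iteration times:
    // total bytes moved divided by total time, reported in GB/s.
    double bandwidthInGBs =
        ((double)memSize * MEMCOPY_ITERATIONS) / 1e9 /  // total GB transferred
        (elapsedTimeInMs / 1000.0);                     // total seconds
    printf("bandwidth: %f GB/s\n", bandwidthInGBs);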

If I enable the sleep call in each iteration, I would not expect the measurements to change. However, the result is entirely different from the first measurement:

this_t: 2.991000
this_t: 7.045000
this_t: 7.032000
this_t: 9.080000
this_t: 8.062000
this_t: 7.281000
this_t: 7.072000
this_t: 8.321000
this_t: 7.408000
this_t: 6.978000
this_t: 8.367000
this_t: 7.837000

Not only does each iteration after the first take more time than the first one did, they also take significantly longer than without the sleep. I am not able to explain this observation; can anyone shed some light on it?

There are other reports of this type of behavior. That is, putting a thread to sleep seems to incur some sort of additional CUDA overhead (re-initialization?) on the next CUDA operation after you wake it up. I don’t have any detailed explanation.
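One way to probe that hypothesis (my suggestion, not part of the original benchmark) is to replace the sleep with a busy-wait of the same duration, so the thread never yields the CPU while the same amount of wall-clock time still passes. A minimal sketch:

    #include <chrono>

    // Busy-wait for roughly one second instead of sleep(1.0): the thread stays
    // scheduled on the CPU, so any effect tied purely to the thread being
    // descheduled should disappear, while the elapsed time stays the same.
    auto start = std::chrono::steady_clock::now();
    while (std::chrono::steady_clock::now() - start < std::chrono::seconds(1)) {
      // spin
    }

If the per-iteration times go back to the ~1.4 ms seen without the sleep, that points at the thread being put to sleep (rather than time simply passing) as the cause.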

Thanks for your quick answer.

What’s discussed in the other post is roughly what I was also suspecting. I had hoped that there might be some way to not put the thread to sleep, but I guess that’s not the case?

I’m not sure what you mean. You are putting the thread to sleep here:

    sleep(1.0);

That is not a function provided by CUDA, and if you don’t want to put the thread to sleep, don’t make that function call.