cudaMemcpy2DAsync not always fully synchronous


I’m calling cudaMemcpy2DAsync like so:

cudaStream_t _gpuCopyStream;
cudaStreamCreateWithFlags(&_gpuCopyStream, cudaStreamNonBlocking);

auto t_start_check1 = std::chrono::high_resolution_clock::now();
cudaMemcpy2DAsync(…, …, …, …, …, cudaMemcpyDeviceToDevice, _gpuCopyStream);
auto t_end_check1 = std::chrono::high_resolution_clock::now();
float total_check1 = std::chrono::duration<float, std::milli>(t_end_check1 - t_start_check1).count();
printf("Time taken to cross asynchronous function: %f ms.\n", total_check1);


When I measure the host-side time across the cudaMemcpy2DAsync call, the timing spikes at times (as shown in the graph below). I could understand spikes if this were a cudaMemcpyHostToDevice transfer; however, in this case it's a cudaMemcpyDeviceToDevice copy.

Any reason for the spikes? Thanks.

Really this is just wild speculation. AFAIK, device-to-device cudaMemcpy calls are usually implemented under the hood via a device kernel. That kernel launch has to go through the same kernel launch queue as every other kernel, and when the launch queue is full, an "asynchronous" kernel launch becomes host-thread-blocking until a queue slot opens up.

I believe it's a possible explanation, based on what you have shown. OTOH, it seems unlikely to me, because in my experience the kernel launch queue needs ~1000 or more backed-up entries before it fills. So this only seems possible if your code is issuing a large amount of asynchronous work. Your cudaStreamSynchronize() calls would tend to empty the queue, possibly.
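One way to test this hypothesis (a sketch, not code from your post — the spin kernel, launch count, and 0.5 ms threshold are all made-up values): launch many small kernels into a non-blocking stream and time each launch from the host. Early launches should return in microseconds; once the launch queue fills, each "asynchronous" launch should block for roughly one kernel's duration.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical kernel that spins for a fixed number of clock cycles,
// so work backs up in the stream faster than it drains.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { /* busy-wait */ }
}

int main() {
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    for (int i = 0; i < 4000; ++i) {
        auto t0 = std::chrono::high_resolution_clock::now();
        spin<<<1, 1, 0, s>>>(1000000);  // nominally asynchronous launch
        auto t1 = std::chrono::high_resolution_clock::now();
        float ms = std::chrono::duration<float, std::milli>(t1 - t0).count();
        // A launch that took "long" on the host presumably waited
        // for a slot in the launch queue.
        if (ms > 0.5f)
            printf("launch %d blocked for %f ms\n", i, ms);
    }

    cudaStreamSynchronize(s);  // drains the stream (and the launch queue)
    return 0;
}
```

If the per-launch host time jumps after roughly a thousand launches, that's consistent with the queue-full explanation; you could then check whether your real workload ever has that much outstanding asynchronous work between synchronizations. (Requires nvcc and a CUDA-capable GPU to run.)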