Idle time between MemCopy instances

The Problem

I’ve been working on optimizing a CUDA application and noticed significant idle time between memory operations in my code. Using Nsight Systems, I observed idle periods ranging from 10-30 microseconds between these three memory operations:

cudaMemcpyToSymbol(d_syndrome_const, syndrome, num_checks * sizeof(int));
cudaMemcpy(d_posterior_odds, d_posterior_odds_reset, num_errors * sizeof(T), cudaMemcpyDeviceToDevice);
cudaMemcpy(d_error_to_check, error_to_check, num_edges * sizeof(T), cudaMemcpyHostToDevice);

These operations include:

  1. A host-to-device symbol (constant memory) copy
  2. A device-to-device memory copy
  3. A host-to-device memory copy

What I’ve Tried

I initially attempted to use CUDA streams to execute these operations concurrently:

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyToSymbolAsync(d_syndrome_const, syndrome, num_checks * sizeof(int), 0, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_posterior_odds, d_posterior_odds_reset, num_errors * sizeof(T), cudaMemcpyDeviceToDevice, stream);
cudaMemcpyAsync(d_error_to_check, error_to_check, num_edges * sizeof(T), cudaMemcpyHostToDevice, stream);

However, I still observed idle time between operations. I also considered creating a unified kernel to handle all three operations at once, but realized this approach wouldn’t solve the fundamental issue since I’d still need to perform the initial host-to-device transfers before running the kernel.

Questions for the Community

  1. Is it possible to completely eliminate the idle time between these memory operations?
  2. Are there any specific tricks to force CUDA to schedule these memory operations with minimal gaps?
  3. Has anyone experienced similar issues when combining different types of memory operations (symbol copies, host-to-device, device-to-device)?
  4. Are there any hardware-level considerations I should be aware of that might be causing these gaps?

Any insights or suggestions would be greatly appreciated. I’m particularly interested in solutions that don’t require fundamentally restructuring my application’s memory layout.

You could try CUDA graphs?
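A minimal sketch of that idea, reusing the buffer names from the question (error checking omitted; the CUDA 12 three-argument cudaGraphInstantiate() signature is assumed): the three copies are captured once into a graph, then replayed each iteration with a single launch call, which reduces per-call issue overhead.

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// Capture the three copies into a graph once, up front.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMemcpyToSymbolAsync(d_syndrome_const, syndrome, num_checks * sizeof(int), 0, cudaMemcpyHostToDevice, stream);
cudaMemcpyAsync(d_posterior_odds, d_posterior_odds_reset, num_errors * sizeof(T), cudaMemcpyDeviceToDevice, stream);
cudaMemcpyAsync(d_error_to_check, error_to_check, num_edges * sizeof(T), cudaMemcpyHostToDevice, stream);
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, 0);

// Per iteration: one launch replays all three copies.
cudaGraphLaunch(graphExec, stream);
cudaStreamSynchronize(stream);
```

For the host-side buffers, pinned (page-locked) memory is advisable so the captured copies are genuinely asynchronous.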

Is the system platform Linux or Windows? If Windows, are you running with the WDDM driver or the TCC driver? Generally speaking, running on Windows with the WDDM driver often gives rise to (seemingly random) execution flow artifacts.

From some quick experiments, the maximum throughput of host → device cudaMemcpyAsync() calls on a reasonably fast modern host platform is about one call per microsecond, while the maximum throughput of cudaMemcpy() is about one call every three microseconds. So on a timeline, I would expect consecutive calls to show up 1 microsecond apart for the former and 3 microseconds apart for the latter.

A cudaMemcpy() device → device is equivalent to a kernel launch. The maximum rate at which kernels can be issued is about one every three microseconds using modern system platforms and GPUs.
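To build intuition for why that is: a device → device copy is, in effect, a copy kernel, so it pays the usual kernel-launch overhead. An illustrative equivalent (not the driver's actual implementation):

```cuda
// Illustrative only: a device-to-device cudaMemcpy() behaves like
// launching a trivial copy kernel such as this, so it incurs normal
// kernel-launch issue overhead (~3 us on current platforms).
template <typename T>
__global__ void copyKernel(T *dst, const T *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Example launch for the reset copy from the question:
// copyKernel<<<(num_errors + 255) / 256, 256>>>(d_posterior_odds, d_posterior_odds_reset, num_errors);
```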

What are the specifications of the host platform, in particular the CPU? Any delays that exist on the host side in issuing work to the GPU will tend to be exacerbated on a slow host system. Whether this plays a role here I cannot tell based on the information provided.

It is not clear from the information presented that mixing different types of copies has any material effect on the observed overhead. Are you saying that you tried a controlled experiment in which the different types of copies were replaced with instances of the same type of copy, and that this resulted in lower observed overhead?

As for the cudaMemcpyToSymbol() call, could it be replaced by passing the data as a (large) kernel parameter?
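A sketch of that idea, assuming num_checks has a compile-time upper bound (MAX_CHECKS below is hypothetical): kernel parameters are limited to 4 KB on older toolkits, and up to about 32 KB with CUDA 12.1 on Volta and newer. The data then rides along with the launch command itself, eliminating the separate cudaMemcpyToSymbol() call.

```cuda
// Hypothetical bound; the whole parameter struct must fit within the
// kernel-parameter limit (4 KB classically, ~32 KB with CUDA 12.1+).
constexpr int MAX_CHECKS = 512;

struct SyndromeParams {
    int syndrome[MAX_CHECKS];  // filled on the host, passed by value at launch
};

__global__ void decodeKernel(SyndromeParams p /*, other args */)
{
    int check = blockIdx.x * blockDim.x + threadIdx.x;
    // ... read p.syndrome[check] instead of the d_syndrome_const symbol ...
}
```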