Kernel executed in non-default CUDA stream waits for other streams to complete cudaMemcpyAsync

Hi everyone,

I’m experimenting with CUDA streams for matrix multiplication. These are the two approaches I’m comparing:

Approach 1: loop(cudaMemcpyAsync HtoD, kernel launch, cudaMemcpyAsync DtoH) over n streams
In this approach, I loop over the n streams and, in each iteration, issue the following operations:

  • cudaMemcpyAsync from Host to Device on stream i
  • kernel launch on stream i
  • cudaMemcpyAsync from Device to Host on stream i

In this approach, the execution timeline is as expected: data copies (H to D and D to H) and kernel execution overlap across the different streams. No issues so far.
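In code, the issue order looks roughly like this (a sketch; identifiers such as matMulKernel, nStreams, chunkElems and chunkBytes are illustrative, not from my actual code):

for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunkElems;
    cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
    matMulKernel<<<grid, block, 0, stream[i]>>>(d_a + offset, d_b, d_c + offset);   // compute on this chunk
    cudaMemcpyAsync(h_c + offset, d_c + offset, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);
}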

Execution timeline from nsys-ui:
[screenshot: approach1]

Code: Link

Approach 2: loop(cudaMemcpyAsync HtoD), loop(kernel launch), loop(cudaMemcpyAsync DtoH) over n streams
In this approach, I have 3 loops defined as follows:

  • Loop 1: cudaMemcpyAsync from Host to Device on stream i
  • Loop 2: Launch kernel on stream i
  • Loop 3: cudaMemcpyAsync from device to host on stream i

From my understanding, kernel execution on a given stream should begin as soon as the HtoD copy for that stream is complete. However, the kernels wait for all streams to finish their HtoD copies before beginning.
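The issue order here is roughly (again a sketch with illustrative identifiers):

for (int i = 0; i < nStreams; ++i)   // Loop 1: all HtoD copies
    cudaMemcpyAsync(d_a + i * chunkElems, h_a + i * chunkElems, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)   // Loop 2: all kernel launches
    matMulKernel<<<grid, block, 0, stream[i]>>>(d_a + i * chunkElems, d_b, d_c + i * chunkElems);
for (int i = 0; i < nStreams; ++i)   // Loop 3: all DtoH copies
    cudaMemcpyAsync(h_c + i * chunkElems, d_c + i * chunkElems, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);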
Execution timeline from nsys-ui:
[screenshot: approach2]

Code: Link

Question: What is causing the kernel on one stream to wait for data copy on other streams?

A blog post from 2012 states:

The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q feature eliminates the need to tailor the launch order, so either approach above will work.

Platform: RTX 3090, CUDA 12.6, Linux Ubuntu 24.04 x86

Are you running on Windows or Linux? I am guessing the former.

I’m on Linux, Ubuntu 24.04 x86.

I would check whether the preconditions for asynchronous memory copies are fully met. Is the host memory pinned?

Perhaps even use

cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc(&p, size, cudaHostAllocMapped);

Are the actual API calls for the host-to-device copies issued at the positions shown in the timeline, or is only the execution of the calls within the streams shown there? In other words, are the host-to-device copies blocking on the host side?

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0

What does asyncEngineCount return for your GPU?
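You can query it with cudaGetDeviceProperties, for example (error checking omitted):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0
printf("asyncEngineCount = %d\n", prop.asyncEngineCount);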

  1. Memory is pinned.
  2. The HtoD copies are non-blocking since I’m using cudaMemcpyAsync; also, the same code works with the first approach, where overlapping data copy and kernel execution are observed.
  3. asyncEngineCount is 2.

cudaMemcpyAsync falls back to synchronous behaviour under certain conditions (for example, when the host memory involved is not pinned).

In your approach 2, the kernel of the first stream executes late, even though it could overlap with the second to fourth copies (which is what we want). So it matters whether the kernel invocation was issued late on the CPU, or was issued early and merely executed late.

Could you measure whether all (or one) of the memory copies are blocking, instead of assuming they are not? That would help pinpoint the reason.
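For example, you could time the host-side duration of each call (a rough sketch; needs <chrono> and <cstdio>, identifiers illustrative). If the call returns within a few microseconds it is truly asynchronous; if it takes about as long as the transfer itself, it has fallen back to blocking behaviour:

auto t0 = std::chrono::steady_clock::now();
cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
auto t1 = std::chrono::steady_clock::now();
printf("HtoD issue time on stream %d: %.1f us\n", i,
       std::chrono::duration<double, std::micro>(t1 - t0).count());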

I would also try using separate allocations instead of offsets into one buffer, to rule out any false dependency detection (there should not be any).

I wonder whether it could be an artifact where the copies are really small and therefore finish before the kernel commences execution. The kernel launch overhead on current hardware is something like 2 microseconds. What happens when you make the copies significantly bigger?

I checked the CPU-side times to see if the memcpy calls were blocking. The CPU times are about 100x lower than the copy durations reported by Nsight, so I’d conclude they are indeed asynchronous.
I also tried separate memory allocations and observed the same behaviour.

I made the memory copies huge compared to the kernel compute time. The first kernel still launches only after the last cudaMemcpyAsync has completed. Unsure what is going on.

I am out of ideas as to what else to check. I briefly looked over the code but spotted no obvious red flag. Come Monday, some other forum participants may be more eagle-eyed or have suggestions for further experiments.

Is the overall execution time worse for approach 2? One possibility is an incorrect timing display in Nsight Systems.

Appreciate your time. I found this in the documentation:

Two commands from different streams cannot run concurrently if any one of the following operations is issued in between them by the host thread:
  • a memory copy between two addresses to the same device memory.

Do you think the HtoD copies meet this criterion? I couldn’t completely understand the statement.

For the small matrix size I’m using, approach 2 is consistently slower (~15 ms) than approach 1 (~12 ms).

You may be hitting a lazy loading situation.


Woah! It was precisely this.

Setting the env variable CUDA_MODULE_LOADING=EAGER fixed it.

Thanks a lot Robert!
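For anyone else who hits this: as I understand the lazy loading docs, instead of setting the environment variable you can also force the kernel’s module to load in code before the timed region, e.g. (sketch, kernel name illustrative):

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, matMulKernel);   // forces the kernel to be loaded under lazy loading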
