Kernel executed in non-default CUDA stream waits for other streams to complete cudaMemcpyAsync

Hi everyone,

I’m experimenting with CUDA streams for matrix multiplication. These are the two approaches I’m comparing:

Approach 1: loop(cudaMemcpyAsync HtoD, kernel launch, cudaMemcpyAsync DtoH) over n streams
In this approach, I loop over the n streams and, in each iteration, issue the following operations:

  • cudaMemcpyAsync from Host to Device on stream i
  • kernel launch on stream i
  • cudaMemcpyAsync from Device to Host on stream i

In this approach, the execution timeline is as expected: data copies (H to D and D to H) and kernel execution overlap across the different streams. No issues so far.
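In code, the issue order looks roughly like this (a sketch; identifiers such as matMulKernel, nStreams, chunkElems and chunkBytes are illustrative, not from my actual code):

for (int i = 0; i < nStreams; ++i) {
    int offset = i * chunkElems;
    cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
    matMulKernel<<<grid, block, 0, stream[i]>>>(d_a + offset, d_b, d_c + offset);   // compute on this chunk
    cudaMemcpyAsync(h_c + offset, d_c + offset, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);
}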

Execution timeline from nsys-ui:
[screenshot: approach1]

Code: Link

Approach 2: loop(cudaMemcpyAsync HtoD), loop(kernel launch), loop(cudaMemcpyAsync DtoH) over n streams
In this approach, I have 3 loops defined as follows:

  • Loop 1: cudaMemcpyAsync from Host to Device on stream i
  • Loop 2: Launch kernel on stream i
  • Loop 3: cudaMemcpyAsync from device to host on stream i

From my understanding, kernel execution on a given stream should begin as soon as the HtoD copy for that stream is complete. However, the kernels wait for all streams to finish their HtoD copies before beginning.
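The issue order here is roughly (again a sketch with illustrative identifiers):

for (int i = 0; i < nStreams; ++i)   // Loop 1: all HtoD copies
    cudaMemcpyAsync(d_a + i * chunkElems, h_a + i * chunkElems, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)   // Loop 2: all kernel launches
    matMulKernel<<<grid, block, 0, stream[i]>>>(d_a + i * chunkElems, d_b, d_c + i * chunkElems);
for (int i = 0; i < nStreams; ++i)   // Loop 3: all DtoH copies
    cudaMemcpyAsync(h_c + i * chunkElems, d_c + i * chunkElems, chunkBytes, cudaMemcpyDeviceToHost, stream[i]);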
Execution timeline from nsys-ui:
[screenshot: approach2]

Code: Link

Question: What is causing the kernel on one stream to wait for data copy on other streams?

A blog post from 2012 states:

The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q feature eliminates the need to tailor the launch order, so either approach above will work.

Platform: RTX 3090, CUDA 12.6, Linux Ubuntu 24.04 x86

Are you running on Windows or Linux? I am guessing the former.

I’m on Linux, Ubuntu 24.04 x86.

I would check whether the preconditions for asynchronous memory copies are fully met. Is the host memory pinned?

Perhaps even use

cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc(&p, size, cudaHostAllocMapped);

Are the actual API calls for the host-to-device copies issued at the positions shown in the timeline, or is only the execution of the calls within the streams shown there? In other words, are the host-to-device copies blocking on the host side?

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0

What does asyncEngineCount return for your GPU?
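You can query it with cudaGetDeviceProperties, for example (error checking omitted):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);   // device 0
printf("asyncEngineCount = %d\n", prop.asyncEngineCount);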

  1. Memory is pinned.
  2. The HtoD copies are non-blocking since I’m using cudaMemcpyAsync; also, the same code works with the first approach, where overlapping data copy and kernel execution are observed.
  3. asyncEngineCount is 2.

cudaMemcpyAsync falls back to synchronous behaviour under certain conditions (for example, when the host memory involved is not pinned).

In your approach 2, the kernel of the first stream executes late, even though it could overlap with the second to fourth copies (which is what we want). So it matters whether the kernel invocation was issued late on the CPU, or was issued early and merely executed late.

Could you measure whether all (or one) of the memory copies are blocking, instead of assuming they are not? That would help pinpoint the reason.
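For example, you could time the host-side duration of each call (a rough sketch; needs <chrono> and <cstdio>, identifiers illustrative). If the call returns within a few microseconds it is truly asynchronous; if it takes about as long as the transfer itself, it has fallen back to blocking behaviour:

auto t0 = std::chrono::steady_clock::now();
cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes, cudaMemcpyHostToDevice, stream[i]);
auto t1 = std::chrono::steady_clock::now();
printf("HtoD issue time on stream %d: %.1f us\n", i,
       std::chrono::duration<double, std::micro>(t1 - t0).count());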

I would also try using separate allocations instead of offsets into one buffer, to rule out any false dependency detection (there should not be any).

I wonder whether it could be an artifact where the copies are really small and therefore finish before the kernel commences execution. The kernel launch overhead on current hardware is something like 2 microseconds. What happens when you make the copies significantly bigger?

I checked the CPU-side times to see if the memcpy calls were blocking. The CPU times are about 100x lower than the copy durations reported by Nsight, so I’d conclude they are indeed asynchronous.
I also tried separate memory allocations and observed the same behaviour.

I made the memory copies huge compared to the kernel compute time. The first kernel still launches only after the last cudaMemcpyAsync has completed. Unsure what is going on.

I am out of ideas as to what else to check. I briefly looked over the code but spotted no obvious red flag. Come Monday, some other forum participants may be more eagle-eyed or have suggestions for further experiments.

Is the overall execution time worse for approach 2? One possibility is an incorrect timing display in Nsight Systems.

Appreciate your time. I found this in the documentation:

Two commands from different streams cannot run concurrently if any one of the following operations is issued in between them by the host thread:
  • a memory copy between two addresses to the same device memory.

Do you think the HtoD copies meet this criterion? I couldn’t completely understand the statement.

For the small matrix size I’m using, approach 2 is consistently slower (~15 ms) than approach 1 (~12 ms).

You may be hitting a lazy loading situation.


Woah! It was precisely this.

Setting the env variable CUDA_MODULE_LOADING=EAGER fixed it.

Thanks a lot Robert!
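For anyone else who hits this: as I understand the lazy loading docs, instead of setting the environment variable you can also force the kernel’s module to load in code before the timed region, e.g. (sketch, kernel name illustrative):

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, matMulKernel);   // forces the kernel to be loaded under lazy loading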
