I’m experimenting with CUDA streams for matrix multiplication. These are the two approaches I’m trying:
Approach 1: loop(cudaMemcpyAsync HtoD, kernel launch, cudaMemcpyAsync DtoH) over n streams
In this approach, I loop over the n streams and, in each iteration, perform the following operations:
cudaMemcpyAsync from Host to Device on stream i
kernel launch on stream i
cudaMemcpyAsync from Device to Host on stream i
With this approach, the execution timeline is as expected: the data copies (HtoD and DtoH) and the kernel executions overlap across the different streams. No issues so far.
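For reference, the issue loop looks roughly like this (a simplified sketch of the structure, not my exact code; the kernel body, sizes, and variable names are placeholders):

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void matMulKernel(const float* in, float* out, int n) { /* ... */ }

void approach1(int nStreams, size_t chunkElems, dim3 grid, dim3 block) {
    size_t chunkBytes = chunkElems * sizeof(float);
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  nStreams * chunkBytes);   // pinned host buffers
    cudaMallocHost(&h_out, nStreams * chunkBytes);
    cudaMalloc(&d_in,  nStreams * chunkBytes);
    cudaMalloc(&d_out, nStreams * chunkBytes);

    std::vector<cudaStream_t> streams(nStreams);
    for (auto& s : streams) cudaStreamCreate(&s);

    // Approach 1: copy -> kernel -> copy issued back-to-back on each stream
    for (int i = 0; i < nStreams; ++i) {
        size_t off = i * chunkElems;
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[i]);
        matMulKernel<<<grid, block, 0, streams[i]>>>(d_in + off, d_out + off, (int)chunkElems);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();
}
```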
Approach 2: loop(cudaMemcpyAsync HtoD), loop(kernel launch), loop(cudaMemcpyAsync DtoH) over n streams
In this approach, I have 3 loops defined as follows:
Loop 1: cudaMemcpyAsync from Host to Device on stream i
Loop 2: Launch kernel on stream i
Loop 3: cudaMemcpyAsync from Device to Host on stream i
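With the same setup as in the sketch above, Approach 2 only changes the order in which the work is issued (again a sketch, not my exact code):

```cpp
// Approach 2: issue all HtoD copies, then all kernels, then all DtoH copies
for (int i = 0; i < nStreams; ++i) {
    size_t off = i * chunkElems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
}
for (int i = 0; i < nStreams; ++i) {
    size_t off = i * chunkElems;
    matMulKernel<<<grid, block, 0, streams[i]>>>(d_in + off, d_out + off, (int)chunkElems);
}
for (int i = 0; i < nStreams; ++i) {
    size_t off = i * chunkElems;
    cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}
```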
From my understanding, each kernel should begin executing as soon as the HtoD copy for its stream is complete. However, in the timeline no kernel starts until all streams have finished their HtoD copies.
Execution timeline from nsys-ui:
For reference, NVIDIA’s blog post on overlapping data transfers states: “The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q feature eliminates the need to tailor the launch order, so either approach above will work.” So I would expect both approaches to overlap on my much newer GPU.
Platform: RTX 3090, CUDA 12.6, Linux Ubuntu 24.04 x86
Are the actual API calls for the host-to-device copies issued at the positions shown in the timeline, or does the timeline only show when the copies execute within the streams? In other words, are the host-to-device copies blocking on the host side?
The HtoD copies should be non-blocking, since I’m using cudaMemcpyAsync. Also, the same code works with the first approach, where overlapping data copy and kernel execution is observed.
cudaMemcpyAsync has a fallback to synchronous behaviour under certain conditions.
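The most common trigger is pageable host memory: overlap of copies with kernel execution requires page-locked (pinned) host buffers. A quick thing to check (sketch, reusing the placeholder names from the pseudocode above):

```cpp
// With pageable host memory, cudaMemcpyAsync may be staged through an internal
// pinned buffer and lose the overlap (the "synchronous fallback").
float* h_pageable = (float*)malloc(chunkBytes);   // pageable: overlap not guaranteed

float* h_pinned = nullptr;
cudaMallocHost(&h_pinned, chunkBytes);            // pinned: needed for true overlap

cudaMemcpyAsync(d_in, h_pinned, chunkBytes, cudaMemcpyHostToDevice, streams[0]);
```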
In your Approach 2, the first stream’s kernel executes late, even though it could overlap with the second through fourth copies (which is what we want). So it matters whether the kernel invocation was issued late on the CPU, or was issued early and merely executed late.
Could you measure whether all (or any) of the memory copies are blocking, instead of assuming they are not? That would help to pinpoint the reason.
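For example, wrap each call in a host-side timer (sketch, reusing the placeholder names from the earlier pseudocode); a truly asynchronous call should return within a few microseconds regardless of the transfer size:

```cpp
#include <chrono>
#include <cstdio>

auto t0 = std::chrono::steady_clock::now();
cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                cudaMemcpyHostToDevice, streams[i]);
auto t1 = std::chrono::steady_clock::now();
double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
printf("HtoD issue time on stream %d: %.1f us\n", i, us);
```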
I would also try using separate memory allocations instead of offsets into one buffer, to exclude any false detection of a dependence (there should not be one).
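I.e. one independent allocation per stream instead of offsets into a single large buffer, something like (sketch):

```cpp
// Separate device buffers per stream instead of offsets into one big allocation
std::vector<float*> d_in_s(nStreams), d_out_s(nStreams);
for (int i = 0; i < nStreams; ++i) {
    cudaMalloc(&d_in_s[i],  chunkBytes);
    cudaMalloc(&d_out_s[i], chunkBytes);
}
// ...then use d_in_s[i] / d_out_s[i] in the copy and kernel calls for stream i
```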
I wonder whether it could be an artifact where the copies are really small and therefore finish before the kernel commences execution. The kernel launch overhead on current hardware is something like 2 microseconds. What happens when you make the copies significantly bigger?
I checked the CPU-side times to see whether the memcpy calls were blocking. The CPU times are about 100x lower than the actual transfer times reported by Nsight, so I’d conclude the calls are indeed asynchronous.
I also tried separate memory allocations and observed the same behaviour.
I made the memory copies huge compared to the kernel compute time. The first kernel still launches only after the last cudaMemcpyAsync has completed. I’m unsure what is going on.
I am out of ideas for what else to check. I briefly looked over the code but spotted no obvious red flags. Come Monday, other forum participants may be more eagle-eyed or have suggestions for further experiments.
Appreciate your time. I found this in the documentation:
Two commands from different streams cannot run concurrently if any one of the following operations is issued in between them by the host thread:
→ a memory copy between two addresses to the same device memory.
Do you think the HtoD copies meet this criterion? I couldn’t completely understand the statement.