How to overlap execution of kernels in different streams with copy operations

Hello Forum,

There’s a case where kernels in different streams fail to overlap with copy operations in those streams.
e.g.:

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

As the code shows, kernel1 and kernel2 are both small enough to execute simultaneously. But according to the profiling results, kernel1 and kernel2 execute serially.

If I delete the copy operations or assign them to other streams, e.g.:

kernel1(stream1)
memCopyAsync(stream3)
kernel2(stream2)
memCopyAsync(stream4)

Both kernel1 and kernel2, and even the first copy operation, execute simultaneously.

I’d like to know how to overlap kernels in different streams with copy operations, so that kernel1 and kernel2 run in parallel and each kernel’s results are transferred to the host right after its execution completes:

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

First, cudaMemcpyAsync can still be blocking. In that case, kernel2 will not be launched before kernel1 and the first copy have completed. https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async
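This blocking behavior can be made visible by timing the launch call itself. Below is a minimal sketch (not from this thread; the buffer size, stream name, and use of device 0 are illustrative assumptions): with a pageable destination the cudaMemcpyAsync call stalls the CPU thread until the copy finishes, while with a pinned destination it returns almost immediately.

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;      // 256 MiB, large enough to measure
    unsigned char *d_buf, *h_pageable, *h_pinned;
    cudaMalloc((void**)&d_buf, bytes);
    h_pageable = (unsigned char*)malloc(bytes);   // pageable host memory
    cudaMallocHost((void**)&h_pinned, bytes);     // pinned (page-locked) host memory

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Time how long each *launch call* takes on the CPU, not the copy itself.
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(h_pageable, d_buf, bytes, cudaMemcpyDeviceToHost, s);
    auto t1 = std::chrono::steady_clock::now();   // call blocked until the copy was done
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, s);
    auto t2 = std::chrono::steady_clock::now();   // call returned almost immediately
    cudaStreamSynchronize(s);

    printf("pageable destination: call blocked for %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    printf("pinned destination:   call returned after %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
    return 0;
}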

Then, streams only express dependencies / ordering of operations. Operations in different streams are independent and may run simultaneously, but the CUDA driver gives no guarantee that they actually will.

Thanks for your reply!

Maybe I should make it clear that there’s no dependency between kernel1 and kernel2.

As the code snippet shows

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

kernel1 and kernel2, each followed by its copy operation, are submitted to their respective streams.

If there’s no copy operation in stream1/stream2, kernel1 and kernel2 run in parallel.
But when a copy operation is added to stream1, kernel2 in stream2 is blocked even though that copy operation is not yet running, as the figure depicts.

As I have stated, the programmer can only hint at which operations may run in parallel. The driver is free to ignore these hints, for example when not enough resources are available.
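Whether overlap is possible at all can be read from the device properties. A minimal sketch (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0
    // 1 if the device can run multiple kernels concurrently.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // Number of copy engines; > 0 means copies can overlap with kernel execution.
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}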

Can you share a minimal runnable code example which shows your observation?

@striker159
Here’s a simplified version of my code:

#include <cstdint>
#include <cuda_runtime.h>

__device__ __inline__ void busySleep(clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
}
__global__ void addSelfInArr(uint32_t *arr, uint32_t index, uint32_t num){
    arr[index] += num;
    busySleep(50000000);
}
#define N 1024 // placeholder data length
int main(){
    cudaStream_t stream1, stream2;
    uint32_t *d_a;
    uint32_t *h_a;
    // Memory allocation of d_a and h_a with data length N.
    // Initialize h_a and then copy to d_a.
    h_a = (uint32_t*)malloc(sizeof(uint32_t) * N);
    cudaMalloc((uint32_t**)&d_a, sizeof(uint32_t) * N);
    // Initialize stream1 and stream2.
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    addSelfInArr<<<1,1,0,stream1>>>(d_a, 0, 1); // example kernel arguments
    cudaMemcpyAsync(h_a, d_a, sizeof(uint32_t) * N, cudaMemcpyDeviceToHost, stream1);
    addSelfInArr<<<1,1,0,stream2>>>(d_a, 1, 1); // example kernel arguments
    cudaMemcpyAsync(h_a, d_a, sizeof(uint32_t) * N, cudaMemcpyDeviceToHost, stream2);
    cudaDeviceSynchronize();

    return 0;
}

Can you show the allocation of h_a and d_a?

The allocation is really simple:

    h_a  =  (uint32_t*)malloc(sizeof(uint32_t) * N);
    cudaMalloc((uint32_t**)&d_a, sizeof(uint32_t) * N);

Thank you. In this case, my previous post about cudaMemcpyAsync explains your observation. The linked document says: "For transfers from device memory to pageable host memory, the function will return only once the copy has completed."

This means that the second kernel will never be submitted before the first memcpy has finished, because the CPU thread is blocked.

Try replacing malloc with cudaMallocHost.
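A minimal sketch of that change (identifiers match the repro above; the value of N is a placeholder):

#include <cstdint>
#include <cuda_runtime.h>

#define N 1024 // placeholder data length, matching the repro above

int main(){
    uint32_t *h_a, *d_a;
    // Pinned (page-locked) host memory: cudaMemcpyAsync can be enqueued
    // and return immediately instead of blocking the CPU thread.
    cudaMallocHost((void**)&h_a, sizeof(uint32_t) * N);
    cudaMalloc((void**)&d_a, sizeof(uint32_t) * N);

    // ... launch the kernels and cudaMemcpyAsync calls as in the repro ...

    cudaFreeHost(h_a); // pinned memory is freed with cudaFreeHost, not free()
    cudaFree(d_a);
    return 0;
}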


Thank you very much! Your solution really works very well. It’s awesome!
