How to overlap execution of kernels in different streams with copy operations

Hello Forum,

There’s a case where kernels in different streams fail to overlap with copy operations in those streams.
e.g.:

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

As the code shows, kernel1 and kernel2 are both small enough to execute simultaneously. But according to the profiling results, kernel1 and kernel2 execute serially.

If I delete the copy operations or assign them to other streams, e.g.:

kernel1(stream1)
memCopyAsync(stream3)
kernel2(stream2)
memCopyAsync(stream4)

Both kernel1 and kernel2, and even the first copy operation, execute simultaneously.

I’d like to know how to overlap kernels in different streams with copy operations, so that kernel1 and kernel2 run in parallel and each kernel’s results are transferred to the host right after its execution completes:

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

First, cudaMemcpyAsync can still be blocking. In that case, kernel2 will not be launched before kernel1 and the first copy have completed. https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior__memcpy-async
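This blocking behavior can be made visible by timing the launch call itself. Below is a minimal sketch (not from this thread; the buffer size, stream name, and use of device 0 are illustrative assumptions): with a pageable destination the cudaMemcpyAsync call stalls the CPU thread until the copy finishes, while with a pinned destination it returns almost immediately.

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256u * 1024 * 1024;      // 256 MiB, large enough to measure
    unsigned char *d_buf, *h_pageable, *h_pinned;
    cudaMalloc((void**)&d_buf, bytes);
    h_pageable = (unsigned char*)malloc(bytes);   // pageable host memory
    cudaMallocHost((void**)&h_pinned, bytes);     // pinned (page-locked) host memory

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Time how long each *launch call* takes on the CPU, not the copy itself.
    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(h_pageable, d_buf, bytes, cudaMemcpyDeviceToHost, s);
    auto t1 = std::chrono::steady_clock::now();   // call blocked until the copy was done
    cudaMemcpyAsync(h_pinned, d_buf, bytes, cudaMemcpyDeviceToHost, s);
    auto t2 = std::chrono::steady_clock::now();   // call returned almost immediately
    cudaStreamSynchronize(s);

    printf("pageable destination: call blocked for %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    printf("pinned destination:   call returned after %lld ms\n",
           (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count());

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
    return 0;
}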

Then, streams only express dependencies / ordering of operations. Operations in different streams are independent and may run simultaneously, but the CUDA driver gives no guarantee that they actually will.

Thanks for your reply!

Maybe I should make it clear that there’s no dependency between kernel1 and kernel2.

As the code snippet shows

kernel1(stream1)
memCopyAsync(stream1) // copy kernel1 results back to host
kernel2(stream2)
memCopyAsync(stream2) // copy kernel2 results back to host

kernel1 and kernel2, each followed by its copy operation, are submitted to their respective streams.

If there’s no copy operation in stream1/stream2, kernel1 and kernel2 run in parallel.
But when a copy operation is added to stream1, kernel2 in stream2 is blocked even though that copy operation is not yet running, as the figure depicts.

As I have stated, the programmer can only hint at which operations may run in parallel. The driver is free to ignore these hints, for example when not enough resources are available.
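Whether overlap is possible at all can be read from the device properties. A minimal sketch (device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // query device 0
    // 1 if the device can run multiple kernels concurrently.
    printf("concurrentKernels: %d\n", prop.concurrentKernels);
    // Number of copy engines; > 0 means copies can overlap with kernel execution.
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}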

Can you share a minimal runnable code example which shows your observation?

@striker159
Here’s a simplified version of my code:

#include <cstdint>
#include <cuda_runtime.h>

__device__ __inline__ void busySleep(clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
}
__global__ void addSelfInArr(uint32_t *arr, uint32_t index, uint32_t num){
    arr[index] += num;
    busySleep(50000000);
}
#define N 1024 // placeholder data length
int main(){
    cudaStream_t stream1, stream2;
    uint32_t *d_a;
    uint32_t *h_a;
    // Memory allocation of d_a and h_a with data length N.
    // Initialize h_a and then copy to d_a.
    h_a = (uint32_t*)malloc(sizeof(uint32_t) * N);
    cudaMalloc((uint32_t**)&d_a, sizeof(uint32_t) * N);
    // Initialize stream1 and stream2.
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    addSelfInArr<<<1,1,0,stream1>>>(d_a, 0, 1); // example kernel arguments
    cudaMemcpyAsync(h_a, d_a, sizeof(uint32_t) * N, cudaMemcpyDeviceToHost, stream1);
    addSelfInArr<<<1,1,0,stream2>>>(d_a, 1, 1); // example kernel arguments
    cudaMemcpyAsync(h_a, d_a, sizeof(uint32_t) * N, cudaMemcpyDeviceToHost, stream2);
    cudaDeviceSynchronize();

    return 0;
}

Can you show the allocation of h_a and d_a?

The allocation is really simple:

    h_a  =  (uint32_t*)malloc(sizeof(uint32_t) * N);
    cudaMalloc((uint32_t**)&d_a, sizeof(uint32_t) * N);

Thank you. In this case, my previous post about cudaMemcpyAsync explains your observation. The linked document says: "For transfers from device memory to pageable host memory, the function will return only once the copy has completed."

This means that the second kernel will never be submitted before the first memcpy has finished, because the CPU thread is blocked.

Try replacing malloc with cudaMallocHost.
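A minimal sketch of that change (identifiers match the repro above; the value of N is a placeholder):

#include <cstdint>
#include <cuda_runtime.h>

#define N 1024 // placeholder data length, matching the repro above

int main(){
    uint32_t *h_a, *d_a;
    // Pinned (page-locked) host memory: cudaMemcpyAsync can be enqueued
    // and return immediately instead of blocking the CPU thread.
    cudaMallocHost((void**)&h_a, sizeof(uint32_t) * N);
    cudaMalloc((void**)&d_a, sizeof(uint32_t) * N);

    // ... launch the kernels and cudaMemcpyAsync calls as in the repro ...

    cudaFreeHost(h_a); // pinned memory is freed with cudaFreeHost, not free()
    cudaFree(d_a);
    return 0;
}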


Thank you very much! Your solution really works very well. It’s awesome!
