Why do the CUDA kernel and copy not overlap?

CudaCodec16K · November 5, 2024, 9:40am

#include <cuda_runtime.h>
#include <iostream>

__global__ void kernel(int* data, int value, int iterations) {
    int idx = threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        data[idx] += value;
    }
}

int main() {
    const int numElements = 10 * 1024 * 1024;
    const int iterations = 1 << 20;
    const int numRounds = 10;

    int* devData1, * devData2;
    int* hostData1, * hostData2;

    cudaMallocHost((void**)&hostData1, numElements * sizeof(int));
    cudaMallocHost((void**)&hostData2, numElements * sizeof(int));
    cudaMalloc((void**)&devData1, numElements * sizeof(int));
    cudaMalloc((void**)&devData2, numElements * sizeof(int));

    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    kernel<<<1, 1024, 0, stream1>>>(devData1, 1, iterations);
    cudaStreamSynchronize(stream1);
    cudaMemcpyAsync(hostData1, devData1, numElements * sizeof(int), cudaMemcpyDeviceToHost, stream1);

    kernel<<<1, 1024, 0, stream2>>>(devData2, 2, iterations);
    cudaStreamSynchronize(stream2);
    cudaMemcpyAsync(hostData2, devData2, numElements * sizeof(int), cudaMemcpyDeviceToHost, stream2);

    std::cout << "Test completed." << std::endl;
    return 0;
}

Environment: Windows 10, Visual Studio 2022, CUDA 12.6

With the demo code above, I expected the kernel in stream2 to execute while stream1 is still copying, so that the copy and the kernel overlap. However, my test results show no overlap. What is the reason?

Robert_Crovella · November 5, 2024, 6:46pm

I think some overlap between these two operations, the cudaMemcpyAsync issued into stream1 and the kernel launched into stream2, should be possible. On Windows with the GPU in WDDM mode, it's possible that WDDM is causing issues. I sometimes suggest that people try both settings of Hardware Accelerated GPU Scheduling, to see if either setting results in the desired overlap. You can search for that term (Hardware Accelerated GPU Scheduling), take the first blog hit from Microsoft, and use that to guide your study.
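
To compare the two Hardware Accelerated GPU Scheduling settings, it may help to have a number rather than a profiler trace. The following is only a minimal sketch, not the code from the question: the names busyKernel, timeMs, copyStream and kernelStream are made up for illustration, the volatile qualifier is an addition so the compiler cannot collapse the loop, and the iteration count is an assumed value that probably needs tuning so the kernel and the copy take roughly the same time. It times the device-to-host copy alone, the kernel alone, and both issued back to back into separate streams.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Busy-loop kernel modeled on the one in the question. volatile is an
// addition here: it forces a real load and store on every iteration.
__global__ void busyKernel(volatile int* data, int value, int iterations) {
    int idx = threadIdx.x;
    for (int i = 0; i < iterations; ++i) {
        data[idx] = data[idx] + value;
    }
}

// Hypothetical helper: wall-clock milliseconds for the work issued by
// `issue`, measured from an idle device until the device is idle again.
template <typename Issue>
double timeMs(Issue issue) {
    cudaDeviceSynchronize();
    auto t0 = std::chrono::steady_clock::now();
    issue();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    const size_t numElements = 10 * 1024 * 1024;
    const size_t bytes = numElements * sizeof(int);
    const int iterations = 1 << 14;  // assumed value, tune for your GPU

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("No usable CUDA device found.\n");
        return 1;
    }
    // A value of 0 means the device cannot overlap copies with kernels at all.
    std::printf("%s: asyncEngineCount = %d\n", prop.name, prop.asyncEngineCount);

    int *hostData = nullptr, *devCopy = nullptr, *devKernel = nullptr;
    cudaMallocHost((void**)&hostData, bytes);  // pinned host memory
    cudaMalloc((void**)&devCopy, bytes);
    cudaMalloc((void**)&devKernel, bytes);
    cudaMemset(devCopy, 0, bytes);
    cudaMemset(devKernel, 0, bytes);

    cudaStream_t copyStream, kernelStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&kernelStream);

    auto issueCopy = [&] {
        cudaMemcpyAsync(hostData, devCopy, bytes, cudaMemcpyDeviceToHost, copyStream);
    };
    auto issueKernel = [&] {
        busyKernel<<<1, 1024, 0, kernelStream>>>(devKernel, 2, iterations);
    };

    timeMs([&] { issueCopy(); issueKernel(); });  // warm-up, result discarded

    double copyMs   = timeMs(issueCopy);
    double kernelMs = timeMs(issueKernel);
    double bothMs   = timeMs([&] { issueCopy(); issueKernel(); });

    std::printf("copy alone:   %.3f ms\n", copyMs);
    std::printf("kernel alone: %.3f ms\n", kernelMs);
    std::printf("copy+kernel:  %.3f ms\n", bothMs);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(kernelStream);
    cudaFree(devCopy);
    cudaFree(devKernel);
    cudaFreeHost(hostData);
    return err == cudaSuccess ? 0 : 1;
}
```

If the combined time is close to the larger of the two individual times, the copy and the kernel are overlapping; if it is close to their sum, they are being serialized. Host-side timing is coarse, so treat the numbers as a rough indication and confirm with a profiler timeline. Running the same binary under each Hardware Accelerated GPU Scheduling setting gives a direct comparison.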

system · November 19, 2024, 6:47pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
