Why do different kernels in different streams run nearly serially, while the same kernel in different streams overlaps perfectly?

I’m a CUDA beginner, and I’ve encountered behavior that seems counter-intuitive. When I use multiple streams to execute the same kernel on different data, I observe excellent overlapping execution. However, when I launch different kernels (with no dependencies between them) in different streams, they execute mostly or completely serially, with minimal overlap.

According to my understanding, if kernel execution time is significantly longer than kernel launch overhead, shouldn’t we see near-complete overlap in both scenarios for maximum efficiency? But in my tests, when using multiple streams to asynchronously call independent kernels, they consistently execute almost serially, sometimes even completely serially.

Below is test code I’ve used, along with the nsys profiler screenshot. You can see that the third and fourth kernels should execute in parallel, but they’re almost entirely serialized.

#include <stdio.h>
#include <cuda_runtime.h>


#define N (1024*1024*512)  // 512M floats = 2 GiB per buffer; parenthesized so the macro expands safely
#define THREADS_PER_BLOCK 1024


__global__ void kernel_square(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float val = data[idx];
        for(int i=0; i<500; i++) {  
            val = val * val;
        }
        data[idx] = val;
    }
}


__global__ void kernel_cube(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float val = data[idx];
        for(int i=0; i<500; i++) {
            val = val * val * val;
        }
        data[idx] = val;
    }
}

int main() {
    float *d_data1, *d_data2;
    cudaStream_t stream1, stream2;
    cudaEvent_t start, stop;

    #define CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(1); \
        } \
    } while (0)
    
    CHECK(cudaMalloc(&d_data1, N*sizeof(float)));
    CHECK(cudaMalloc(&d_data2, N*sizeof(float)));
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));


    CHECK(cudaEventRecord(start));
    kernel_square<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_data1);
    CHECK(cudaGetLastError()); 
    CHECK(cudaDeviceSynchronize()); 
    
    kernel_cube<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_data2);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    
    float sequential_time;
    CHECK(cudaEventElapsedTime(&sequential_time, start, stop));
    printf("Sequential time: %.3f ms\n", sequential_time);


    CHECK(cudaEventRecord(start));
    kernel_square<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK, 0, stream1>>>(d_data1);
    kernel_cube<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK, 0, stream2>>>(d_data2);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop)); 
    
    float concurrent_time;
    CHECK(cudaEventElapsedTime(&concurrent_time, start, stop));
    printf("Concurrent time: %.3f ms\n", concurrent_time);

    CHECK(cudaFree(d_data1));
    CHECK(cudaFree(d_data2));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    return 0;
}

For comparison, here’s another test where I launch the same kernel in two different streams; you can see they overlap completely. (Sorry, as a new user I’m unable to insert multiple images. What I’m trying to show is two kernels executing with nearly perfect overlap.)

What’s causing this difference in behavior? Is there something I’m missing about how the CUDA scheduler works with different kernel types?

If I want to achieve nearly perfect overlapping execution when asynchronously calling different kernels, what approach should I take?

Using multiple streams does not enforce parallel execution. It merely indicates that the work could be executed in parallel. For example, if a single kernel fully occupies the GPU, there are no resources left to simultaneously run another kernel.
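One way to check whether a single kernel already fills the GPU is the occupancy API. The sketch below is a minimal illustration (kernel body abbreviated, device 0 assumed): if the number of resident blocks the device can hold is far smaller than the grid you launch, the first kernel occupies every SM and a second kernel has nothing to run on until blocks start draining.

```cuda
// Sketch: compare the device's resident-block capacity against the launched grid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_square(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= data[idx];  // stand-in for the real kernel body
}

int main() {
    int device = 0, numSMs = 0, maxBlocksPerSM = 0;
    cudaGetDeviceAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of 1024 threads can be resident per SM for this kernel?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, kernel_square, 1024, 0 /* dynamic smem */);
    int residentBlocks = maxBlocksPerSM * numSMs;
    int launchedBlocks = (512 * 1024 * 1024 + 1023) / 1024;  // the grid used above
    printf("Device holds %d resident blocks; launch uses %d blocks.\n",
           residentBlocks, launchedBlocks);
    // If launchedBlocks >> residentBlocks, don't expect much overlap.
    return 0;
}
```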

I understand what you mean. I’ve tested many examples and can guarantee that my device has sufficient resources to run multiple kernels simultaneously.

What puzzles me is that in one of my tests, I wrote two kernels (named A and B) with identical computational implementations - the only difference being their names. When I used multiple streams to execute them, launching kernel A twice showed perfect overlapping execution according to nsys profiling. However, when I used the same approach to call kernel A once and kernel B once (essentially just changing the kernel name from the previous test), the execution was almost completely serial. This confuses me.

Could it be that when executing the same kernel, it can reuse some resources? But even if that’s the case, it shouldn’t cause completely serial execution just because the kernels have different names, should it?

In recent CUDA versions, kernel modules are loaded lazily. This is explained in the Lazy Loading section of the CUDA C++ Programming Guide.
It can add implicit device synchronization to the first launch of each specific kernel.

To load all kernel modules at program startup, set the environment variable CUDA_MODULE_LOADING=EAGER for your program.
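For reference, the switch is the `CUDA_MODULE_LOADING` environment variable (values `LAZY` or `EAGER`); it must be set in the environment before the CUDA program starts. `./my_app` below is a placeholder for your own binary:

```shell
# Force eager module loading for the child process; must be set before launch.
export CUDA_MODULE_LOADING=EAGER
echo "$CUDA_MODULE_LOADING"
# CUDA_MODULE_LOADING=EAGER ./my_app   # one-shot form, "./my_app" is a placeholder
```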

Thank you very much for your reply! This issue has been bothering me for a long time, and I’m finally getting some help! Thank you so much!!

I’ve examined the lazy loading documentation you mentioned in detail. In my case, the code I’m actually executing isn’t .cu code; it’s generated by MLIR (a compiler framework). However, I believe the execution principles are the same: it ultimately runs by calling cuXxxx() driver API functions, and can use either JIT or static compilation.

I’m not sure whether my understanding is correct: because of lazy loading, a module is only loaded when its kernel first needs to execute, and since cuModuleLoad involves implicit synchronization, loading at launch time would serialize the kernels. Is this understanding correct?

If my understanding is correct, I’ve already considered this issue. What if both kernels belong to the same module? I already load the module with cuModuleLoadData before launching the two kernels. In that case, shouldn’t the implicit synchronization be avoided?
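One subtlety worth noting: under lazy loading, cuModuleLoadData alone may not load the kernel code; each function can still be loaded on its first launch, which is where the implicit synchronization sneaks in. The sketch below is a hedged illustration, not your actual code: it assumes CUDA 12.2+ (for cuFuncLoad), and the names `image`, `matmul0`, and `matmul1` are placeholders for the MLIR-generated cubin/PTX and kernel entry points.

```cuda
#include <cuda.h>

// Force every kernel in a module to finish loading before any launches,
// so the first launch in each stream carries no hidden synchronization.
void preload_kernels(const void* image) {            // image: cubin/PTX blob (placeholder)
    CUmodule mod;
    CUfunction matmul0, matmul1;                     // placeholder kernel names
    cuModuleLoadData(&mod, image);
    cuModuleGetFunction(&matmul0, mod, "matmul0");
    cuModuleGetFunction(&matmul1, mod, "matmul1");
    cuFuncLoad(matmul0);                             // CUDA 12.2+: eagerly load this function
    cuFuncLoad(matmul1);
    // Later cuLaunchKernel calls on separate streams should now be free to overlap.
}
```

Error checking is omitted for brevity; in real code every cu* call's CUresult should be checked.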

Here’s an example I tested, showing kernel execution from nsys:

You can see that the module has already been loaded before the two matmul executions, and there are no API calls during the execution of the two matmul kernels, yet they still execute serially. However, if I change the second kernel to call matmul1, they overlap as expected. (These two kernels are the same as previously mentioned: their computational logic is identical, and only the names differ.)

Based on your perspective or experience, what do you think is causing this issue?

If there’s anything I haven’t described clearly, or if you need additional information from me, please let me know!

I succeeded!
Just as you said, it was indeed caused by lazy loading. It seems I had misunderstood.
Thank you!!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.