Why do different kernels in different streams run nearly serially, while the same kernel in different streams overlaps perfectly?

I’m a CUDA beginner, and I’ve encountered behavior that seems counter-intuitive. When I use multiple streams to execute the same kernel on different data, I observe excellent overlapping execution. However, when I launch different kernels (with no dependencies between them) in different streams, they execute mostly or completely serially, with minimal overlap.

According to my understanding, if kernel execution time is significantly longer than kernel launch overhead, shouldn’t we see near-complete overlap in both scenarios for maximum efficiency? But in my tests, when using multiple streams to asynchronously call independent kernels, they consistently execute almost serially, sometimes even completely serially.

Below is test code I’ve used, along with the nsys profiler screenshot. You can see that the third and fourth kernels should execute in parallel, but they’re almost entirely serialized.

#include <stdio.h>
#include <cuda_runtime.h>


#define N (1024*1024*512)  // 512M floats = 2 GiB per buffer; parenthesized so the macro expands safely
#define THREADS_PER_BLOCK 1024


__global__ void kernel_square(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float val = data[idx];
        for(int i=0; i<500; i++) {  
            val = val * val;
        }
        data[idx] = val;
    }
}


__global__ void kernel_cube(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float val = data[idx];
        for(int i=0; i<500; i++) {
            val = val * val * val;
        }
        data[idx] = val;
    }
}

int main() {
    float *d_data1, *d_data2;
    cudaStream_t stream1, stream2;
    cudaEvent_t start, stop;

    #define CHECK(call) do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            printf("CUDA Error at %s:%d - %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(1); \
        } \
    } while (0)
    
    CHECK(cudaMalloc(&d_data1, N*sizeof(float)));
    CHECK(cudaMalloc(&d_data2, N*sizeof(float)));
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));


    CHECK(cudaEventRecord(start));
    kernel_square<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_data1);
    CHECK(cudaGetLastError()); 
    CHECK(cudaDeviceSynchronize()); 
    
    kernel_cube<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_data2);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));
    
    float sequential_time;
    CHECK(cudaEventElapsedTime(&sequential_time, start, stop));
    printf("Sequential time: %.3f ms\n", sequential_time);


    CHECK(cudaEventRecord(start));
    kernel_square<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK, 0, stream1>>>(d_data1);
    kernel_cube<<<(N + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK, 0, stream2>>>(d_data2);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop)); 
    
    float concurrent_time;
    CHECK(cudaEventElapsedTime(&concurrent_time, start, stop));
    printf("Concurrent time: %.3f ms\n", concurrent_time);

    CHECK(cudaFree(d_data1));
    CHECK(cudaFree(d_data2));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    return 0;
}

For comparison, here’s another test where I launch the same kernel in two different streams; you can see they overlap completely. (Sorry, as a new user I’m unable to insert multiple images. What I’m trying to show is two kernels executing with nearly perfect overlap.)

What’s causing this difference in behavior? Is there something I’m missing about how the CUDA scheduler works with different kernel types?

If I want to achieve nearly perfect overlapping execution when asynchronously calling different kernels, what approach should I take?

Using multiple streams does not enforce parallel execution. It merely indicates that the work could be executed in parallel. For example, if a single kernel fully occupies the GPU, there are no resources left to simultaneously run another kernel.
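One way to check whether a single kernel already fills the GPU is the occupancy API. The sketch below is a minimal illustration (kernel body abbreviated, device 0 assumed): if the number of resident blocks the device can hold is far smaller than the grid you launch, the first kernel occupies every SM and a second kernel has nothing to run on until blocks start draining.

```cuda
// Sketch: compare the device's resident-block capacity against the launched grid.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel_square(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] *= data[idx];  // stand-in for the real kernel body
}

int main() {
    int device = 0, numSMs = 0, maxBlocksPerSM = 0;
    cudaGetDeviceAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of 1024 threads can be resident per SM for this kernel?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, kernel_square, 1024, 0 /* dynamic smem */);
    int residentBlocks = maxBlocksPerSM * numSMs;
    int launchedBlocks = (512 * 1024 * 1024 + 1023) / 1024;  // the grid used above
    printf("Device holds %d resident blocks; launch uses %d blocks.\n",
           residentBlocks, launchedBlocks);
    // If launchedBlocks >> residentBlocks, don't expect much overlap.
    return 0;
}
```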

I understand what you mean. I’ve tested many examples and can guarantee that my device has sufficient resources to run multiple kernels simultaneously.

What puzzles me is that in one of my tests, I wrote two kernels (named A and B) with identical computational implementations - the only difference being their names. When I used multiple streams to execute them, launching kernel A twice showed perfect overlapping execution according to nsys profiling. However, when I used the same approach to call kernel A once and kernel B once (essentially just changing the kernel name from the previous test), the execution was almost completely serial. This confuses me.

Could it be that when executing the same kernel, it can reuse some resources? But even if that’s the case, it shouldn’t cause completely serial execution just because the kernels have different names, should it?

In recent CUDA versions, kernel modules are loaded lazily. This is explained in the Lazy Loading section of the CUDA C++ Programming Guide.
It can add implicit device synchronization to the first launch of each specific kernel.

To load all kernel modules at program startup, set the environment variable CUDA_MODULE_LOADING=EAGER for your program.
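For reference, the switch is the `CUDA_MODULE_LOADING` environment variable (values `LAZY` or `EAGER`); it must be set in the environment before the CUDA program starts. `./my_app` below is a placeholder for your own binary:

```shell
# Force eager module loading for the child process; must be set before launch.
export CUDA_MODULE_LOADING=EAGER
echo "$CUDA_MODULE_LOADING"
# CUDA_MODULE_LOADING=EAGER ./my_app   # one-shot form, "./my_app" is a placeholder
```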

Thank you very much for your reply! This issue has been bothering me for a long time, and I’m finally getting some help! Thank you so much!!

I’ve examined the lazy loading documentation you mentioned in detail. In my case, the code I’m actually executing isn’t .cu code; it’s generated by MLIR (a compiler framework). However, I believe the execution principles are the same: it ultimately runs by calling cuXxxx() driver API functions, and can use either JIT or static compilation.

I’m not sure whether my understanding is correct: because of lazy loading, a module is only loaded when its kernel first needs to execute, and since cuModuleLoad involves implicit synchronization, loading at launch time would serialize the kernels. Is this understanding correct?

If my understanding is correct, I’ve already considered this issue. What if both kernels belong to the same module? I already load the module with cuModuleLoadData before launching the two kernels. In that case, shouldn’t the implicit synchronization be avoided?
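One subtlety worth noting: under lazy loading, cuModuleLoadData alone may not load the kernel code; each function can still be loaded on its first launch, which is where the implicit synchronization sneaks in. The sketch below is a hedged illustration, not your actual code: it assumes CUDA 12.2+ (for cuFuncLoad), and the names `image`, `matmul0`, and `matmul1` are placeholders for the MLIR-generated cubin/PTX and kernel entry points.

```cuda
#include <cuda.h>

// Force every kernel in a module to finish loading before any launches,
// so the first launch in each stream carries no hidden synchronization.
void preload_kernels(const void* image) {            // image: cubin/PTX blob (placeholder)
    CUmodule mod;
    CUfunction matmul0, matmul1;                     // placeholder kernel names
    cuModuleLoadData(&mod, image);
    cuModuleGetFunction(&matmul0, mod, "matmul0");
    cuModuleGetFunction(&matmul1, mod, "matmul1");
    cuFuncLoad(matmul0);                             // CUDA 12.2+: eagerly load this function
    cuFuncLoad(matmul1);
    // Later cuLaunchKernel calls on separate streams should now be free to overlap.
}
```

Error checking is omitted for brevity; in real code every cu* call's CUresult should be checked.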

Here’s an example I tested, showing kernel execution from nsys:

You can see that the module has already been loaded before the two matmul executions, and there are no API calls during the execution of the two matmul kernels, yet they still execute serially. However, if I change the second kernel to call matmul1, they overlap as expected. (These two kernels are the same as previously mentioned: their computational logic is identical, and only the names differ.)

Based on your perspective or experience, what do you think is causing this issue?

If there’s anything I haven’t described clearly, or if you need additional information from me, please let me know!

I succeeded!
Just as you said, it was indeed caused by lazy loading. It seems I had misunderstood.
Thank you!!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.