Kernels in CUDA streams do not seem to run in parallel

The code is from Professional CUDA C Programming:

#include "../common/common.h"
#include <stdio.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/*
 * This example demonstrates submitting work to a CUDA stream in breadth-first
 * order. Work submission in breadth-first order prevents false dependencies
 * from reducing the parallelism of an application. kernel_1, kernel_2,
 * kernel_3, and kernel_4 simply implement identical, dummy computation.
 * Separate kernels are used to make the scheduling of these kernels simpler to
 * visualize in the Visual Profiler.
 */

#define N 300000
#define NSTREAM 4

__global__ void kernel_1()
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
    {
        sum = sum + tan(0.1) * tan(0.1);
    }
}

__global__ void kernel_2()
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
    {
        sum = sum + tan(0.1) * tan(0.1);
    }
}

__global__ void kernel_3()
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
    {
        sum = sum + tan(0.1) * tan(0.1);
    }
}

__global__ void kernel_4()
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
    {
        sum = sum + tan(0.1) * tan(0.1);
    }
}

int main(int argc, char **argv)
{
    int n_streams = NSTREAM;
    int isize = 1;
    int iblock = 1;
    int bigcase = 0;

    // get argument from command line
    if (argc > 1) n_streams = atoi(argv[1]);

    if (argc > 2) bigcase = atoi(argv[2]);

    float elapsed_time;

    // set up max connection
    char *iname = "CUDA_DEVICE_MAX_CONNECTIONS";
    setenv (iname, "32", 1);
    char *ivalue =  getenv (iname);
    printf ("%s = %s\n", iname, ivalue);

    int dev = 0;
    cudaDeviceProp deviceProp;
    CHECK(cudaGetDeviceProperties(&deviceProp, dev));
    printf("> Using Device %d: %s with num_streams %d\n", dev, deviceProp.name,
           n_streams);

    // check if device supports HyperQ
    if (deviceProp.major < 3 || (deviceProp.major == 3 && deviceProp.minor < 5))
    {
        if (deviceProp.concurrentKernels == 0)
        {
            printf("> GPU does not support concurrent kernel execution (SM 3.5 "
                   "or higher required)\n");
            printf("> CUDA kernel runs will be serialized\n");
        }
        else
        {
            printf("> GPU does not support HyperQ\n");
            printf("> CUDA kernel runs will have limited concurrency\n");
        }
    }

    printf("> Compute Capability %d.%d hardware with %d multi-processors\n",
           deviceProp.major, deviceProp.minor, deviceProp.multiProcessorCount);

    // Allocate and initialize an array of stream handles
    cudaStream_t *streams = (cudaStream_t *) malloc(n_streams * sizeof(
                                cudaStream_t));

    for (int i = 0 ; i < n_streams ; i++)
    {
        CHECK(cudaStreamCreate(&(streams[i])));
    }
    // run kernel with more threads
    if (bigcase == 1)
    {
        iblock = 512;
        isize = 1 << 12;
    }

    // set up execution configuration
    dim3 block (iblock);
    dim3 grid  (isize / iblock);
    printf("> grid %d block %d\n", grid.x, block.x);

    // create events
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // record start event
    CHECK(cudaEventRecord(start, 0));

    // dispatch job with breadth first ordering
    for (int i = 0; i < n_streams; i++)
        kernel_1<<<grid, block, 0, streams[i]>>>();

    for (int i = 0; i < n_streams; i++)
        kernel_2<<<grid, block, 0, streams[i]>>>();

    for (int i = 0; i < n_streams; i++)
        kernel_3<<<grid, block, 0, streams[i]>>>();

    for (int i = 0; i < n_streams; i++)
        kernel_4<<<grid, block, 0, streams[i]>>>();

    // record stop event
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    // calculate elapsed time
    CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
    printf("Measured time for parallel execution = %.3fs\n",
           elapsed_time / 1000.0f);

    // release all streams
    for (int i = 0 ; i < n_streams ; i++)
    {
        CHECK(cudaStreamDestroy(streams[i]));
    }

    free(streams);

    // destroy events
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));

    // reset device
    CHECK(cudaDeviceReset());

    return 0;
}

The kernels in different streams are supposed to run concurrently, but Nsight Systems shows that they do not. Device: RTX 4060 Laptop GPU.
Does the laptop GPU support HyperQ? Or do non-professional GPUs not support HyperQ? Or is the way I do profiling with nsys-ui incorrect?

Thanks a lot.

How are you running Nsight Systems?

Thanks for your reply. From the desktop file (UI) on Ubuntu 22.04.

@liuyis can you take a look at this one too?


Hi @Levi-Z , it seems these kernels run too fast, and their execution time is way less than the CPU-side launch time, so the next kernel hasn’t had a chance to start running before the previous one finishes.

I compiled this sample app and profiled it on my local system. With an RTX 3060 GPU, the kernel takes less than 1us to execute:


But the CPU-side kernel launch call takes more than 3us:


It’s probably a mix of increased GPU compute power and smarter compiler optimization that causes these kernels to run much faster than when this sample application was introduced.
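One way to confirm this on your own machine is to time the launch call on the CPU with `std::chrono` and the kernel itself with CUDA events. This is a minimal sketch, not from the thread; the `dummy` kernel is a hypothetical stand-in for the book's (optimized-away) kernels:

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for the book's dummy kernels, which the compiler reduces to
// null kernels anyway.
__global__ void dummy() { }

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up so first-launch overhead doesn't skew the measurement.
    dummy<<<1, 1>>>();
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    cudaEventRecord(start);
    dummy<<<1, 1>>>();
    // The launch is asynchronous: this returns as soon as it is queued,
    // so t1 - t0 approximates the CPU-side launch cost.
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);
    printf("CPU launch call: %.2f us, GPU execution: %.2f us\n",
           std::chrono::duration<double, std::micro>(t1 - t0).count(),
           gpu_ms * 1000.0);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If the GPU time comes out well below the CPU launch time, each kernel finishes before the next launch can even be issued, so no overlap is possible.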

I tried to find a different dummy kernel implementation online that takes longer to run, and I can now observe overlapped execution. Here’s the new kernel implementation, which takes about 100ms to finish on my RTX 3060 GPU:

__global__ void kernel_1()
{
    float dummyValue = 1.0f;
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    // This loop is intentionally designed to waste time.
    // The actual number of iterations needed to reach 100ms will depend on
    // your GPU's performance. You might need to adjust this value.
    for (long long i = 0; i < 1000000; i++) {
        dummyValue += sinf(dummyValue) * tanf(dummyValue);
    }

    // Prevent the compiler from optimizing out the dummy loop.
    if (dummyValue == 2.0f) {
        printf("This won't happen, but it prevents optimization: %f\n", dummyValue);
    }
}
Here’s the profiling result:

As you can see, all four streams are now running kernels simultaneously.


The kernels have no side effects, so they are reduced to null kernels:

LDC R1, c[0x0][0x28]
BRA `(.L_x_0)

Typical launch overhead is 2-8 microseconds. @liuyis correctly showed that fixing the kernel so that the compiler does not optimize it to a null kernel produces the expected result.
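Another common way to keep such a kernel from being optimized to a null kernel is to give it an observable side effect by storing its result to global memory. A hedged sketch (the `d_out` buffer and the `kernel_1_keep` name are mine, not from the thread; `N` is the macro from the original program):

```cuda
// Variant of the book's kernel_1 with a side effect: the store to d_out
// is observable, so the compiler must keep the whole computation.
__global__ void kernel_1_keep(double *d_out)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        sum = sum + tan(0.1) * tan(0.1);

    // One element per thread; d_out must be cudaMalloc'd with room for
    // gridDim.x * blockDim.x doubles.
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}
```

This avoids the device-side `printf` of the workaround above while still defeating dead-code elimination.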


Thanks for all of your detailed demonstrations. GPU compute power and compiler optimization have improved so much that the code from the ‘old’ book may no longer work as expected.

And I verified the idea from @liuyis on my device:

  1. The original kernel launch takes about 3.5us.

  2. The original dummy kernel’s duration is less than 1us (992ns).

  3. After changing the dummy kernel (more computation and avoiding compiler optimization), everything works as expected.

And an additional question about @Greg 's answer: I do see the same SASS code for the original kernel in Nsight Compute. I’m wondering when the compiler would optimize a kernel down to a null kernel (is there any documented programming guide for that)? I couldn’t find it in the official C++ programming guide. If you could point out some references for searching and learning, I would be very grateful.

Thanks again for all your help.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

If you are compiling to SASS, this optimization is done by nvcc. If you are compiling to PTX, then it is done during JIT compilation.

The empty kernel (the code snippet I added for kernel_1()) still has a duration, but it should be very small, resulting in very little observed parallelism.

CUDA 12.4 introduces CUDA Green Contexts (CUDA Driver API :: CUDA Toolkit Documentation). Green contexts allow increased control of SM resources (CU_DEV_RESOURCE_TYPE_SM) to improve concurrent execution.
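For reference, the green-context flow looks roughly like this. This is a hedged sketch based on my reading of the CUDA 12.4 driver API; check the Driver API reference for the exact signatures, and `CHECK_CU` is a hypothetical error-checking macro:

```cuda
#include <cuda.h>

// Partition the device's SMs and create a stream bound to one partition.
CUdevResource input, remaining, split[2];
unsigned int nbGroups = 2;

CHECK_CU(cuInit(0));
CHECK_CU(cuDeviceGetDevResource(0, &input, CU_DEV_RESOURCE_TYPE_SM));

// Split the device's SMs into two groups of at least 8 SMs each.
CHECK_CU(cuDevSmResourceSplitByCount(split, &nbGroups, &input, &remaining, 0, 8));

// Build a green context over the first group and create a stream in it.
CUdevResourceDesc desc;
CUgreenCtx gctx;
CUstream stream;
CHECK_CU(cuDevResourceGenerateDesc(&desc, &split[0], 1));
CHECK_CU(cuGreenCtxCreate(&gctx, desc, 0, CU_GREEN_CTX_DEFAULT_STREAM));
CHECK_CU(cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0));
```

Kernels launched into `stream` then execute only on that SM partition, which can reduce interference between concurrent workloads.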
