In the example gpu_packet_processing application there is the following code:
/*
* Some GPUs may require an initial warmup without doing any real operation.
*/
DOCA_LOG_INFO("Warm up CUDA kernels");
DOCA_GPUNETIO_VOLATILE(*cpu_exit_condition) = 1;
kernel_receive_udp(rx_udp_stream, gpu_exit_condition, &udp_queues);
kernel_receive_tcp(rx_tcp_stream, gpu_exit_condition, &tcp_queues, app_cfg.http_server);
kernel_receive_icmp(rx_icmp_stream, gpu_exit_condition, &icmp_queues);
if (app_cfg.http_server)
kernel_http_server(tx_http_server, gpu_exit_condition, &tcp_queues, &http_queues);
cudaStreamSynchronize(rx_udp_stream);
cudaStreamSynchronize(rx_tcp_stream);
cudaStreamSynchronize(rx_icmp_stream);
I did not pay much attention to that, but when I tried to create two kernels in my own application, with doca_gpu_semaphore as the means of synchronization between them, I noticed the following:
- The kernels can be launched asynchronously so that the host code continues execution,
- but without warmup, the kernels won't run in parallel. The kernel that was launched second does nothing until the first kernel completes, which it never does, since it reads packets in an infinite while loop (with an exit condition flag).
- If I run the kernels so that they immediately finish and then launch them both again, during this second launch they do run in parallel.
Here is my test code:
#include <cstdint>
#include <cstdio>
#include <time.h>  // nanosleep

__global__ void foo(volatile uint32_t* runFlag) {
printf("Kernel foo\n");
while (*runFlag) {
}
}
__global__ void bar(volatile uint32_t* runFlag) {
printf("Kernel bar\n");
while (*runFlag) {
}
}
/**
 * If CUDA_MODULE_LOADING is set to EAGER, the kernels run in parallel during the first launch.
 * If CUDA_MODULE_LOADING is set to LAZY (the default), the kernels run sequentially during the first launch.
 *
 * In both cases the second launch causes them to run in parallel.
 */
void launchParallelTest() {
cudaStream_t streamFoo, streamBar;
cudaStreamCreateWithFlags(&streamFoo, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&streamBar, cudaStreamNonBlocking);
uint32_t* runFlag;
cudaMallocManaged(reinterpret_cast<void **>(&runFlag), sizeof(uint32_t), cudaMemAttachGlobal);
printf("------------\n"
"First launch\n");
runFlag[0] = 1;
// Launch both kernels on separate non-blocking streams; the third launch
// parameter (100) is dynamic shared memory in bytes, unused by these kernels.
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
timespec ts = {};
timespec rs = {};
ts.tv_sec = 5;
ts.tv_nsec = 0;
printf("CPU sleep\n");
nanosleep(&ts, &rs);
printf("CPU continue\n");
runFlag[0] = 0;
cudaStreamSynchronize(streamFoo);
printf("stream synchronized: foo\n");
cudaStreamSynchronize(streamBar);
printf("stream synchronized: bar\n");
printf("------------\n"
"Second launch\n");
runFlag[0] = 1;
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
printf("CPU sleep\n");
nanosleep(&ts, &rs);
printf("CPU continue\n");
runFlag[0] = 0;
cudaStreamSynchronize(streamFoo);
printf("stream synchronized: foo\n");
cudaStreamSynchronize(streamBar);
printf("stream synchronized: bar\n");
exit(0);
}
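Note that the same binary is used for both runs below; the only thing that changes between them is the CUDA_MODULE_LOADING environment variable.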
If the environment variable CUDA_MODULE_LOADING is not set, the above code prints out:
------------
First launch
CPU sleep
Kernel foo
CPU continue
stream synchronized: foo
Kernel bar
stream synchronized: bar
------------
Second launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
In the output above, during the first launch the second kernel does nothing until the first one has been synchronized with. During the second launch they work as I expected.
If the environment variable CUDA_MODULE_LOADING is set to EAGER, the above code prints out:
------------
First launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
------------
Second launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
In this case both kernels run in parallel already during the first launch.
I found CUDA_MODULE_LOADING after some Googling, via the forum post Concurrent kernel execution:
First of all, there is no guarantee by CUDA that two kernels in different streams will execute concurrently.
Your specific problem is probably caused by lazy loading. Simply speaking, k1 needs to complete before k2 can be loaded, but k1 cannot complete unless k2 is complete, which creates a deadlock.
Try setting the environment variable CUDA_MODULE_LOADING to EAGER.
There is also a reference to the CUDA manual, CUDA: Concurrent execution, which says:
Loading kernels might require context synchronization. Some programs incorrectly treat the possibility of concurrent execution of kernels as a guarantee. In such cases, if program assumes that two kernels will be able to execute concurrently, and one of the kernels will not return without the other kernel executing, there is a possibility of a deadlock.
If kernel A will be spinning in an infinite loop until kernel B is executing. In such case launching kernel B will trigger lazy loading of kernel B. If this loading will require context synchronization, then we have a deadlock: kernel A is waiting for kernel B, but loading kernel B is stuck waiting for kernel A to finish to synchronize the context.
Such program is an anti-pattern, but if for any reason you want to keep it you can do the following:
- preload all kernels that you hope to execute concurrently prior to launching them
- run application with CUDA_MODULE_DATA_LOADING=EAGER to force loading data eagerly without forcing each function to load eagerly
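For the preloading option, my understanding is that in the runtime API a kernel can be force-loaded by querying its attributes before the first launch. A rough sketch with the foo/bar kernels from my test code (preloadKernels is just an illustrative helper name):

void preloadKernels() {
    cudaFuncAttributes attr;
    // Querying a kernel's attributes triggers loading of the module that
    // contains it, so a later launch cannot block behind another kernel.
    cudaFuncGetAttributes(&attr, foo);
    cudaFuncGetAttributes(&attr, bar);
}

Calling this at the start of launchParallelTest() should make the first launch behave like the second one, even under lazy loading.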
Setting the environment variable CUDA_MODULE_LOADING to EAGER indeed seems to fix the problem, without the need for the warmup kernel runs.
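To verify which mode is actually active, the driver API also exposes cuModuleGetLoadingMode (CUDA 11.7+, if I read the docs correctly). A minimal check, linked against libcuda:

#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUmoduleLoadingMode mode;
    // Reports whether this process uses lazy or eager module loading.
    if (cuModuleGetLoadingMode(&mode) == CUDA_SUCCESS)
        printf("module loading: %s\n",
               mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
    return 0;
}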
I assume the warmup runs of the kernels cause the module loading, so that the next time they are launched they do run in parallel.
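In terms of my test code, I believe the equivalent warmup would be to launch both kernels once with the flag cleared so they return immediately, mirroring what gpu_packet_processing does with gpu_exit_condition:

runFlag[0] = 0;  // kernels fall through their while loop immediately
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
cudaStreamSynchronize(streamFoo);  // both modules are loaded after this
cudaStreamSynchronize(streamBar);
runFlag[0] = 1;  // the real launches that follow can now run concurrently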
I’m still confused about whether assuming that one can run kernels in parallel is an “anti-pattern” or not. The DOCA documentation suggests it is the way to go, but the CUDA documentation does not.
And should the DOCA documentation directly suggest setting CUDA_MODULE_LOADING to EAGER when using GPUNetIO with semaphores and running multiple kernels in parallel?