In the example gpu_packet_processing application there is the following code:
/*
* Some GPUs may require an initial warmup without doing any real operation.
*/
DOCA_LOG_INFO("Warm up CUDA kernels");
DOCA_GPUNETIO_VOLATILE(*cpu_exit_condition) = 1;
kernel_receive_udp(rx_udp_stream, gpu_exit_condition, &udp_queues);
kernel_receive_tcp(rx_tcp_stream, gpu_exit_condition, &tcp_queues, app_cfg.http_server);
kernel_receive_icmp(rx_icmp_stream, gpu_exit_condition, &icmp_queues);
if (app_cfg.http_server)
kernel_http_server(tx_http_server, gpu_exit_condition, &tcp_queues, &http_queues);
cudaStreamSynchronize(rx_udp_stream);
cudaStreamSynchronize(rx_tcp_stream);
cudaStreamSynchronize(rx_icmp_stream);
I did not pay much attention to that, but when I tried to create two kernels in my own application, with doca_gpu_semaphore as the means of synchronization between them, I noticed the following:
- The kernels can be launched asynchronously so that the host code continues execution,
- but without warmup, the kernels won't run in parallel. The kernel that was launched second does nothing until the first kernel completes, which it never does, since it reads packets in an infinite while loop (with an exit condition flag).
- If I run the kernels so that they immediately finish and then launch them both again, during this second launch they do run in parallel.
Here is my test code:
#include <cstdint>
#include <cstdio>
#include <time.h>  // nanosleep

__global__ void foo(volatile uint32_t* runFlag) {
printf("Kernel foo\n");
while (*runFlag) {
}
}
__global__ void bar(volatile uint32_t* runFlag) {
printf("Kernel bar\n");
while (*runFlag) {
}
}
/**
 * If CUDA_MODULE_LOADING is set to EAGER, the kernels run in parallel during the first launch.
 * If CUDA_MODULE_LOADING is set to LAZY (the default), the kernels run sequentially during the first launch.
 *
 * In both cases the second launch causes them to run in parallel.
 */
void launchParallelTest() {
cudaStream_t streamFoo, streamBar;
cudaStreamCreateWithFlags(&streamFoo, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&streamBar, cudaStreamNonBlocking);
uint32_t* runFlag;
cudaMallocManaged(reinterpret_cast<void **>(&runFlag), sizeof(uint32_t), cudaMemAttachGlobal);
printf("------------\n"
"First launch\n");
runFlag[0] = 1;
// Launch both kernels on separate non-blocking streams; the third launch
// parameter (100) is dynamic shared memory in bytes, unused by these kernels.
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
timespec ts = {};
timespec rs = {};
ts.tv_sec = 5;
ts.tv_nsec = 0;
printf("CPU sleep\n");
nanosleep(&ts, &rs);
printf("CPU continue\n");
runFlag[0] = 0;
cudaStreamSynchronize(streamFoo);
printf("stream synchronized: foo\n");
cudaStreamSynchronize(streamBar);
printf("stream synchronized: bar\n");
printf("------------\n"
"Second launch\n");
runFlag[0] = 1;
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
printf("CPU sleep\n");
nanosleep(&ts, &rs);
printf("CPU continue\n");
runFlag[0] = 0;
cudaStreamSynchronize(streamFoo);
printf("stream synchronized: foo\n");
cudaStreamSynchronize(streamBar);
printf("stream synchronized: bar\n");
exit(0);
}
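Note that the same binary is used for both runs below; the only thing that changes between them is the CUDA_MODULE_LOADING environment variable.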
If the environment variable CUDA_MODULE_LOADING is not set, the above code prints out:
------------
First launch
CPU sleep
Kernel foo
CPU continue
stream synchronized: foo
Kernel bar
stream synchronized: bar
------------
Second launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
In the output above, during the first launch the second kernel does nothing until the first one has been synchronized with. During the second launch they work as I expected.
If the environment variable CUDA_MODULE_LOADING is set to EAGER, the above code prints out:
------------
First launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
------------
Second launch
CPU sleep
Kernel foo
Kernel bar
CPU continue
stream synchronized: foo
stream synchronized: bar
In this case both kernels run in parallel already during the first launch.
I found CUDA_MODULE_LOADING after some Googling, via the forum post Concurrent kernel execution:
First of all, there is no guarantee by CUDA that two kernels in different streams will execute concurrently.
Your specific problem is probably caused by lazy loading. Simply speaking, k1 needs to complete before k2 can be loaded, but k1 cannot complete unless k2 is complete, which creates a deadlock.
Try setting the environment variable CUDA_MODULE_LOADING to EAGER.
There is also a reference to the CUDA manual, CUDA: Concurrent execution, which says:
Loading kernels might require context synchronization. Some programs incorrectly treat the possibility of concurrent execution of kernels as a guarantee. In such cases, if program assumes that two kernels will be able to execute concurrently, and one of the kernels will not return without the other kernel executing, there is a possibility of a deadlock.
If kernel A will be spinning in an infinite loop until kernel B is executing. In such case launching kernel B will trigger lazy loading of kernel B. If this loading will require context synchronization, then we have a deadlock: kernel A is waiting for kernel B, but loading kernel B is stuck waiting for kernel A to finish to synchronize the context.
Such program is an anti-pattern, but if for any reason you want to keep it you can do the following:
- preload all kernels that you hope to execute concurrently prior to launching them
- run application with CUDA_MODULE_DATA_LOADING=EAGER to force loading data eagerly without forcing each function to load eagerly
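For the preloading option, my understanding is that in the runtime API a kernel can be force-loaded by querying its attributes before the first launch. A rough sketch with the foo/bar kernels from my test code (preloadKernels is just an illustrative helper name):

void preloadKernels() {
    cudaFuncAttributes attr;
    // Querying a kernel's attributes triggers loading of the module that
    // contains it, so a later launch cannot block behind another kernel.
    cudaFuncGetAttributes(&attr, foo);
    cudaFuncGetAttributes(&attr, bar);
}

Calling this at the start of launchParallelTest() should make the first launch behave like the second one, even under lazy loading.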
Setting the environment variable CUDA_MODULE_LOADING to EAGER indeed seems to fix the problem, without the need for the warmup kernel runs.
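To verify which mode is actually active, the driver API also exposes cuModuleGetLoadingMode (CUDA 11.7+, if I read the docs correctly). A minimal check, linked against libcuda:

#include <cuda.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUmoduleLoadingMode mode;
    // Reports whether this process uses lazy or eager module loading.
    if (cuModuleGetLoadingMode(&mode) == CUDA_SUCCESS)
        printf("module loading: %s\n",
               mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
    return 0;
}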
I assume the warmup runs of the kernels cause the module loading, so that the next time they are launched they do run in parallel.
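In terms of my test code, I believe the equivalent warmup would be to launch both kernels once with the flag cleared so they return immediately, mirroring what gpu_packet_processing does with gpu_exit_condition:

runFlag[0] = 0;  // kernels fall through their while loop immediately
foo<<<1, 1, 100, streamFoo>>>(runFlag);
bar<<<1, 1, 100, streamBar>>>(runFlag);
cudaStreamSynchronize(streamFoo);  // both modules are loaded after this
cudaStreamSynchronize(streamBar);
runFlag[0] = 1;  // the real launches that follow can now run concurrently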
I’m still confused about whether assuming that one can run kernels in parallel is an “anti-pattern” or not. The DOCA documentation suggests it is the way to go, but the CUDA documentation does not.
And should the DOCA documentation directly suggest setting CUDA_MODULE_LOADING to EAGER when using GPUNetIO with semaphores and running multiple kernels in parallel?