Kernel doesn't start while persistent kernel is running

Hi

I have two kernels; the first one is persistent and runs constantly waiting for new tasks; the second one is normal and is started every time a task arrives. Both kernels operate on different streams.

It turns out that the second kernel won’t start until the first kernel ends. It looks like launching a new kernel requires some synchronization with already-running kernels, even though they operate on different streams. I prepared an example that demonstrates the problem:

#include <bits/stdc++.h>
#include <unistd.h>
#include <emmintrin.h>

__global__
void poll_kernel(volatile int *end_flag) {
	while (!*end_flag) {
		__threadfence_system();
	}
}

__global__
void query() {
	printf("hello\n");
}

int main(int argc, char **argv) {
	cudaStream_t stream[2];
	cudaStreamCreate(&stream[0]);
	cudaStreamCreate(&stream[1]);

	volatile int *end_flag_cpu;
	if (cudaMallocHost(&end_flag_cpu, sizeof(*end_flag_cpu)) != cudaSuccess)
		throw std::runtime_error("Failed to allocate host pinned memory.");
	*end_flag_cpu = 0;
	_mm_mfence();

	poll_kernel<<<1, 32, 0, stream[0]>>>(end_flag_cpu);  // persistent kernel on stream 0
	sleep(1);
	query<<<1, 1, 0, stream[1]>>>();  // should run concurrently on stream 1
	sleep(2);
	*end_flag_cpu = 1;  // signal poll_kernel to exit
	_mm_mfence();
	cudaDeviceSynchronize();
}

“hello” should be printed after one second. Instead, it is printed after 3 seconds, right after poll_kernel ends.

I modified the code a little and wrapped the query launch in a CUDA graph:

#include <bits/stdc++.h>
#include <unistd.h>
#include <emmintrin.h>

__global__
void poll_kernel(volatile int *end_flag) {
	while (!*end_flag) {
		__threadfence_system();
	}
}

__global__
void query() {
	printf("hello\n");
}

int main(int argc, char **argv) {
	cudaStream_t stream[2];
	cudaStreamCreate(&stream[0]);
	cudaStreamCreate(&stream[1]);

	cudaGraph_t graph;
	cudaGraphExec_t instance;

	// Capture the query launch into a graph once, up front.
	cudaStreamBeginCapture(stream[1], cudaStreamCaptureModeGlobal);

	query<<<1, 1, 0, stream[1]>>>();

	cudaStreamEndCapture(stream[1], &graph);
	cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

	volatile int *end_flag_cpu;
	if (cudaMallocHost(&end_flag_cpu, sizeof(*end_flag_cpu)) != cudaSuccess)
		throw std::runtime_error("Failed to allocate host pinned memory.");
	*end_flag_cpu = 0;
	_mm_mfence();

	poll_kernel<<<1, 32, 0, stream[0]>>>(end_flag_cpu);  // persistent kernel on stream 0
	sleep(1);
	cudaGraphLaunch(instance, stream[1]);  // launch the pre-instantiated graph
	sleep(2);
	*end_flag_cpu = 1;  // signal poll_kernel to exit
	_mm_mfence();
	cudaDeviceSynchronize();
}

And this time it worked as intended: “hello” appeared after 1 second, while poll_kernel was still running.

I have three questions regarding this issue:

  1. Why doesn’t the first code work as intended (“hello” doesn’t appear after 1 second)?
  2. Why does the second code fix the problem? I suspect that preparing the graph somehow pre-loads the kernel, which later allows for faster execution and avoids the synchronization present in the first example.
  3. What is the correct way of executing kernels while a persistent kernel is running?

The synchronization requirements and behavior for in-kernel printf are not specified well. I would not use in-kernel printf for analyzing this kind of concurrency.

If it were me, I would use some other method to determine that the query kernel executed. According to my tests, removing the in-kernel printf removes the unexpected behavior here. And yes, graphs seem to affect it. Again, the sync behavior of in-kernel printf is not specified well enough to answer this, IMO.
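
One way to follow that advice is to have query write to a second pinned flag instead of calling printf, so the host can observe completion while poll_kernel is still spinning. A minimal sketch adapting the first example above (done_flag_cpu is a name I introduced for illustration; this is not a tested fix, just the suggested measurement method):

```cuda
#include <cstdio>
#include <stdexcept>
#include <unistd.h>
#include <emmintrin.h>

__global__
void poll_kernel(volatile int *end_flag) {
	// Persistent kernel: spin until the host sets the flag.
	while (!*end_flag) {
		__threadfence_system();
	}
}

__global__
void query(volatile int *done_flag) {
	// Signal completion through pinned memory instead of printf.
	*done_flag = 1;
	__threadfence_system();
}

int main() {
	cudaStream_t stream[2];
	cudaStreamCreate(&stream[0]);
	cudaStreamCreate(&stream[1]);

	volatile int *end_flag_cpu, *done_flag_cpu;
	if (cudaMallocHost(&end_flag_cpu, sizeof(*end_flag_cpu)) != cudaSuccess ||
	    cudaMallocHost(&done_flag_cpu, sizeof(*done_flag_cpu)) != cudaSuccess)
		throw std::runtime_error("Failed to allocate host pinned memory.");
	*end_flag_cpu = 0;
	*done_flag_cpu = 0;
	_mm_mfence();

	poll_kernel<<<1, 32, 0, stream[0]>>>(end_flag_cpu);
	sleep(1);
	query<<<1, 1, 0, stream[1]>>>(done_flag_cpu);
	sleep(2);
	// If the kernels really ran concurrently, the flag is already set
	// here, while poll_kernel is still spinning.
	printf("done_flag while poll_kernel still running: %d\n", *done_flag_cpu);
	*end_flag_cpu = 1;
	_mm_mfence();
	cudaDeviceSynchronize();
}
```

This sidesteps any unspecified synchronization in the in-kernel printf path: the only device-to-host communication is a plain store to mapped pinned memory, which is what poll_kernel itself already relies on.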