Persistent threads for repeated calculations

Hi all,

I have a molecular simulation that involves energy calculation a few thousand times per second. Since launching a kernel with a set of threads takes a lot of overhead, is it possible to start the threads at the beginning of the process and then have them pick up the input as it becomes available and then notify the host code when the processing is done?

I’m trying to do this with flags set in zero-copy memory by the host and the GPU, which indicate whether to wait, execute or exit (flags[0]). Similarly, flags[1] tells the host to pick up the results or wait. At this point, I have some sample code that doesn’t work (goes into infinite loop until Ctrl-C), but is pretty much a starting point.

Any suggestions would be very much appreciated.


#include <iostream>

#define BLOCKS 1

#define THREADS 8

#define SIZE 10

__global__ void kernel(double* input, double* output, int size, int* flags){

	while (1){

		//check flags

		if (flags[0] == 0) continue;

		if (flags[0] == -1) break;

		//flag == 1 - process input

		int pass = 0;

		while (1){

			int step = THREADS;

			int index = pass * step + threadIdx.x;

			if (index >= size) break;

			double v = input[index];

			for (int i = 0; i < 10; i++){

				v *= 2;


			output[index] = v;



		//completed processing

		flags[0] = 0;//reset back to wait

		flags[1] = 1;//ready for the cpu to pick up



int main(){

	//allocate GPU memory, set initial flags and launch the kernel

	double* dev_input;

	double* dev_output;

	cudaMalloc((void**)&dev_input, SIZE * sizeof(double));

	cudaMalloc((void**)&dev_output, SIZE * sizeof(double));

	int* flags;

	int* dev_flags;

	cudaHostAlloc((void**)&flags, 2 * sizeof(int), cudaHostAllocMapped);

	cudaHostGetDevicePointer(&dev_flags, flags, 0);

	flags[0] = 0;

	flags[1] = 0;

	double* input = new double;

	double* output = new double;

	cudaMemcpy(dev_input, input, SIZE * sizeof(double), cudaMemcpyHostToDevice);

	kernel<<<BLOCKS,THREADS>>>(dev_input, dev_output, SIZE, dev_flags);

	//now go into a loop to use the launched threads

	for (int i = 0; i < 10; i++){

		//populate input data

		for (int j = 0; j < SIZE; j++){

			input[j] = i * j;


		flags[1] = 0;//not done yet

		flags[0] = 1;//execute command

		while (1){

			if (flags[1] >= 0) break;//wait for the threads to finish



		//copy results

		cudaMemcpy(output, dev_output, SIZE * sizeof(double), cudaMemcpyDeviceToHost);

		//check output

		for (int j = 0; j < SIZE; j++){

			printf("%lf\t%lf\n", input[j], output[j]);









Question one: is this possible? Yeah, I wrote a quick test to do CPU/GPU handshaking back and forth about a year ago, and it’s feasible but a giant pain to get right (you have to invalidate and flush all GPU caches, which means it’s slow).

Better question: is this technically legal? I’m not sure. If presented with almost any proposed driver feature that could break this functionality, I’d take the other feature. This means that doing what you’re trying to do could die at any time.

Best question: is this the right way to do whatever it is you’re trying to do? Probably not. I’d bet that just launching a new kernel invocation every time is about the same perf as what you’re trying to do, especially if the kernel runtime is not trivially small, and it’s certainly less error prone.

Thanks, tmurray.
A bit depressing, though… Repeated kernel invocations end up taking a long time (10x of the equivalent cpu code). It also varies on different systems: 3-fold difference between a single GTX480 and a Tesla 2050 on a server with a Tyan board (GTX480 is faster).

Looks like I better stick to the current design of individual launches on every iteration and try to find the bottlenecks.


If you’ve got ECC on the Tesla, that might be part of the problem.

You were right. Disabling ECC on the Tesla did bring the total execution time roughly to the level of the GTX480.
Generally speaking, is it acceptable to keep it disabled for the sake of performance? And why does it impose such a huge overhead?
Finally, do you know how much time on average it takes for a kernel call to instantiate the threads?

Thanks again


Kernel launch is ~3us, sync is approximately the same without ECC, with ECC it goes up significantly to make sure that you haven’t hit any ECC errors before the sync (~20us).

Is kernel launch time influenced by the number/size of parameters at all?

More parameters = longer time.

Thanks, tmurray.
This should give me something to work with.