Callbacks from GPU to CPU


is there a way to implement cheap synchronous callbacks from the GPU kernel to the CPU? The way I currently try to achieve this is through loops and memory locks. The following code illustrates this. I’m aware of the fact that this flies in the face of SIMD and is most probably not a viable solution at all – but bear with me for now, please. Also, please just ignore the [font=“Courier New”]CudaArray[/font] – it’s just a fancy way of handling [font=“Courier New”]cudaMalloc[/font], [font=“Courier New”]cudaMemcpy[/font] et al.

__device__ void wait(volatile unsigned int* memlock) {
	*memlock = WAITING;
	// Spin until the CPU acknowledges; volatile keeps the compiler
	// from caching the flag in a register.
	while (*memlock != READY) { }
}


__global__ void compute(unsigned int* x, unsigned int* locks, unsigned int len) {
	unsigned int const lockidx = blockIdx.x * blockDim.x + threadIdx.x;
	unsigned int const ix = lockidx * len;
	unsigned int const back = ix + len;

	for (unsigned int i = ix; i < back; ++i) {
		// Do some work
		if (some_condition)
			wait(&locks[lockidx]);  // raise a signal, block until acknowledged
	}

	locks[lockidx] = FINISHED;
}


This kernel is invoked like this:

unsigned int x[n * len];
unsigned int locks[n] = { READY };

fill_data(x, len, n);

CudaArray<unsigned int> dx(x, x + n * len);
CudaArray<unsigned int> dlocks(locks, locks + n);

compute<<<m, n>>>(dx, dlocks, len);

while (true) {
	unsigned int finished = 0;
	copy(dlocks.begin(), dlocks.end(), locks);

	for (unsigned int i = 0; i != n; ++i) {
		switch (locks[i]) {
			case READY:
				break;
			case WAITING:
				std::cout << "Signal from thread " << i << std::endl;
				dlocks[i] = READY;
				break;
			case FINISHED:
				std::cout << "Thread " << i << " has finished" << std::endl;
				++finished;
				break;
		}
	}

	if (finished == n)
		break;
}

Of course, this also hinges on the assumption that I can somehow execute the kernel asynchronously and access its memory while it’s running.

This is currently not possible; there is no way for the GPU to save its context and wait for the CPU to finish. The best you can do (I think) is to leave a command queue in global memory to be processed by the CPU before the next kernel invocation.

But he’s not relying on saving the context; he’s just spinlocking on the GPU side. I suppose he would have to rely on a cudaMemcpyAsync and a stream to overlap the two, although I’m not sure whether you can add things to a stream while a kernel is running (I assume so, but I haven’t verified it). So, in short, I don’t know if this is possible, but it certainly seems like a very bad idea.

Hi both of you,

thanks for your replies. They more or less confirm what I’ve dreaded. A quick thing:

What do you mean by “add things to a stream”? Do I need to operate on a stream when using [font=“Courier New”]cudaMemcpyAsync[/font]? The reference seems to imply otherwise: “… can optionally be associated to a stream …”. In which case I wouldn’t need to do any stream operations at all here – or do I?

Yes, you’re almost certainly right. I’m currently looking into different strategies to port an algorithm to the GPU … unfortunately, it’s hard to extract components that operate on independent memory locations (write access!) and, at the same time, have similar control flows. The most promising candidate so far has different threads producing different amounts of data, and the only way I can see to handle this is to push the data to the CPU in real time.

Although I begin to suspect that it would be easier to push the data into a shared memory queue by locking and incrementing a counter.

I don’t understand how cudaMemcpyAsync can be optional about the stream parameter.

The deal with streams is this: Some GPUs let you do memcpys to them while they’re executing a kernel. This can be done by setting up two streams, where one stream runs the kernel and the other the memcpy. In your plan, you’d be memcpying endlessly as the kernel is outputting its data. It’s an interesting idea which might work.

I think that if you don’t pass a stream it goes into stream 0 (the special stream, the same one cudaMemcpy uses); the call just returns immediately instead of blocking the host until the copy completes.
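To make the two-stream layout concrete, here’s a minimal, hedged sketch — toy kernel, no error checking, and names invented for the example. Stream 0 would serialize with the kernel, so both the launch and the polling copies get explicit, distinct streams; truly asynchronous copies also require pinned host memory ([font="Courier New"]cudaMallocHost[/font]) and a device that supports copy/compute overlap:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

enum { READY = 0, WAITING = 1, FINISHED = 2 };

__global__ void compute(unsigned int* locks) {
    // placeholder for the real kernel
    locks[threadIdx.x] = FINISHED;
}

int main() {
    const unsigned int n = 32;

    unsigned int* h_locks;                               // pinned host memory:
    cudaMallocHost(&h_locks, n * sizeof(unsigned int));  // required for async copies
    unsigned int* d_locks;
    cudaMalloc(&d_locks, n * sizeof(unsigned int));
    cudaMemset(d_locks, 0, n * sizeof(unsigned int));    // all READY

    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    compute<<<1, n, 0, compute_stream>>>(d_locks);       // kernel in stream A

    unsigned int finished = 0;
    while (finished != n) {                              // poll in stream B
        cudaMemcpyAsync(h_locks, d_locks, n * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost, copy_stream);
        cudaStreamSynchronize(copy_stream);
        finished = 0;
        for (unsigned int i = 0; i != n; ++i)
            if (h_locks[i] == FINISHED) ++finished;
    }
    printf("all %u threads finished\n", n);

    cudaStreamDestroy(compute_stream);
    cudaStreamDestroy(copy_stream);
    cudaFreeHost(h_locks);
    cudaFree(d_locks);
    return 0;
}
```

Acknowledgements (writing READY back) would be a second cudaMemcpyAsync on the copy stream; whether the kernel actually observes those writes mid-flight is exactly the part I can’t vouch for.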