Producer-Consumer paradigm CUDA approach

Hello all!

First of all, I’m new to CUDA, so excuse me if I talk nonsense :whistling:

I was wondering if it’s possible to implement the Producer-Consumer paradigm using CUDA? Or if there is some other approach?

Maybe not the tradicional Producer-Consumer, just some kind of structure that uses a buffer to pass data from one place to another…where there is a reader and a writer…?

How would one implement something like this? I already searched for this but didn’t find much, could you please give me some directions?

Would it be possible to use some kind of Scheduler in the host which gives some execution time to the kernels, which will then try to read/write from a buffer in global memory?

It’s definitely an interesting question, just please don’t double-post (

Probably the simplest way is to allocate a bunch of memory for tasks. The wrap around is where it gets difficult, but you shouldn’t have long-lived kernels anyway. What you could do is make a global queue, though you have to watch out, because global memory is not sequentially consistent. I haven’t spent much time thinking about it, but you could have each producer block own an array for storing tasks, and have one global variable for the head, and another for a “maybe tail” (i.e. optimization to make dequeueing on average constant time). After a producer successfully writes mine := { prev_ptr=head, task_info, grabbed=False, next_ptr=None } to it’s local memory and calls threadfence, it could atomicCAS(head := mine) (and reinitialize prev_ptr and try again if it fails). If blocks produce a variable amount of data, they could initialize a chain in their task memory, and then append the entire chain. Threads dequeuing could theoretically walk the list until they find the last item with grabbed=False, atomicExch it to True, and start consuming if they changed it from False to True. An efficiency optimization could be to set a next_ptr field and keep a closestTail variable, so consumers could walk the chain from the tail.

Another trick you’d probably have to use is making blocks dual personality (i.e. one could become either a producer or consumer, depending on the work that needs to be done).

Something more along the lines of Cilk’s task stealing dequeues would probably work better in practice. For the most part, you don’t want to be producer-consumer chains on the GPU unless you’re processing a large amount of data. atomicCAS to global memory is slow (particularly when you have to fail and try again) . one thing you could try would be to use multiple dequeues and have consumers scan for one (and producers choose one that other producers aren’t using).

Thanks for the reply, I think something like your solution could work…

I’ll just try to be more specific: What I’m trying to accomplish with this is to have a bunch of different operators that perform some task that is produced by the previous operator…

Say you have an operator A and an operator B. A has some input queue attached to it, reads from it and then outputs results to it’s output queue. A’s output queue is B’s input queue. B reads from this queue, performs a task and outputs results to it’s output queue, and so one…

I’m confused on how to implement this using CUDA, as I’m not used to thinking this way :">

Why can’t you have the same block compute B(A(in))?

Well, there’ll probably be a huge amount of continuous input data, and having each operator independent allows to dynamically construct a model.

When I say operator, I’m referring to something that will perform a bunch of operations (like expression filtering and so one).

In a non CUDA point of view, this would be easily implemented having a thread per operator and a shared buffer between two operators…But I know that in CUDA this probably isn’t possible, so Ihave to try different methods…

I’m just clueless…

Okay, so after doing some research, I found out that what I’m trying to implement is actually some kind of Stream Processing…which is basically using streams for kernel communication…

How does one implement something like this in CUDA?

If you wouldn’t mind being more specific about the problem you’re trying to solve, it would help. I’m not sure what you mean by “shared buffer between two operators”. Are you trying to do a parallel reduction?

unfortunately I’m not very familiar with stream processing.


Well, here’s the model I’m trying to implement:

I’m just confused how can I connect does kernels…

okay, let’s name the kernels A, B, C (left to right)

The problem with your picture is that it’s too generic – it doesn’t really describe the data parallelism. Why can’t you just run A, then B, then C? Do you not have enough data to achieve sufficient parallelism? Do you want to write a pipeline instead because you want to avoid latency?

For example, if A does some global permutation [that destroys locality] on a million elements, then it’s only feasible that you wait until that kernel’s done executing before you execute B. On the other hand, if there’s data locality, then you could just compute “a = A(x); C(a, B(a))” in shared memory, which would probably be better.

I went to a lecture a while ago on doing something similar automatically (mapping tasks to processors) – link here…&filtersel=. I haven’t followed up by reading more though… tell me if you come across anything interesting.


The idea is to have data constantly passing through those data streams, so that they can be processed by the kernels, so yeah, I think they do not need to be executed in order, but still, I’m not sure how I could do this by using “a = A(x); C(a, B(a))” …

That’s why I was thinking about those data streams between each pair of kernels to be a shared buffer between them, so that they could process from their input data stream and output the results to their output data stream…Then maybe a scheduler would call kernel B, then C, and so on…

I’m not sure I got you right but perhaps you’re overengineering that.

Data you store in Global memory (physically, the GPU’s RAM) works exactly as it works on CPU. It’s persistent. You could write


kernelA(x, a); //x input, a output

kernelB(a, b); //a input, b output

kernelC(a, b, c); //a, b input, c output

Kernels can’t return values, so you won’t be able to write kernelA( kernelB( … ) ), but you are free to wrap your kernel calls in functions

float* functA(const float *d_x)


	//assume d_x is a pointer to device memory already

	float* d_a;

	cudaMalloc((void**)&d_a, Some_Size); 

	kernelA<<< /* launch parameters */>>>(d_x, d_a);

	return d_a;


Note, it might be more effective to prealloc device memory beforehand.