Producer-Consumer paradigm CUDA approach

Hello all!

First of all, I’m new to CUDA, so excuse me if I talk nonsense :whistling:

I was wondering if it’s possible to implement the Producer-Consumer paradigm using CUDA, or if there is some other approach?

Maybe not the traditional Producer-Consumer, just some kind of structure that uses a buffer to pass data from one place to another, where there is a reader and a writer?

The producer-consumer pattern is a particular implementation of “task parallelism,” where you break the processing up into stages. That works reasonably well on multicore CPUs and Cell, which have a relatively small number of execution units and good inter-core synchronization primitives. CUDA is a data-parallel architecture, where you get the best results if you break the input data up into chunks that can be processed mostly independently.

To convert a producer-consumer algorithm to CUDA, it is best to collapse all the stages of your producer/consumer chain into one kernel, and then chop up the input data (or the output data slots, depending on how your algorithm looks) into the smallest independently processable elements. You should then run as many threads as you have elements (yes, this sounds a little crazy, but the CUDA scheduler and memory controller work best when you have thousands of threads in flight). If you hit a CUDA limit on threads, then sticking a loop in your kernel so that each thread processes a few data chunks is reasonable.
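As a minimal sketch of that idea (the kernel name and the two toy "stages" here are made up for illustration): the producer and consumer stages are fused into a single kernel, one thread handles one element, and a grid-stride loop covers the case where there are more elements than threads.

```cuda
#include <cstdio>

// Hypothetical fused kernel: what would have been separate "producer"
// and "consumer" stages run back-to-back inside one kernel, per element.
// The grid-stride loop lets each thread process several elements when
// the element count exceeds the total number of threads launched.
__global__ void fusedProducerConsumer(const float *in, float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        float produced = in[i] * 2.0f;  // stage 1: "producer" work (toy example)
        out[i] = produced + 1.0f;       // stage 2: "consumer" work (toy example)
    }
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    // Launch roughly one thread per element; the loop above picks up
    // any remainder if the grid is smaller than n.
    int block = 256;
    int grid  = (n + block - 1) / block;
    fusedProducerConsumer<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[10] = %f\n", out[10]);  // 10 * 2 + 1
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Note that because the elements are independent, no inter-thread buffer or synchronization is needed at all, which is exactly the point of recasting the pattern this way.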

Oops, looks like I posted in the wrong thread. The other one has much more specific suggestions, so people should look there. :)