The producer-consumer pattern is a particular implementation of “task parallelism,” where you break the processing up into stages. That works reasonably well on multicore CPUs and Cell, which have a relatively small number of execution units and good inter-core synchronization primitives. CUDA is a data-parallel architecture, where you get the best results by breaking the input data up into chunks that can be processed mostly independently.
To convert a producer-consumer algorithm to CUDA, it is best to collapse all the stages in your producer/consumer chain into one kernel, and then chop the input data (or the output data slots, depending on how your algorithm looks) into the smallest independently processable elements. Then run as many threads as you have elements (yes, this sounds a little crazy, but the CUDA scheduler and memory controller work best when you have thousands of threads in flight). If you hit a CUDA limit on threads, it is reasonable to put a loop in the kernel so that each thread processes a few elements — the common idiom for this is a “grid-stride loop.”
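As a rough sketch of what this fusion looks like, here is a hypothetical two-stage chain collapsed into a single kernel (the stage bodies are placeholder arithmetic, not from the original answer), with a grid-stride loop so the same kernel works whether you launch one thread per element or fewer threads than elements:

```cuda
#include <cstdio>

// Hypothetical fused kernel: the former producer stage and consumer
// stage are collapsed into one kernel body, and each thread owns one
// (or a few) output slots instead of the stages handing data off.
__global__ void fusedKernel(const float* in, float* out, int n)
{
    // Grid-stride loop: if n exceeds the number of launched threads,
    // each thread simply strides forward and processes more elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
    {
        float v = in[i] * 2.0f;   // stage 1: what the "producer" did
        out[i]  = v + 1.0f;       // stage 2: what the "consumer" did
    }
}

int main()
{
    const int n = 1 << 20;        // ~1M elements -> ~1M threads in flight
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    int block = 256;
    int grid  = (n + block - 1) / block;  // roughly one thread per element
    fusedKernel<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[10] = %f\n", out[10]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Launching `grid` blocks of 256 threads gives one thread per element; if you instead capped `grid` at some smaller value, the stride loop would quietly pick up the slack, which is why this idiom is the usual answer to hitting a launch-size limit.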