Are there good ways to implement a pipeline consisting of several stages, using several kernels/contexts that execute in parallel? In other words, ideally I would like to use the multiprocessors on the G80 card as a systolic array.
Is it possible for the different stages to be data-driven without intervention from the CPU, so that whenever data arrives at a stage, that stage's kernel wakes up (if necessary), runs, and feeds its output to the next stage/kernel?
It is not possible to run two kernels concurrently, so this would need a single fat kernel that executes whichever stage is appropriate.
But as I understand the CUDA execution model, you cannot control the sequence in which blocks are executed anyway (and they may be delayed/interleaved by the thread scheduler on a multiprocessor to hide latency), so the whole thing will not work.
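Just to illustrate what I mean by a fat kernel, here is a minimal sketch where blockIdx.x selects the stage. The stage bodies and buffer names are placeholders; and as said, because block execution order is not guaranteed, the stages launched together have to be independent (e.g. operate on different frames), so one stage cannot safely wait for another within the same launch.

// Hypothetical pipeline stages; each block runs one of them, selected below.
__device__ void stageA(const float *in, float *out, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = in[i] * 2.0f;                // placeholder work
}

__device__ void stageB(const float *in, float *out, int n)
{
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[i] = in[i] + 1.0f;                // placeholder work
}

// One launch, blockIdx.x picks the stage a block executes.
__global__ void fatKernel(const float *bufA, float *bufB, float *bufC, int n)
{
    if (blockIdx.x == 0)      stageA(bufA, bufB, n);   // first stage
    else if (blockIdx.x == 1) stageB(bufB, bufC, n);   // second stage
}

// launched with one block per stage, e.g. fatKernel<<<2, 256>>>(dA, dB, dC, n);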
Thanks. I suspected as much. Given that the multiprocessors are scalar, I had hoped I could implement a systolic array. Oh well.
What would be nice in a future release is more control over multiprocessor allocation, rather than a kernel "filling up" as many multiprocessors as it wants, so that I could assign a multiprocessor (or several) to run a particular kernel and chain multiprocessors/blocks together with some basic semaphore capability. That is all I need to implement a steady-state pipeline for applications like video encoding. Even though I might be able to implement a single fat kernel, it's not very clean architecturally.
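For now, the closest I can get is driving the stages from the CPU as back-to-back kernel launches per frame. A rough sketch of what that looks like (kernel names and bodies are just placeholders for the real encoder stages); successive kernels on the device are serialized, so each stage sees the previous stage's completed output, but the stages never overlap across multiprocessors:

// Hypothetical stage kernels with placeholder bodies.
__global__ void motionEstimate(const unsigned char *frame, short *vectors, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) vectors[i] = frame[i];            // placeholder
}

__global__ void transformQuantize(const short *vectors, short *coeffs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) coeffs[i] = vectors[i] / 2;       // placeholder
}

// Host-side loop: the "pipeline" is driven by the CPU as back-to-back
// launches per frame, not a true pipeline across multiprocessors.
void encodeFrames(int numFrames, int n, unsigned char *d_frame,
                  short *d_vectors, short *d_coeffs)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int f = 0; f < numFrames; ++f) {
        // ... copy frame f into d_frame ...
        motionEstimate   <<<blocks, threads>>>(d_frame, d_vectors, n);
        transformQuantize<<<blocks, threads>>>(d_vectors, d_coeffs, n);
    }
}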
Yeah, the problem with explicit control, on the other hand, is that the whole thing will probably run with quite a lot of idling while waiting for buffers to fill, which wastes a lot of ALU power. Given that each multiprocessor is relatively weak (675 MHz with a minimum of 2 clock cycles per instruction), I suspect that the overall performance would be disappointing. On the CPU you usually circumvent this by putting a pipeline stage to sleep, temporarily freeing its resources and giving more CPU timeslices to the active threads. On the GPU this would translate to a suspend/resume capability for the multiprocessor programs, to swap "sleeping" threads or thread blocks in and out. I guess this is far too complicated, as it would require stack pointers etc. Anyone from NVIDIA like to comment on that? :)
All good points. In my favor, video encoding usually has a very deterministic work profile: given the video resolution and the encoding parameters, I know how much work is required for each stage, so I can allocate fewer threads or blocks to the computationally cheap tasks (or even load multiple "cheap" kernels onto a few multiprocessors) and give more multiprocessors to the computationally expensive tasks such as motion estimation. This should reduce the number of stalls that could occur (or so the theory goes).
Anyway, moot point unless the CUDA framework evolves to allow me to do something that can’t be done today. :(
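For the record, here is roughly what I mean by allocating blocks in proportion to the work profile inside a single fat kernel. The stage bodies and the 48/16 split are placeholders for whatever the profile dictates at a given resolution, and the cheap stage has to consume the previous frame's data, since block order within one launch is not guaranteed:

#define ME_BLOCKS 48   // hypothetical: expensive stage gets most of the blocks
#define TQ_BLOCKS 16   // hypothetical: cheap stage gets only a few

__global__ void encodeFat(const unsigned char *frame, short *vectors,
                          short *prevVectors, short *coeffs, int n)
{
    if (blockIdx.x < ME_BLOCKS) {
        // motion estimation over this block's slice of the frame (placeholder)
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += ME_BLOCKS * blockDim.x)
            vectors[i] = frame[i];
    } else {
        // cheap transform/quantize stage, working on the previous frame's
        // vectors so it does not depend on blocks from this same launch
        int b = blockIdx.x - ME_BLOCKS;
        for (int i = b * blockDim.x + threadIdx.x;
             i < n; i += TQ_BLOCKS * blockDim.x)
            coeffs[i] = prevVectors[i] / 2;
    }
}

// launched as: encodeFat<<<ME_BLOCKS + TQ_BLOCKS, 128>>>(d_frame, d_vec, d_prevVec, d_coeffs, n);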