As mentioned in another thread, the topic of stream compaction is discussed in GPU Gems 3 in section 39.3.1 on page 866. Unfortunately, there is no sample code to go along with that section. Stream compaction is based on the scan algorithm, and the SDK comes with projects that implement scan for both large and small arrays, which I am sure could be modified to implement stream compaction.
Stream compaction is an important algorithm because it addresses a common problem in GPGPU programming: threads whose workloads vary in length. If threads need to iterate a different number of times to generate their data, it is more efficient to run them all for a fixed number of iterations, build a compacted list of the threads that still require processing, and then continue running just those threads.
A simple example where this would be helpful is a Mandelbrot generator (like the one I posted). It would make sense to run each pixel's Mandelbrot loop a set number of times, then let the unfinished pixels continue for another 100 iterations, repeating until the iteration limit is reached or all the pixels have completed. Between passes, stream compaction could be used to gather the IDs of the unfinished threads for additional processing.
A more common example where stream compaction could be used to optimize processing is adaptive sampling. The Mandelbrot generator I posted uses this technique to anti-alias only the pixels that need it. Unfortunately, due to the GPU architecture, it ends up being far less efficient than it could be: all 32 threads in a warp have to wait if even one of their pixels falls outside the adaptive tolerance and needs additional sampling. Stream compaction could instead be used to generate a list of just the aliased pixels that need additional sampling. In this case, compaction needs to be done only once; subsequent passes can then resample that same set of pixels for as long as time remains in the frame.
I believe there are many other cases where stream compaction would be useful for optimizing algorithms, so it should probably be implemented in one of the CUDA libraries in addition to having SDK examples. I imagine a library function that, given a base address, a stride, and a data size (1, 2 or 4 bytes), could generate a separate list of indices to the matching data elements. You could also allow threads to exit with a flag indicating whether they need additional processing, then automatically perform stream compaction and re-launch just those threads. The threads themselves could control when they exit for additional stream compaction.
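To make that suggestion concrete, here is a host-side sketch of what such a library entry point might look like. The name and signature are entirely hypothetical, not an existing CUDA or SDK API, and the sketch assumes a little-endian host for the 1- and 2-byte flag reads:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical API sketch: walk a strided buffer of flags written by
// exiting threads and emit the indices of the elements whose flag is
// nonzero. On the GPU this scatter would be driven by an exclusive
// scan of the flags rather than a serial loop.
std::vector<unsigned> compactIndices(const void* base, size_t stride,
                                     size_t elemSize,   // 1, 2 or 4 bytes
                                     size_t count) {
    std::vector<unsigned> indices;
    const uint8_t* p = static_cast<const uint8_t*>(base);
    for (size_t i = 0; i < count; ++i, p += stride) {
        uint32_t v = 0;
        std::memcpy(&v, p, elemSize);   // little-endian flag read
        if (v) indices.push_back(static_cast<unsigned>(i));
    }
    return indices;
}
```

A driver loop could then re-launch a kernel over just the returned index list, repeating until the list comes back empty.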
For reference, the previous thread that mentioned stream compaction: