Transform feedback in CUDA?

In several algorithms I have found using transform feedback useful when using OpenGL. I can for example send a large list of something to the Geometry shader, discard the elements i don’t care about and have the filtered list returned as a transform feedback.
Is there any efficient way of doing this in CUDA? Thankful for any suggestions,


The problem you are describing is known as stream compaction. It is possible to implement this on parallel machines using a prefix sum (scan).

The chapter “Stream Reduction Operations for GPGPU Applications” in GPU Gems 2 describes how to do this with the graphics API using a scan followed by a parallel search and gather.

In CUDA you could do it using scan followed by scattered writes.

The basic idea is to scan an array indicating if the stream is empty or not, and then use the results as the addresses to write the final values, e.g.:

data: 0 a 0 0 b c d 0 e 0

       0 1 0 0 1 1 1 0 1 0

scan: 0 1 1 1 2 3 4 4 5 5


1: a

2: b

3: c

4: d

5: e

Oh, allright, thank you, I see how it can be done now. However, comparing just the speed of prefix-sum (the example in the CUDA SDK) and some of my transform-feedback code, it seems there’s now way I can reach the same speeds with this implementation. Do you know if this is how the transform feedback is implemented or if the OpenGL implementation has access to some hardware stuff I can’t touch in CUDA? Thank you so much for your help,

Erik Sintorn

Could you elaborate just a bit on the best way to do the final step?

It seems that you need only the threads where data[tid] is non-zero to write for this to work without multiple threads writing to the same location. So, lets say I have data, scan, and result all as shared arrays with length equal to the block size.

Then the write would be:

if (data[thid] != 0)

    result[scan[thid]] = data[thid];

Is this the best way to do it? Will the if cause divergent warps?

An alternative approach is to use Histopyramid reductions, discussd here thread on This technique is particularly efficient when a lot of the data is culled.

The histopyramid construction should benefit considerably from the shared mem of cuda, but the algorithm is a bit hindered by that a cuda-kernel can’t output to a 2d-texture directly without a copy (the algorithm has really good cache-behaviour on 2d-textures).

Though, with some fiddling, creating “chunks” of smaller pyramids etc. I’ve managed to get the performance almost at par with my opengl implementation.

BTW, stream compaction using efficient scan in CUDA will be covered in a chapter in the upcoming GPU Gems 3 book.


Cool, I’m looking forward to it!

Seems that there is quite a lot of different strategies to data compaction/expansion. Would be interesting with a shoot-out between the different approaches on a little suite of standarized tests. :-)