Transform feedback in CUDA?

Hello,
In several algorithms I have found transform feedback useful when using OpenGL. I can, for example, send a large list of something to the geometry shader, discard the elements I don't care about, and have the filtered list returned via transform feedback.
Is there any efficient way of doing this in CUDA? Grateful for any suggestions,

Erik

The problem you are describing is known as stream compaction. It is possible to implement this on parallel machines using a prefix sum (scan).

The chapter “Stream Reduction Operations for GPGPU Applications” in GPU Gems 2 describes how to do this with the graphics API using a scan followed by a parallel search and gather.

In CUDA you could do it using scan followed by scattered writes.

The basic idea is to scan an array of flags indicating whether each element is valid, and then use the scan results as the addresses at which to write the final values, e.g.:

data:  0 a 0 0 b c d 0 e 0
flags: 0 1 0 0 1 1 1 0 1 0
scan:  0 1 1 1 2 3 4 4 5 5

result:
1: a
2: b
3: c
4: d
5: e
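
For concreteness, here is a minimal sketch of the scatter step in CUDA. It is only an illustration: the kernel and array names are made up, and it assumes the flags have already been run through an exclusive scan (e.g. with the scan example in the CUDA SDK), so output slots are 0-based rather than the 1-based slots listed above.

// Hypothetical scatter kernel. flags[i] is 1 if data[i] is valid, and
// addr holds the exclusive prefix sum of flags, so addr[i] is the
// 0-based output slot for element i.
__global__ void scatterCompact(const float *data,
                               const unsigned int *flags,
                               const unsigned int *addr,
                               float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        result[addr[i]] = data[i];
}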

Oh, all right, thank you, I see how it can be done now. However, comparing just the speed of the prefix-sum example in the CUDA SDK against some of my transform-feedback code, it seems there's no way I can reach the same speeds with this implementation. Do you know if this is how transform feedback is implemented, or if the OpenGL implementation has access to some hardware features I can't touch in CUDA? Thank you so much for your help,

Erik Sintorn

Could you elaborate just a bit on the best way to do the final step?

It seems that only the threads where data[thid] is non-zero should write, so that no two threads write to the same location. So, let's say I have data, scan, and result all as shared arrays with length equal to the block size.

Then the write would be:

if (data[thid] != 0)
    result[scan[thid]] = data[thid];

Is this the best way to do it? Will the if cause divergent warps?
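
For reference, a self-contained sketch of that per-block write under the stated assumptions: shared arrays sized to the block and an inclusive scan as in the example above (BLOCK_SIZE, the kernel, and the array names are illustrative, not code from this thread). The if does make warps diverge, but the divergent path contains only a single store, so the cost should be small.

#define BLOCK_SIZE 256  // illustrative block size

__global__ void compactBlock(const float *in)
{
    __shared__ float data[BLOCK_SIZE];
    __shared__ unsigned int scan[BLOCK_SIZE];
    __shared__ float result[BLOCK_SIZE + 1];  // slot 0 unused (1-based slots)

    unsigned int thid = threadIdx.x;
    data[thid] = in[blockIdx.x * BLOCK_SIZE + thid];
    __syncthreads();

    // ... inclusive scan of the validity flags into scan[] goes here,
    //     e.g. following the scan example in the CUDA SDK ...

    if (data[thid] != 0.0f)              // only valid elements write, and the
        result[scan[thid]] = data[thid]; // scan gives each one a distinct slot
    __syncthreads();                     // compacted list now visible to the block
}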

An alternative approach is to use histopyramid reductions, discussed in a thread on gpgpu.org. This technique is particularly efficient when a lot of the data is culled.

The histopyramid construction should benefit considerably from CUDA's shared memory, but the algorithm is somewhat hindered by the fact that a CUDA kernel can't output to a 2D texture directly without a copy (the algorithm has really good cache behaviour on 2D textures).

Though, with some fiddling (creating “chunks” of smaller pyramids, etc.), I've managed to get the performance almost on par with my OpenGL implementation.
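
As an aside, here is a minimal sketch of what one histopyramid reduction level might look like in CUDA; the kernel name and flat row-major layout are assumptions, not code from this thread. Each coarse cell sums the 2x2 block of counts beneath it, so repeating this until a single cell remains leaves the total survivor count at the apex.

// Hypothetical kernel building one histopyramid level: each coarse cell
// accumulates the counts of the 2x2 block of fine cells below it.
// fine is (2*coarseW) x (2*coarseH); both levels are flat row-major arrays.
__global__ void buildLevel(const unsigned int *fine, unsigned int *coarse,
                           int coarseW, int coarseH)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= coarseW || y >= coarseH) return;

    int fineW = 2 * coarseW;
    const unsigned int *p = fine + (2 * y) * fineW + 2 * x;
    coarse[y * coarseW + x] = p[0] + p[1] + p[fineW] + p[fineW + 1];
}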

BTW, stream compaction using efficient scan in CUDA will be covered in a chapter in the upcoming GPU Gems 3 book.

Mark

Cool, I’m looking forward to it!

Seems that there are quite a lot of different strategies for data compaction/expansion. It would be interesting to see a shoot-out between the different approaches on a little suite of standardized tests. :-)

Chris