how to generate flexible output array for each input data entry

billconan · January 26, 2010, 7:23pm

hello everyone,

sorry, my topic is not clear. what i mean is that i have an array as the input, and for each input entry, i want to generate several outputs. however, the number of the outputs is different for each input value. therefore, the output for each input value is actually a flexible array. i want to read all the outputs back to cpu as an entire large array. and i don’t know how to do that.

the thing i want to implement is pretty similar to a geometry shader program. you are allowed to generate flexible number of output for each input and read them back together.

i don’t know if cuda supports this kind of program, and how to do this.

thanks in advance.

avidday · January 26, 2010, 7:48pm

The biggest hurdle in implementing a kernel which produces a variable (and unknown a priori) number of output values from each thread is the lack of memory allocation inside kernels at the moment. The way around it is to dynamically allocate a large chunk of memory before you run the kernel, and then reserve a big enough piece of that allocation for each block to write results into. Without too much extra code, you can probably get a warp or half warp of threads in each block to do all the writes from a temporary shared memory buffer, so that the writes are coalesced. This all presumes you have some idea of the upper bound of how many output values each thread or block is likely to produce. If you can’t even say that, then things are considerably more complex.

mroberts · January 27, 2010, 10:26am

Hello,

The operation you want to do can be solved using the so-called “scan” algorithm. There has been a bunch of research on the problem you describe because it comes up in so many applications. Fortunately you don’t have to know or care about the details of the research. Check out the CUDPP library and more specifically the cudppCompact( … ) function for an implementation that is fast and ready to use.

Given a CUDA buffer of data values and a CUDA buffer of flags that mark each data value as “keep” or “discard”, the cudppCompact( … ) function returns a densely packed CUDA buffer of only those data values that were marked as “keep”.

Data Buffer: A B C D E F G H I J K

Flag Buffer: 0 1 0 1 1 1 1 0 0 1 0		(1 means "keep" and 0 means "discard")

Output Buffer: B D E F G J

You use this operation to solve your problem by first allocating an intermediate buffer that is big enough to hold the maximum possible number of outputs. If your maximum possible number of outputs per thread is 6 (as is the case with geometry shaders), then your intermediate buffer needs to be 6 * the input size. Each thread can write to a subset of the 6 outputs, and then you can compact the data to make it densely packed afterwards.

The downside to this approach is that you need to allocate an intermediate buffer that is big enough to hold the maximum possible number of intermediate values. To get around this, you can do an initial pass that computes how much output to do per thread. Then you can do a scan operation to determine the appropriate array offset for each thread, and then do a third pass to actually write your output.

Data Buffer:			  A B C D E F G H I J K

Number of Outputs Buffer: 0 2 0 5 4 0 0 0 7 0 0

Scan Buffer:			  0 0 2 2 7 11 11 11 11 18 18 18

What’s important here is that the scan buffer at element i contains the cumulative sum of all elements before i in the number-of-outputs buffer. The total size you need for your final output is given by the last element of the scan buffer (18) + the last element of the number-of-outputs buffer (0). You can use the scan buffer to tell each thread what offset they should use when writing their output values. Conveniently CUDPP also includes code to efficiently generate the scan buffer.

The downside to this approach is that it may require you to pay the cost of determining what to output twice - once to write out how many outputs you need, and again to actually write the outputs. This might not apply to your problem since it might be very cheap to determine “I need 4 outputs” without actually needing to figure out what those 4 outputs are.

I hope that helps! There is a large literature on the parallel scan operation, so if you’re interested you can check out all the recent papers on it. Otherwise you can just use CUDPP to get good performance without having to get into the details :)

cheers,

mike

avidday · January 27, 2010, 10:48am

Perhaps I misread the original poster’s question, but I don’t believe that what he is asking for is a parallel scan.

mroberts · January 27, 2010, 11:08am

It depends on what you’re doing, but if you need variable output per thread and you need the output densely packed then scan is important. If you can’t place an upper bound on the amount of output you need per thread at compile time then scan is also important (because in this case you need the output to be densely packed by definition).

Although I might have misunderstood the original question :)

avidday · January 27, 2010, 11:22am

I agree that a parallel scan can be used to efficiently compact the results into a continuous stream, but I don’t think it addresses the basic question, which seems to be can you have a CUDA kernel which has a 1:many input data to output data ratio, especially when the amount of output isn’t known beforehand.

Topic		Replies	Views
Return variable number of results...? CUDA Programming and Performance	3	2124	August 1, 2008
More Fermi Fun: Output queues CUDA Programming and Performance	2	6167	June 1, 2010
Single Input Multiple Output CUDA Programming and Performance	0	460	June 22, 2018
Variable length data structure in CUDA How to make a variable length data structure like list effici CUDA Programming and Performance	2	4228	January 8, 2009
Kernel computation on multiple array elements CUDA Programming and Performance	5	1092	October 12, 2021
Most efficient way to transform X elements to Y elements CUDA Programming and Performance	7	7066	March 11, 2009
Generating unknown amount of data CUDA Programming and Performance	6	3144	February 22, 2008
Is it possible to write to output randomly? CUDA Programming and Performance	3	2564	April 12, 2008
Unparallizable problem solvable with atomic ops? CUDA Programming and Performance	6	3929	March 12, 2008
selective output versus redundant output which is a better theoretical approach CUDA Programming and Performance	2	1814	September 30, 2008

how to generate flexible output array for each input data entry

Related topics