sorry, my topic is not clear. what i mean is that i have an array as the input, and for each input entry, i want to generate several outputs. however, the number of the outputs is different for each input value. therefore, the output for each input value is actually a flexible array. i want to read all the outputs back to cpu as an entire large array. and i don’t know how to do that.
the thing i want to implement is pretty similar to a geometry shader program. you are allowed to generate flexible number of output for each input and read them back together.
i don’t know if cuda supports this kind of program, and how to do this.
The biggest hurdle in implementing a kernel which produces a variable (and unknown a priori) number of output values from each thread is the lack of memory allocation inside kernels at the moment. The way around it is to dynamically allocate a large chunk of memory before you run the kernel, and then reserve a big enough piece of that allocation for each block to write results into. Without too much extra code, you can probably get a warp or half warp of threads in each block to do all the writes from a temporary shared memory buffer, so that the writes are coalesced. This all presumes you have some idea of the upper bound of how many output values each thread or block is likely to produce. If you can’t even say that, then things are considerably more complex.
The operation you want to do can be solved using the so-called “scan” algorithm. There has been a bunch of research on the problem you describe because it comes up in so many applications. Fortunately you don’t have to know or care about the details of the research. Check out the CUDPP library and more specifically the cudppCompact( … ) function for an implementation that is fast and ready to use.
Given a CUDA buffer of data values and a CUDA buffer of flags that mark each data value as “keep” or “discard”, the cudppCompact( … ) function returns a densely packed CUDA buffer of only those data values that were marked as “keep”.
Data Buffer: A B C D E F G H I J K
Flag Buffer: 0 1 0 1 1 1 1 0 0 1 0 (1 means "keep" and 0 means "discard")
Output Buffer: B D E F G J
You use this operation to solve your problem by first allocating an intermediate buffer that is big enough to hold the maximum possible number of outputs. If your maximum possible number of outputs per thread is 6 (as is the case with geometry shaders), then your intermediate buffer needs to be 6 * the input size. Each thread can write to a subset of the 6 outputs, and then you can compact the data to make it densely packed afterwards.
The downside to this approach is that you need to allocate an intermediate buffer that is big enough to hold the maximum possible number of intermediate values. To get around this, you can do an initial pass that computes how much output to do per thread. Then you can do a scan operation to determine the appropriate array offset for each thread, and then do a third pass to actually write your output.
Data Buffer: A B C D E F G H I J K
Number of Outputs Buffer: 0 2 0 5 4 0 0 0 7 0 0
Scan Buffer: 0 0 2 2 7 11 11 11 11 18 18 18
What’s important here is that the scan buffer at element i contains the cumulative sum of all elements before i in the number-of-outputs buffer. The total size you need for your final output is given by the last element of the scan buffer (18) + the last element of the number-of-outputs buffer (0). You can use the scan buffer to tell each thread what offset they should use when writing their output values. Conveniently CUDPP also includes code to efficiently generate the scan buffer.
The downside to this approach is that it may require you to pay the cost of determining what to output twice - once to write out how many outputs you need, and again to actually write the outputs. This might not apply to your problem since it might be very cheap to determine “I need 4 outputs” without actually needing to figure out what those 4 outputs are.
I hope that helps! There is a large literature on the parallel scan operation, so if you’re interested you can check out all the recent papers on it. Otherwise you can just use CUDPP to get good performance without having to get into the details :)
It depends on what you’re doing, but if you need variable output per thread and you need the output densely packed then scan is important. If you can’t place an upper bound on the amount of output you need per thread at compile time then scan is also important (because in this case you need the output to be densely packed by definition).
Although I might have misunderstood the original question :)
I agree that a parallel scan can be used to efficiently compact the results into a continuous stream, but I don’t think it addresses the basic question, which seems to be can you have a CUDA kernel which has a 1:many input data to output data ratio, especially when the amount of output isn’t known beforehand.