Removing elements from a global array written across blocks

I’m hoping someone could give me some advice on how to solve a problem I’ve run into…

I’m planning out the implementation of an algorithm that has multiple stages.

Currently, each stage is mapped to a kernel where each thread block does some processing on a single element in an array.
If there is more work to be done on an element after the current stage is completed, thread 0 of the block will write the index of the element into out[blockIdx.x], else it will write a 0.

For the next stage, I only need to launch blocks for the elements that need to be processed further. This number can be determined by using the ‘out’ array from the last stage.

However, I’m having a hard time finding a way for the blocks to determine which element to work on because the indices of elements are mixed in with many non-work elements (0s) in the ‘out’ array.

I’ve come up with a few solutions but none of them are ideal…
1) Launch the same number of blocks each time. Blocks that have a 0 in the output array from the last stage will return immediately
Pros: Easy to implement, no need to copy memory back to host, no extra processing needed
Cons: Overhead from launching extra blocks, wasted resources, unnecessary work & overhead for the block scheduler

2) Copy the output array back to the host and have the host store the non-0 indices that need more processing into a new array. The new array can then be copied to the next kernel which launches only the required number of blocks.
Pros: Only the required number of blocks are launched
Cons: Memory copy to the host and back

3) Launch another kernel to accomplish (2) above instead of doing it on the host. Host reads a single integer value back that states how many blocks need to be launched.
Pros: No need to copy output array back to the host. Could use zero-copy to retrieve the single integer (so it would be relatively fast?)
Cons: Would only launch 1 block (no benefit from launching more?), would probably require atomics to control placement of indices in new array. Still need to copy back an integer to host

4) Store -1s instead of 0s if there is not more work to be done. Between stages, launch a kernel to do a parallel sort, moving all of the -1s to the start of the array. Then count the number of -1s and store that value in global memory so (a) the next stage can use it to determine where to start reading and (b) The host will know how many blocks to launch.
Pros: No need to copy output array back to the host. Could use zero-copy to retrieve the single integer (so it would be relatively fast?). Parallel sort would hopefully be more efficient than the kernel in (3) above.
Cons: Still need to copy back an integer to host, counting number of -1s would probably require atomics (or possibly a reduction which would mean another kernel call?)

This is probably most similar to the following two threads but I don’t think either of them accomplishes quite what I need.

I would really appreciate any advice you guys could offer. Thanks!

What you most likely want is called a stream compaction. Using a scan to generate indices to read, it efficiently handles your issue where


Thanks for the quick reply MisterAnderson. I came across your message in another thread earlier that mentioned stream compaction and that got me looking into scan but I couldn’t quite see how it would work with my problem.

After reading over some of the other posts you provided however, I feel I have a better understanding…so thanks for that!

The way I see the stream compaction working out in my case is:

1) Launch processing kernel

results in out = [1 1 0 0 1 0 1 1 0 0 0 1] for example

2) Launch scan kernel

outA = [1 2 2 2 3 3 4 5 5 5 5 6]

3) Launch compaction kernel…something like

outIndex = blockIdx.x * blockDim.x + threadIdx.x;

if( out[outIndex] == 1)


	int indexToStoreIn = OutA[blockIdx.x * blockDim.x + threadIdx.x] -1;

	out[indexToStoreIn]= outIndex;


I guess the compaction and scan could be combined into a single kernel where the indices of ‘out’ that are scanned by the current block are then compacted by the same block…

I still have to find a way to determine the number of blocks for the next stage and get it back to the CPU.

The way I see it, I can

  1. find max(outA) by having the block processing the last elements for the scan also do a reduction

  2. use an atomicAdd to a global variable anytime a block sets out to 1 but that will probably be slower…

Do you believe stream compaction is the best solution out of everything listed above? Do you think the extra time required to do the stream compaction and max calculation would outweigh the overhead of launching the same number of blocks every time (if most of them eventually ended up returning immediately?)

If I go with this method, would zero-copy will be the best possible way for the host to read the number of blocks for the next stage?

Thanks again for your time and assistance.

A reduction will definitely be faster.

I’m afraid only a benchmark will really tell. And it will depend on how many elements are to be removed. The stream compaction definitely has an overhead and the tipping point for better performance will depend on the number of elements removed, the amount of wasted time processing them and the time it takes to do the stream compaction.

tmurray would say yes. I don’t actually have a zero-copy capable machine handy to try out so I haven’t benchmarked it myself.

Thanks again for your input MisterAnderson, I’ll give this a shot and report back on the results!

You might consider using Thrust in your application. Thrust provides several functions for stream compaction and reduction operations.

All you need to do is #include the necessary header files, so it should be easy to integrate Thrust with your existing source code.

If you provide a little more information/code I can show you how to solve your problem w/ Thrust.