Array re-arrangement

I’ve got a 256 size int array (in shmem) which contains calculated values by different 256 threads. A non-valid value is -1.
Any suggestions (reference to the SDK?) as to how do I remove the -1 values from the array efficiantly?
For example if the original array was:
-1, -1, 4, 3, 4, -1, -1, 2, 0, 2, -1, …
What I want as result is:
4, 3, 4 , 2 ,0 , 2, …


What you’re trying to do is called a compaction. It’s a common and useful fundamental algorithm.
It’s usually done as a prefix scan sum… each element would add 0 or 1 to the scan depending on whether to include or
exclude the element. The result will be an accumulated sum which will tell the slot to move the current value to in order to remove the holes.

So your values
-1, -1, 4, 3, 4, -1, -1, 2, 0, 2, -1, …

would scan with 0/1 values

0 0 1 1 1 0 0 1 1 1 0 …

which would produce a scan of

0 0 0 1 2 3 3 3 4 5 6 6 …

and then you just write the original (non -1) value into the index provided to create the compacted array.

There’s a nicely written scan example in the SDK projects.


Thanks for the prompt answer, do you happen to have a simpler kernel then the one in the SDK?

How does this can be write coalesced?



As far as I know, for per block scans, it doesn’t get any simpler than scan_naive in the SDK. When I needed to scan a large array, I just used CUDPP.

If this is per block, you can do the compaction into shared memory and then write out the global memory coalesced. If it is a large array, you really cannot coalesce :( Note that as long as your data is somewhat dense, you still should get decent coalescing on G200 chips with their relaxed coalescing rules.

I’ll use the naive pattern for the meantime, hopefully it will be fast enough.

Its per block, and the chunk size is big (lets say 4000 size array to compact) so I guess this counts as big.

However i dont understand why I can’t coalesce?

I break this 4000 size to 256 sized chunks and i then compact each chunk in smem, as you’ve suggested. Am i missing something?

One more question though, since you’ve mentioned CUDPP :) I took a quick look at this library. Can I call those methods from within the kernel? or only from the host code?

what if I need this compactation (or any other algorithm) to be used in my kernel?



Sorry for any confusion by my terms. I was using “large array” the same way the SDK sample does: one array with say ~1 million elements (for example) that you want to scan using the whole GPU: not just per block.

There are methods in CUDPP that you can call from the kernel (the CTA level API I think it is called). I’ve never tried it so I can’t offer any tips or code snippets. I remember that the documentation was pretty good.

When I mentioned CUDPP w.r.t large arrays: I was just using the application level API: cudppScan and related functions to do all the dirty work of scanning a large array using the whole GPU. Doesn’t really apply to you since you just want a scan per block.

Hi guys,

Just wanted to thank you for the information and assistance, kernel now works like a charm :)

Full story at:

Thanks again… This forum is indeed amazing :)