'Compacting' result array

I don't know if this is a common problem or if it has been asked about before.
I have a large array (we're talking 10^6 elements at least) into which the device writes results. Let's say each thread outputs either a 0 or some non-zero integer.
When I copy the results back to the host, the host is only concerned with the non-zero results. So is there a way I can compact the result array so that it contains only the non-zero values?
Currently I am using atomics to store the results contiguously, but this is proving too slow and is the biggest bottleneck in my application.
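
In other words, every thread that produces a result does something like the following (a simplified sketch, not my actual kernel; the per-thread computation and the names are placeholders):

__global__ void compute_and_compact(const int *input, int *out, unsigned int *out_count, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;

    int result = input[idx];   // stand-in for the real per-thread work; 0 means "no result"
    if (result != 0) {
        // every thread with a result contends for the same global counter
        unsigned int slot = atomicAdd(out_count, 1u);
        out[slot] = result;
    }
}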

Stream compaction has been implemented several times for CUDA. You should take a look at Thrust:

http://code.google.com/p/thrust/

Another implementation is in CUDPP:

http://gpgpu.org/developer/cudpp
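
With Thrust this is essentially a one-liner: thrust::copy_if with an "is non-zero" predicate compacts the array on the device (internally it uses a parallel scan plus scatter rather than a single atomic counter every thread fights over). A minimal sketch, assuming your results are ints and 0 means "no result":

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>

// Predicate selecting the results the host actually cares about.
struct is_nonzero
{
    __host__ __device__
    bool operator()(int x) const { return x != 0; }
};

int main()
{
    // Pretend d_results was filled by your kernel (0 = no result).
    thrust::device_vector<int> d_results(1000000, 0);
    d_results[10]     = 42;
    d_results[500000] = 7;

    // Compact on the device: copy only the non-zero entries.
    thrust::device_vector<int> d_compacted(d_results.size());
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(d_results.begin(), d_results.end(),
                        d_compacted.begin(), is_nonzero());
    d_compacted.resize(end - d_compacted.begin());

    // Only the compacted data needs to go back to the host.
    thrust::host_vector<int> h_compacted = d_compacted;
    return 0;
}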

Does Thrust require any specific CUDA compute capability version? I only see a requirement for the Driver version.

How effective is stream compaction in terms of speedup?
I have only one kernel that outputs many results. I just need to get those results back and print them.
Right now, if I do the memory writes using atomics in the kernel, my program takes 9 seconds. If I comment out the writing part (and the printing, since there will be no results), it takes less than a second.

If you comment out the writing part of your kernel, it is quite likely that nvcc will prune much of the preceding code (since it now has no effect, why compute it?). A better way to see the effect of the write is to do it twice instead, and measure how much longer your kernel takes.
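
For example (a sketch only; out2/out_count2 would be a second buffer and counter allocated just for this experiment):

__global__ void compact_write_twice(const int *input,
                                    int *out,  unsigned int *out_count,
                                    int *out2, unsigned int *out_count2,
                                    int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;

    int result = input[idx];   // stand-in for the real per-thread work
    if (result != 0) {
        unsigned int slot = atomicAdd(out_count, 1u);
        out[slot] = result;

        // Duplicated write into the second buffer: the extra time this version
        // takes over the original run approximates the cost of one write path.
        unsigned int slot2 = atomicAdd(out_count2, 1u);
        out2[slot2] = result;
    }
}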