Stream compaction in OpenCL

I’m facing a stream compaction problem, exactly as described in sect. 39.3.1 of

My vector is in global memory, and I have to compact it and place the result back in global memory. The article cited above mentions that “The addition of a native scatter in recent GPUs makes stream compaction considerably more efficient”.
Still, I can’t understand the exact meaning of that sentence. Are there native OpenCL C instructions that allow compacting streams in global memory? More generally, what is the best way to compact a vector?
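
To make the question concrete, here is how I currently understand the scan-then-scatter formulation, written as a sequential C reference rather than real OpenCL C. It is only a sketch: the names `keep` and `compact_scan_scatter`, the "keep nonzero elements" predicate, and the `int` element type are my own assumptions for illustration, not anything taken from the article. As far as I can tell, "scatter" just means the write to a computed address, `out[offsets[i]]`, in the second pass.

```c
#include <stddef.h>

/* Hypothetical predicate (an assumption): keep every nonzero element. */
static int keep(int x) { return x != 0; }

/* Sequential reference for scan-then-scatter compaction.
 * in      : input vector of length n
 * out     : output vector, must have room for up to n elements
 * offsets : scratch buffer of length n
 * Returns the number of elements written to out. */
size_t compact_scan_scatter(const int *in, int *out, size_t *offsets, size_t n)
{
    size_t count = 0;

    /* Pass 1: exclusive prefix sum of the keep flags.
     * offsets[i] is the output index of in[i], if in[i] is kept. */
    for (size_t i = 0; i < n; ++i) {
        offsets[i] = count;
        count += keep(in[i]) ? 1 : 0;
    }

    /* Pass 2: scatter. Each kept element is written to its computed
     * position; on a GPU each i would be handled by one work-item. */
    for (size_t i = 0; i < n; ++i) {
        if (keep(in[i]))
            out[offsets[i]] = in[i];
    }

    return count;
}
```

In a parallel version I would expect pass 1 to become a parallel prefix sum and pass 2 a kernel with one work-item per element, but I am not sure whether that is what the quoted sentence has in mind.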