Coalesced global scatter and gather

This question has probably been answered countless times in the forum, but the search tool didn’t pull anything good up.

What’s the efficient way to perform a global scatter or gather? Given scatter indices (i.e. target[scatter[i]] = source[i]), I can simply set the scatter indices as the key and the source data as the values and radix sort. It takes a number of passes, but you get coalesced memory access the whole time. I imagine it is much better than a random scatter.

But what’s an efficient pattern given gather indices? I can turn gather indices into scatter indices by setting the gather indices as the key and the sequence (0, 1, 2, …) as the value and sorting, the resulting values will then be scatter indices. But then I still have to sort a second time to move the actual data around. Two sorts seems excessively expensive.

I came across this paper, but it doesn’t seem like a good idea, as it takes tons of read passes and seems generally unscalable, but possibly I’m just not understanding it.
[url=“http://www.cse.ust.hk/catalac/papers/scatter_sc07.pdf”]http://www.cse.ust.hk/catalac/papers/scatter_sc07.pdf[/url]

Thanks,
.sean