reverse large array

Hi,

I have the problem of reversing a large array of integers in global memory,
which can be done fast if the array fits into shared memory.

Is there any approach for speeding up the trivial solution for an array which
is too large for putting it into shared memory ???

Greetings, Uwe

This should be trivial - have N/2 threads, then each thread, i, reads two locations (i and N - i - 1) and then writes them back out swapped. Of course this will be I/O bound but that’s true of any reasonable approach that you might use for this.

Hi there!

I don’t know what is your exact problem, but running the array in reverse manner wouldn’t be enought? I mean for the operations that you need to do, you start at the end of the array and run the array until you get to the front/begin of it.

This is not a option for you?

Greettings,
Pedro

It will break coalescing on old hardware (pre GT200).
The best strategy is to load subsections of the array in order in shared memory, reverse them in shared memory and write them in order in global.

The question is more subtle than it sounds at first! The problem is easy to solve (just have thread X read from value X and write to value N-X-1) but it’s much harder if you need to do it as fast as possible.

To do this this efficiently you need understand, thoroughly, the requirements of coalesced memory reads and writes.
This is absolutely crucial for the best performance. The solution will likely be different for G80 and G200 GPUs with their different coalescing rules.

In a simple case where the length of the array is a multiple of 32 words, and the base address is aligned to a 16 word boundary, the basic solution is to load in a chunk of 32 sequential words with a warp, then write them back out again in a group of 32. If you’re on G80, you will need to use an intermediate step of reversing the order first by writing them to shared memory and re-reading. (G200 can do this for free when loading without any need for shuffling)

If the array isn’t an east multiple of 32, then the trick is to load in chunks of 32 words, then use a small queue in shared memory to reshuffle the phase of the reads with some left-over stored reads, and write out 32 words again (which consists of some old and some new values.)

A similar trick can be used if you’re not reversing in place, but writing to a new array but that new array has a different address phase.

Anyway, the key to doing it efficiently is to understand coalescing, and using some temporary storage of a few words in shared memory to reshuffle data as needed to keep every read and every write coalesced.