How bad are non-coalesced STORES to global memory?

I’m about to develop a kernel where I need to permute the resulting vector, which resides in shared memory, before writing it back to global memory. (I’m implementing a sequency-ordered Walsh transform by extending the Hadamard-ordered Walsh transform from the SDK.)

I’m aware that reads from global memory should always be coalesced, since the program cannot continue without the value, i.e. it has to wait. When writing a value to global memory, though, it would make sense to me if the program simply continued without waiting the (400-600?) cycles it takes for the value to be written, provided there are no data dependencies (i.e. no subsequent reads from the same global memory).

I also noticed that some SDK examples don’t write to global memory in a coalesced fashion but focus only on coalesced reads.

Basically, what I need to decide is this: given the Hadamard sequence, I need to permute it, e.g. for 8 values, like so:

Hadamard: 1 2 3 4 5 6 7 8

Walsh   : 1 8 4 5 2 7 3 6

Now I can either take the vector in shared memory and store it to global memory with non-coalesced writes all over the place, or permute it on the shared-memory side (lots of bank conflicts) and store it coalesced…
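To make the second option concrete, it would look roughly like this for the toy 8-element case (a sketch only; the names c_perm, s_data, and d_out are placeholders of mine, and the table is just the Walsh order above rewritten 0-based):

// Option 2 sketch: permute on the shared-memory side, store coalesced.
// c_perm[i] gives the Hadamard-order index of the i-th Walsh-order element.
__constant__ int c_perm[8] = {0, 7, 3, 4, 1, 6, 2, 5};

__global__ void storeWalshOrdered(float *d_out)
{
    __shared__ float s_data[8];   // transform result, Hadamard order
    int tid = threadIdx.x;

    // ... transform leaves its result in s_data ...
    __syncthreads();

    // Gather in permuted order from shared memory (bank conflicts are
    // possible here) but write linearly, so the global store is coalesced.
    d_out[blockIdx.x * blockDim.x + tid] = s_data[c_perm[tid]];
}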

Any thoughts?

Do the permutation in shared memory and coalesce the writes to global memory. Writing coalesced is just as important as reading coalesced. To prove it, I just modified my bw_test code (search the forums if you want it) to read coalesced but write uncoalesced: for the test that copies floats from one array to another, bandwidth dropped from 56.8 GiB/s to 8.6 GiB/s (on G80).
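The gist of the comparison, if you don’t want to dig up the actual bw_test source (this is just the shape of the two kernels, not the real code):

// Baseline: coalesced read and coalesced write; thread i copies element i.
__global__ void copyCoalesced(float *d_out, const float *d_in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = d_in[i];
}

// Modified version: the read stays coalesced, but each warp scatters its
// writes, so on G80 the hardware issues one transaction per thread.
__global__ void copyScatteredWrite(float *d_out, const float *d_in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 37) % n;   // arbitrary scatter pattern, chosen for the demo
    d_out[j] = d_in[i];
}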

The issue isn’t so much latency as throughput. Uncoalesced reads (and writes) can’t make use of the entire memory bus width the way coalesced ones can, and each uncoalesced write must be split into several transactions. Things are a little different on compute capability 1.2 and above: you can do permutations like this and still coalesce, as long as each permutation stays within an aligned window of a fixed byte width (see the 2.0b2 programming guide for the details).
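To illustrate the windowed rule (my own example, based on my reading of the guide, not code from it): with 4-byte words, a half-warp’s accesses collapse into one transaction as long as all 16 addresses fall inside the same aligned segment, no matter how they are shuffled within it.

// Hypothetical permuted-but-still-coalesced store on compute >= 1.2:
// each half-warp swaps even/odd neighbours, so all 16 addresses stay
// inside one aligned 64-byte window and coalesce into a single transaction.
__global__ void swizzledStore(float *d_out, const float *d_in)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = i & 15;    // position within the half-warp
    int base = i & ~15;   // start of the 64-byte window
    d_out[base + (lane ^ 1)] = d_in[i];
}

On pre-1.2 hardware the same kernel degrades to one transaction per thread, because there thread k of the half-warp must hit word k exactly.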

Thanks for the fast reply.

In that case I guess I already know how to optimize my other kernels, too. Yay!

Since I’m permuting a 512-element vector, the new coalescing rules are of no use for this operation.
I’ll load all the vector elements from shared memory into 2 registers per thread (I have twice as many elements as threads) and then write them back (with lots of bank conflicts) into shared memory. Simply swapping in place doesn’t work, so I’ll use registers, which I’ll have plenty of once I get my GTX 280…