I’m about to write a kernel that has to permute its result vector, which resides in shared memory, before writing it back to global memory. (I’m implementing a sequency-ordered Walsh transform by extending the SDK’s Hadamard-ordered Walsh transform.)
I’m aware that reads from global memory should always be coalesced, since the program cannot continue without the value, i.e. it has to wait. When writing a value to global memory, though, it would make sense to me if the program simply continued instead of waiting (roughly 400-600 cycles?) for the value to be written, provided that there are no data dependencies (i.e. later reads from the same global memory location).
I also noticed that some SDK examples don’t coalesce their writes to global memory and focus only on coalesced reads.
Basically, what I need to decide is this: given the Hadamard-ordered sequence, I need to permute it, e.g. for 8 values, like so:
Hadamard: 1 2 3 4 5 6 7 8
Walsh:    1 8 4 5 2 7 3 6
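For reference, here is a minimal sketch of how that index mapping could be computed. It assumes the Walsh row gives the 1-based sequency position that each Hadamard-ordered value moves to; under that reading, the destination is the bit-reversed index converted from Gray code to binary. The function names are hypothetical, and the code is plain C++ so it can be checked on the host. In a kernel the same helpers could presumably be marked `__host__ __device__`, and the CUDA intrinsic `__brev` could replace the bit-reversal loop.

```cpp
#include <cstdint>

// Reverse the lowest `bits` bits of x (bits above that are assumed zero).
inline uint32_t reverse_bits(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | ((x >> i) & 1u);
    }
    return r;
}

// Convert a Gray-coded value back to plain binary:
// b = g ^ (g >> 1) ^ (g >> 2) ^ ...
inline uint32_t gray_to_binary(uint32_t g) {
    uint32_t b = g;
    for (uint32_t shifted = g >> 1; shifted != 0; shifted >>= 1) {
        b ^= shifted;
    }
    return b;
}

// Hypothetical helper: 0-based sequency (Walsh) position of the value stored
// at 0-based Hadamard position h, for a transform of length 2^log2n.
// Use this when each thread scatters its own value.
inline uint32_t hadamard_to_sequency(uint32_t h, unsigned log2n) {
    return gray_to_binary(reverse_bits(h, log2n));
}

// Hypothetical helper: the inverse mapping, i.e. which Hadamard position
// holds the value that belongs at sequency position s.
// Use this when each thread gathers the value for its own output slot.
inline uint32_t sequency_to_hadamard(uint32_t s, unsigned log2n) {
    return reverse_bits(s ^ (s >> 1), log2n);
}
```

With `log2n = 3`, `hadamard_to_sequency(h, 3) + 1` for `h = 0..7` gives 1 8 4 5 2 7 3 6, which matches the table above.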
Now I can either take the vector in shared memory and store it to global memory with non-coalesced writes all over the place, or permute it in shared memory (lots of bank conflicts) and then store it coalesced…
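To put a rough number on "lots of bank conflicts", the following host-side sketch counts how many threads of one warp would land in the same shared-memory bank for the shared-memory permutation. It is only a model under stated assumptions: one 4-byte word per element, bank = word index modulo the number of banks, and the warp size and bank count passed in as parameters, since they depend on the hardware generation. The permutation is assumed to be bit reversal combined with Gray code, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Reverse the lowest `bits` bits of x.
inline uint32_t reverse_bits(uint32_t x, unsigned bits) {
    uint32_t r = 0;
    for (unsigned i = 0; i < bits; ++i) {
        r = (r << 1) | ((x >> i) & 1u);
    }
    return r;
}

// Convert a Gray-coded value back to plain binary.
inline uint32_t gray_to_binary(uint32_t g) {
    uint32_t b = g;
    for (uint32_t shifted = g >> 1; shifted != 0; shifted >>= 1) {
        b ^= shifted;
    }
    return b;
}

// Gather: the thread that owns output slot s reads shared[gather_source(s)].
inline uint32_t gather_source(uint32_t s, unsigned log2n) {
    return reverse_bits(s ^ (s >> 1), log2n);
}

// Scatter: the thread that owns input slot h writes shared[scatter_destination(h)].
inline uint32_t scatter_destination(uint32_t h, unsigned log2n) {
    return gray_to_binary(reverse_bits(h, log2n));
}

using IndexFn = uint32_t (*)(uint32_t index, unsigned log2n);

// Worst case, over all warps, of how many threads in a warp touch the same bank.
// Because the mapping is a permutation, every thread touches a distinct word,
// so threads sharing a bank are serialized under this model.
unsigned worst_bank_conflict(IndexFn index_of, unsigned log2n,
                             unsigned warp, unsigned banks) {
    const uint32_t n = 1u << log2n;
    unsigned worst = 0;
    for (uint32_t base = 0; base < n; base += warp) {
        std::vector<unsigned> hits(banks, 0);
        for (uint32_t t = 0; t < warp && base + t < n; ++t) {
            const unsigned count = ++hits[index_of(base + t, log2n) % banks];
            worst = std::max(worst, count);
        }
    }
    return worst;
}
```

For 1024 elements with 32 threads per warp and 32 banks, `worst_bank_conflict(gather_source, 10, 32, 32)` returns 32 and `worst_bank_conflict(scatter_destination, 10, 32, 32)` returns 16 in this model, so a direct permutation in shared memory is close to fully serialized either way. Whether that costs more than non-coalesced global writes would still have to be measured on the actual device.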