Is the uncoalesced memory access penalty higher than that of an extra access to global memory?

Is the penalty involved with uncoalesced memory access higher than the one involved with issuing one additional access to the global memory per thread?

It seems that we can’t have both (coalesced access and a single global memory access per thread) in the algorithm that we are trying to port to CUDA, due to its inherent data dependencies.

As we are a bit short on time to try out all the possibilities, it would be great if somebody could comment on this.

The best way to find this out is to try.

I personally think that it’s better to do two coalesced reads than one uncoalesced read, but there’s no way to be absolutely sure about this without trying both approaches.
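
If time is tight, a small stand-alone timing test may be quicker than porting both variants of the real algorithm. Below is a minimal sketch, not your actual code: the kernel names, the 4096 x 4096 array size and the transposed index pattern are all assumptions made up for illustration, and error checking is left out for brevity. One kernel does two coalesced reads per thread, the other does a single read where neighbouring threads are a whole row apart.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes, chosen only for this illustration.
#define WIDTH  4096
#define HEIGHT 4096

// Variant A: two coalesced reads per thread
// (neighbouring threads read neighbouring words of a and b).
__global__ void twoCoalesced(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Variant B: one uncoalesced read per thread. The array is treated as a
// WIDTH x HEIGHT matrix read in transposed order, so neighbouring threads
// touch addresses WIDTH words apart. The write stays coalesced.
__global__ void oneUncoalesced(const float* a, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (i % HEIGHT) * WIDTH + (i / HEIGHT);
        out[i] = a[j];
    }
}

int main()
{
    const int n = WIDTH * HEIGHT;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float *a, *b, *out;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msA = 0.0f, msB = 0.0f;

    // Warm-up launches so one-time start-up costs are not measured.
    twoCoalesced<<<blocks, threads>>>(a, b, out, n);
    oneUncoalesced<<<blocks, threads>>>(a, out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    twoCoalesced<<<blocks, threads>>>(a, b, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msA, start, stop);

    cudaEventRecord(start);
    oneUncoalesced<<<blocks, threads>>>(a, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msB, start, stop);

    printf("two coalesced reads:  %.3f ms\n", msA);
    printf("one uncoalesced read: %.3f ms\n", msB);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```

Compile it with nvcc and run it on the card you are targeting. The numbers depend heavily on the GPU generation and on how closely the index pattern matches your real access pattern, so replace the index calculation in the second kernel with something closer to what your algorithm does before drawing conclusions.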