Is the penalty for uncoalesced memory access higher than the penalty for issuing one additional global memory access per thread?
It seems that we can’t have both coalesced access and the minimum number of global memory accesses in the algorithm we are trying to port to CUDA, because of its inherent data dependencies.
As we are short on time and cannot try out all the possibilities, it would be great if somebody could comment on this.
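
To make the trade-off concrete, here is a minimal sketch of how the two options could be timed against each other. The kernel names, the stride of 32, and the array size are illustrative assumptions and not our actual algorithm; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant A: one global load per thread, but neighbouring
// threads read addresses `stride` elements apart, so the loads of a warp
// fall into different memory segments and do not coalesce.
__global__ void stridedSingleLoad(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        size_t j = ((size_t)i * stride) % n;  // 64-bit product avoids overflow
        out[i] = in[j];
    }
}

// Hypothetical variant B: two global loads per thread, both coalesced
// (neighbouring threads read neighbouring elements).
__global__ void coalescedTwoLoads(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}

int main()
{
    const int n = 1 << 24;      // assumed problem size
    const int stride = 32;      // assumed stride: one element per warp lane segment
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = (size_t)n * sizeof(float);

    float *a, *b, *out;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&out, bytes);
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msA = 0.0f, msB = 0.0f;

    // Warm-up launches so the timed runs exclude one-time startup costs.
    stridedSingleLoad<<<blocks, threads>>>(a, out, n, stride);
    coalescedTwoLoads<<<blocks, threads>>>(a, b, out, n);
    cudaDeviceSynchronize();

    // Time variant A.
    cudaEventRecord(start);
    stridedSingleLoad<<<blocks, threads>>>(a, out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msA, start, stop);

    // Time variant B.
    cudaEventRecord(start);
    coalescedTwoLoads<<<blocks, threads>>>(a, b, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msB, start, stop);

    printf("uncoalesced, 1 load/thread: %.3f ms\n", msA);
    printf("coalesced, 2 loads/thread:  %.3f ms\n", msB);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```

This should build with `nvcc` as a single `.cu` file. The result will depend on the GPU generation and on the real access pattern, so it only gives a rough indication for the actual algorithm.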