How bad is uncoalesced global memory access on GT200?


In my algorithm I need to access the data in two different ways:
first in row-major order and then in column-major order.

In other words, the 1st kernel writes the data out in the ordinary
way (with stride 1), while the 2nd kernel reads the data
with a large stride (divisible by 32): if you picture the data as a 2D array in memory,
the threads of a block have to access one column of it.

Currently I have no idea how to avoid uncoalesced memory access here,
so my question is: how severe is such an access pattern on GT200 in terms of performance?
Are there any reasonable ways to deal with it? Say, running another kernel between these two
in order to change the global memory layout, going through shared memory first…


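I think that you should try

“threads of a block access a sub-matrix of the 2-D data”

instead of a single column, and utilize shared memory: the block reads its
sub-matrix row by row (stride 1, so the global reads coalesce), and each thread
then picks up its column elements from shared memory, where strided access
costs far less than in global memory.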
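
To make the sub-matrix idea concrete, here is a rough sketch of a tiled transpose that goes through shared memory. It is only an illustration, not something I have timed on GT200: the names `transposeTiled`, `transposeMatchesReference` and `TILE`, the 16x16 tile size and the `float` element type are all my own assumptions. The same load/sync/read pattern could either run as a separate kernel between your two kernels or be folded into the start of the 2nd kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile edge: 16x16 = 256 threads per block

// Transposes a rows x cols row-major matrix `in` into the cols x rows
// row-major matrix `out`. Both the global read and the global write use
// stride 1 across threadIdx.x; the "column" access happens in shared memory.
__global__ void transposeTiled(const float *in, float *out, int rows, int cols)
{
    // One extra column of padding to reduce shared-memory bank conflicts
    // when the tile is read column-wise.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

    __syncthreads();  // the whole tile must be loaded before anyone reads it

    x = blockIdx.y * TILE + threadIdx.x;  // column in the output (input row)
    y = blockIdx.x * TILE + threadIdx.y;  // row in the output (input column)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host helper: runs the kernel on a test matrix and returns
// true if the result equals a transpose computed on the CPU.
bool transposeMatchesReference(int rows, int cols)
{
    std::vector<float> h_in(rows * cols), h_out(rows * cols, 0.0f);
    for (int i = 0; i < rows * cols; ++i)
        h_in[i] = (float)i;

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = sizeof(float) * rows * cols;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, rows, cols);

    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[c * rows + r] != h_in[r * cols + c]) return false;
    return true;
}
```

Calling `transposeMatchesReference(1024, 2048)` from `main` would be a quick sanity check. Whether the extra pass pays off compared to just reading with the large stride is something you would have to measure on your own data sizes.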