In my algorithm I need to access the same data in two different ways:
first in row-major and then in column-major order.
In other words, the 1st kernel writes the data out in the ordinary
way (with stride 1), while the 2nd kernel reads it back
with a large stride (divisible by 32): if you picture the memory as a 2D layout,
the threads of a block have to access one column of the data.
Currently I have no idea how to avoid the uncoalesced memory accesses here.
So my question is: how severe is the performance penalty of such an access pattern on GT200?
Are there any reasonable ways to deal with it? Say, running another kernel between these two
that changes the global memory layout, staging the data through shared memory first…
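
To make that last idea concrete, here is a rough sketch of the intermediate kernel I have in mind: a tiled transpose through shared memory, so that both the global read and the global write run with stride 1. Everything in it is my own assumption rather than tested production code: the names transpose_tiled and transpose_matches_reference are made up, the data is assumed to be float, and the 16x16 tile is chosen because compute 1.x devices coalesce per half-warp and allow at most 512 threads per block. The extra padding column is meant to avoid shared memory bank conflicts when the tile is read back column-wise.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile size: one half-warp wide, 256 threads per block

// Hypothetical intermediate kernel: out (width rows x height cols) = transpose
// of in (height rows x width cols). Both global accesses use stride 1 along x.
__global__ void transpose_tiled(const float* in, float* out, int width, int height)
{
    // +1 column of padding so column-wise reads hit different banks
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;  // row in the input
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;      // column in the output
    y = blockIdx.x * TILE + threadIdx.y;      // row in the output
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host-side check: runs the kernel and compares against a CPU transpose.
bool transpose_matches_reference(int width, int height)
{
    size_t n = (size_t)width * height;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    if (cudaMalloc((void**)&d_in, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void**)&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(d_in, d_out, width, height);
    cudaError_t launch_err = cudaGetLastError();

    cudaError_t copy_err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                      cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return false;

    // Element (r, c) of the input must end up at (c, r) of the output.
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            if (h_out[(size_t)c * height + r] != h_in[(size_t)r * width + c])
                return false;
    return true;
}
```

With something like this in between, the 2nd kernel could read the transposed buffer with stride 1 instead of walking down a column. Calling transpose_matches_reference(64, 96) from main would be a quick sanity check. Whether the extra pass over global memory pays off compared to just living with the strided reads is exactly what I am unsure about.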