I believe that, for a given memory layout, you can only access either rows or columns in a coalesced manner, not both.
To achieve good performance you could try to subdivide your array and let each block read a subarray in which the reads and writes would be partially coalesced. It should be faster than the naive approach.
You could also try to use a texture; maybe due to the texture cache you can gain speedup over scattered reads.
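Here is a rough, untested-on-your-data sketch of what I mean by subdividing. It assumes a row-major float matrix and copies a strip of 16 adjacent columns into a buffer where each column is contiguous. The names (extractColumnStrip, TILE, countMismatches), the layout, and the sizes are all my own assumptions for illustration, not anything from your code. Each block stages a 16x16 tile in shared memory, so both the read and the write touch 16 consecutive floats per row of threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16  // assumed tile width: 16 consecutive floats per row of threads

// Copies columns [col0, col0 + TILE) of a row-major rows x cols matrix into
// out so that each extracted column is contiguous:
//   out[c * rows + r] = in[r * cols + col0 + c]
__global__ void extractColumnStrip(const float* in, float* out,
                                   int rows, int cols, int col0)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    int rowBase = blockIdx.x * TILE;

    // Read: threadIdx.x walks along a matrix row, so neighbouring threads
    // touch neighbouring addresses (coalesced over TILE elements).
    int r = rowBase + threadIdx.y;
    int c = col0 + threadIdx.x;
    if (r < rows && c < cols)
        tile[threadIdx.y][threadIdx.x] = in[r * cols + c];

    __syncthreads();

    // Write: threadIdx.x now walks down an extracted column, which is
    // contiguous in out, so the write is coalesced over TILE elements too.
    int outR = rowBase + threadIdx.x;
    int outC = threadIdx.y;
    if (outR < rows && col0 + outC < cols)
        out[outC * rows + outR] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical host-side check: runs the kernel on a small matrix and
// returns the number of mismatching elements (-1 on a CUDA error).
int countMismatches()
{
    const int rows = 100, cols = 70, col0 = 32;
    const size_t inBytes  = (size_t)rows * cols * sizeof(float);
    const size_t outBytes = (size_t)TILE * rows * sizeof(float);

    float* hIn  = (float*)malloc(inBytes);
    float* hOut = (float*)malloc(outBytes);
    for (int i = 0; i < rows * cols; ++i) hIn[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    if (cudaMalloc((void**)&dIn, inBytes) != cudaSuccess ||
        cudaMalloc((void**)&dOut, outBytes) != cudaSuccess) {
        cudaFree(dIn); free(hIn); free(hOut);
        return -1;
    }
    cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((rows + TILE - 1) / TILE);  // one block per TILE rows of the strip
    extractColumnStrip<<<grid, block>>>(dIn, dOut, rows, cols, col0);

    // The blocking copy waits for the kernel and reports any launch error.
    cudaError_t err = cudaMemcpy(hOut, dOut, outBytes, cudaMemcpyDeviceToHost);

    int bad = (err == cudaSuccess) ? 0 : -1;
    for (int cc = 0; bad >= 0 && cc < TILE; ++cc)
        for (int rr = 0; rr < rows; ++rr)
            if (hOut[cc * rows + rr] != hIn[rr * cols + col0 + cc]) ++bad;

    cudaFree(dIn); cudaFree(dOut);
    free(hIn); free(hOut);
    return bad;
}
```

Calling countMismatches() from main should return 0 if the strip was copied correctly. Note that the shared memory needed is only 16 x 17 floats per block, independent of the matrix height, because the strip is processed 16 rows at a time. If you only need one column out of the strip, most of what is read is wasted, but the reads are at least partially coalesced.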
I am not sure that the approach in the transpose example will help me.
I need to read one column in one block and one row in a different block (actually it can be more than one, but the rows/columns are not adjacent).
If I understand correctly, the “transpose” example approach can work if I read, say, 16 columns at a time, but I will not have enough shared memory to store 16 rows.
Did I misunderstand the example? Is there another way to do it?