Implement 2D matrix transpose using warp shuffle without local memory


For those who might be visiting this thread, please note the previous comment where I said:

However, it’s the nature of such things that people will visit the thread and perhaps try it anyway. For those who decide to implement a block-level transpose, a colleague suggested to me that you might be better served by an “ordinary” shared memory transpose, conceptually similar to what is described here, modified so that each thread block produces its own individual result. I haven’t benchmarked it carefully, but there is good reason to think that it may be faster in some cases than the shuffle/“register only” version depicted in this thread.
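For anyone who wants to try that suggestion, here is a minimal sketch of what such a per-block shared memory transpose might look like. It is not the code from this thread and has not been benchmarked. The names `transposeTiles`, `transposeTilesOnGpu` and `TILE`, the 32x32 `float` tile size, and the data layout (tiles stored back to back, row-major within each tile) are all assumptions made for illustration. Each thread block transposes exactly one tile and writes it back to the same tile position, which is my reading of “block level results”.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile edge: one warp per tile row

// One thread block transposes one TILE x TILE tile.
// Assumed layout: tiles are contiguous, row-major inside each tile.
__global__ void transposeTiles(const float* in, float* out)
{
    // The extra padding column keeps column-wise reads free of bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    const size_t base = static_cast<size_t>(blockIdx.x) * TILE * TILE;
    const int x = threadIdx.x;
    const int y = threadIdx.y;

    tile[y][x] = in[base + y * TILE + x];   // coalesced global read
    __syncthreads();                        // whole tile must be loaded first
    out[base + y * TILE + x] = tile[x][y];  // coalesced global write, transposed shared read
}

// Hypothetical host helper: copies the tiles to the device, launches one
// block per tile, and returns the transposed tiles. Error checking is
// omitted to keep the sketch short.
std::vector<float> transposeTilesOnGpu(const std::vector<float>& h_in, int numTiles)
{
    const size_t bytes = h_in.size() * sizeof(float);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    transposeTiles<<<numTiles, dim3(TILE, TILE)>>>(d_in, d_out);

    std::vector<float> h_out(h_in.size());
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Both the global load and the global store are coalesced here; the transposition itself happens in the shared memory access pattern. Whether this beats the shuffle version will depend on tile size, data type and GPU, so measure both if it matters for your case.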