Does anyone know how row versus column ordering of threads affects the performance of a kernel? For example, is there any performance penalty for using threads (0,0), (1,0), (2,0), …, i.e. all the threads from a ROW inside a thread block, rather than threads (0,0), (0,1), (0,2), …, i.e. a COLUMN of threads inside a thread block?
The same question applies to blocks. In my program, unless I have things backwards in my head or in my code, using blocks of threads to access each ROW of the result matrix performs much better than using COLUMN ordering. I have measured the performance decrease, but I am not sure exactly why it happens.
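
To make the two cases concrete, here is a minimal host-side sketch in plain C++ (not an actual kernel) of the index arithmetic I have in mind. It assumes a row-major matrix where the first thread coordinate picks the column and the second picks the row. The names `elementOffset`, `offsetsVaryingX`, `offsetsVaryingY` and all the sizes are hypothetical, chosen only for illustration.

```cpp
#include <vector>

// Hypothetical sizes, for illustration only.
constexpr int kBlockDimX = 4;  // threads per block along x
constexpr int kBlockDimY = 4;  // threads per block along y
constexpr int kWidth = 8;      // matrix width in elements (row-major storage)

// Element offset touched by thread (tx, ty), assuming tx selects the
// column and ty selects the row of a row-major matrix.
int elementOffset(int tx, int ty) {
    return ty * kWidth + tx;
}

// Offsets touched by threads (0,ty), (1,ty), (2,ty), ...:
// the "ROW of threads" case, where only the x coordinate changes.
std::vector<int> offsetsVaryingX(int ty) {
    std::vector<int> out;
    for (int tx = 0; tx < kBlockDimX; ++tx) {
        out.push_back(elementOffset(tx, ty));
    }
    return out;
}

// Offsets touched by threads (tx,0), (tx,1), (tx,2), ...:
// the "COLUMN of threads" case, where only the y coordinate changes.
std::vector<int> offsetsVaryingY(int tx) {
    std::vector<int> out;
    for (int ty = 0; ty < kBlockDimY; ++ty) {
        out.push_back(elementOffset(tx, ty));
    }
    return out;
}
```

Under these assumptions, `offsetsVaryingX(0)` yields offsets 0, 1, 2, 3 (neighbouring elements), while `offsetsVaryingY(0)` yields 0, 8, 16, 24 (each one a full row apart). I suspect the difference I am seeing is related to this stride, but I have not confirmed it.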