Column-Major Ordering

Hi everyone,

Does anyone know how column ordering affects the performance of a kernel? For example, is there any performance penalty for using, say, threads (0,0), (1,0), (2,0), …, which uses all the threads from a ROW inside a block, rather than using threads (0,0), (0,1), (0,2), …, which is a COLUMN of threads inside a thread block?

The same question applies to blocks. I have a program, and it seems that, unless I have things backwards in my head or in my code, using blocks of threads to access each ROW of my result matrix performs much better than using COLUMN ordering. I have verified that there is a performance decrease, but I am not exactly sure why.

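The important thing is to make sure that global memory accesses get coalesced: threads with consecutive threadIdx.x values (which is how warps are formed) should access consecutive memory addresses. For C-style row-major arrays this usually means that threadIdx.x should be the trailing (fastest-varying) index of the most accessed array.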
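To make that concrete, here is a minimal sketch of the two access patterns on a row-major matrix. The kernel names, the host helper run_scale, and the 32x8 block shape are made up for illustration and are not taken from your code. Both kernels compute the same result; only the mapping of threadIdx.x onto the matrix differs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Row-major layout: element (row, col) lives at data[row * width + col].

// Coalesced: threadIdx.x is the trailing (column) index, so neighbouring
// threads of a warp touch neighbouring addresses.
__global__ void scale_coalesced(float *data, int width, int height, float factor)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] *= factor;
}

// Strided: threadIdx.x is the leading (row) index, so neighbouring threads
// of a warp are `width` floats apart in memory.
__global__ void scale_strided(float *data, int width, int height, float factor)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        data[row * width + col] *= factor;
}

// Hypothetical host helper: runs one variant and checks the result on the CPU.
bool run_scale(bool coalesced, int width, int height)
{
    const float factor = 2.0f;
    std::vector<float> host(static_cast<size_t>(width) * height);
    for (size_t i = 0; i < host.size(); ++i)
        host[i] = static_cast<float>(i % 1000);  // small values, exact in float

    float *dev = nullptr;
    const size_t bytes = host.size() * sizeof(float);
    if (cudaMalloc(&dev, bytes) != cudaSuccess)
        return false;
    bool ok = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    dim3 block(32, 8);  // assumed shape: one warp spans 32 values of threadIdx.x
    if (coalesced) {
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        scale_coalesced<<<grid, block>>>(dev, width, height, factor);
    } else {
        dim3 grid((height + block.x - 1) / block.x, (width + block.y - 1) / block.y);
        scale_strided<<<grid, block>>>(dev, width, height, factor);
    }
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    std::vector<float> result(host.size());
    ok = ok && cudaMemcpy(result.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (size_t i = 0; ok && i < host.size(); ++i)
        ok = (result[i] == host[i] * factor);
    return ok;
}
```

Calling run_scale(true, w, h) and run_scale(false, w, h) from your own main and timing the two (for instance with cudaEvent timers or a profiler) should show whether the slowdown you measured is a coalescing effect. How large the gap is depends on the GPU and the matrix size, so treat this as a test harness rather than a benchmark.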

Other than that, order is mostly irrelevant. In particular, the order in which blocks execute should not matter (apart from more esoteric effects like partition camping, and even that is unlikely to be influenced by block order).