CUDA array memory placement and block/thread format: row-major versus column-major

I would like to find a reference to a definitive statement regarding the “correct” method to place cells into 2D arrays when using CUDA. Also, how are the blocks and threads arranged?

I know that C uses row-major placement, while Fortran and MATLAB use column-major placement.
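To make clear what I mean by the two placements, here is a minimal sketch in plain C (host side only, no CUDA calls) of how element (row, col) of a rows x cols array maps to a linear offset under each convention. The function names `row_major_index`, `col_major_index`, and `layout_self_check` are made up for illustration; they are not from the CUDA documentation or any library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: linear offset of (row, col) when elements of the
 * same row are adjacent in memory (the C convention). */
size_t row_major_index(size_t row, size_t col, size_t rows, size_t cols)
{
    (void)rows; /* the row count is not needed for this layout */
    return row * cols + col;
}

/* Hypothetical helper: linear offset of (row, col) when elements of the
 * same column are adjacent in memory (the Fortran/MATLAB convention). */
size_t col_major_index(size_t row, size_t col, size_t rows, size_t cols)
{
    (void)cols; /* the column count is not needed for this layout */
    return col * rows + row;
}

/* Sanity check on a non-square 2 x 3 array (2 rows, 3 columns). */
void layout_self_check(void)
{
    const size_t rows = 2, cols = 3;

    /* Element (1, 0): first element of the second row. */
    assert(row_major_index(1, 0, rows, cols) == 3);
    assert(col_major_index(1, 0, rows, cols) == 1);

    /* Element (0, 2): last element of the first row. */
    assert(row_major_index(0, 2, rows, cols) == 2);
    assert(col_major_index(0, 2, rows, cols) == 4);
}
```

With 2 rows and 3 columns, the two conventions disagree on every offset except the first and last element, which is why a non-square example makes the difference easy to see.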

I have seen “hints” in these forums, though, that the GPU blocks and threads are arranged in row-major order and that arrays should be laid out in memory the same way.

I can’t find a clear statement in the CUDA documentation I’ve read. It could be there, but I haven’t seen it.

Also, the new book by Kirk & Hwu is unclear on this concept (IMHO).

If you can answer this question and want to use small 2D arrays as examples, please use a non-square array so that it is completely clear which dimension is which.