Best matrix storage method?

I am writing a piece of linear algebra software, and I would like to ask: what is the best method for storing dense matrices?

My current method is like so:

Initialize a matrix with cudaMalloc((void **)&matrix, sizeof(int) * rows * cols), meaning a matrix is simply represented as a series of rows laid out one after another, so we address it with the standard flattening trick: matrix[i][j] becomes matrix[i * cols + j].
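For concreteness, here is a minimal sketch of that layout, assuming a float element type; the names idx and alloc_matrix are illustrative, not from my actual code:

#include <cuda_runtime.h>

// Row-major index: element (i, j) of a rows x cols matrix lives at i * cols + j.
__host__ __device__ inline int idx(int i, int j, int cols) {
    return i * cols + j;
}

// Allocate a rows x cols matrix as one contiguous block of device global memory.
float *alloc_matrix(int rows, int cols) {
    float *d_matrix = nullptr;
    cudaMalloc((void **)&d_matrix, sizeof(float) * rows * cols);
    return d_matrix;
}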

This is great: the matrix lives in global memory with each row stored as a contiguous array right next to the previous one, so we can pull it into shared memory with coalesced memory accesses. Needless to say, this layout is terrible for column operations.
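As a rough illustration of that access pattern, here is a sketch of a tiled copy kernel over the same row-major layout (the TILE size and the kernel name copy_tile are assumptions for the example, not part of my code):

#define TILE 32

__global__ void copy_tile(const float *matrix, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    if (row < rows && col < cols) {
        // Consecutive threadIdx.x values read consecutive addresses within one
        // row, so each warp's load from global memory is coalesced.
        tile[threadIdx.y][threadIdx.x] = matrix[row * cols + col];
    }
    __syncthreads();

    if (row < rows && col < cols) {
        out[row * cols + col] = tile[threadIdx.y][threadIdx.x];
    }
}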

What is the standard method for storing a matrix in CUDA? Is it this, or is there a better strategy?

CUDA's internal multidimensional arrays (used for texturing) are stored in some spatially coherent ordering, such as Morton ordering.
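Purely as an illustration of the idea, a Morton (Z-order) index can be computed by interleaving the bits of the row and column coordinates; this is only a sketch of the ordering, not the actual (undocumented) internal layout of CUDA arrays:

// Interleave the low 16 bits of i and j into a single Z-order index.
__host__ __device__ inline unsigned int morton2d(unsigned int i, unsigned int j) {
    unsigned int index = 0;
    for (unsigned int bit = 0; bit < 16; ++bit) {
        index |= ((j >> bit) & 1u) << (2 * bit);       // j bits go to even positions
        index |= ((i >> bit) & 1u) << (2 * bit + 1);   // i bits go to odd positions
    }
    return index;
}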

For linear algebra work there really isn't a standard, but I think most people have stuck with column-major ordering because it preserves compatibility with a lot of legacy code and algorithms written to expect Fortran-ordered data.
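For reference, a minimal sketch of column-major (Fortran-style) indexing with a leading dimension, which is the convention BLAS-style libraries such as cuBLAS expect (the name col_major_idx is illustrative):

// ld is the leading dimension of the allocated storage (ld >= rows), so the
// elements of each column are contiguous in memory.
__host__ __device__ inline int col_major_idx(int i, int j, int ld) {
    return i + j * ld;
}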

I’m lucky enough to be free from such constraints. I’m glad to see that there is no standard method, and that I didn’t just waste a whole ton of work.