I am writing a piece of linear algebra software, and I would like to ask: what is the best method for storing dense matrices in CUDA?

My current method is like so:

Initialize a matrix with cudaMalloc((void **) &matrix, sizeof(int) * rows * cols), meaning a matrix is represented as a single contiguous block of rows, so we address it using the standard flattening trick: matrix[i][j] becomes matrix[i * cols + j] (the stride between rows is the number of columns).

This works well for row operations: the matrix lives in global memory, and since each row is a contiguous array laid out immediately after the previous one, threads reading along a row generate coalesced memory accesses when loading tiles into shared memory. Needless to say, this layout is terrible for column operations.

What is the standard method for storing a matrix in CUDA? Is it this flat row-major layout, or is there a better strategy?