That is correct. This ensures interoperability with other libraries in the style of BLAS or LAPACK, many (if not all) of which were originally implemented in Fortran, which uses column-major storage.

Sorry, I am not going to debug your code. I am the software engineer who created the original version of CUBLAS back in 2005-2007 and therefore can confirm that CUBLAS follows the column-major storage convention.

The documentation of cublasSetMatrix (cuBLAS) says that both matrices are stored in column-major format. In your analysis, you are assuming that the host matrix is row-major and the cuBLAS matrix is column-major.

The results for that simple case, where you are just copying, look the way they would if both matrices were assumed to be column-major.

So column-major storage does not mean "translate row-major to column-major and store the result in memory"?
It is just a way of calculating the memory location of a[M][N]?

Column-major order: array[col][row] or array[col * nrows + row]
Row-major order: array[row][col] or array[row * ncols + col]
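
To make the two formulas concrete, here is a minimal sketch in plain C (no cuBLAS involved). The function names idx_col_major and idx_row_major are made up for this illustration. Both functions address the same flat buffer; only the offset calculation differs.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (row, col) in a column-major nrows x ncols matrix:
   consecutive elements of one column are adjacent in memory. */
size_t idx_col_major(size_t row, size_t col, size_t nrows, size_t ncols)
{
    assert(row < nrows && col < ncols);
    return col * nrows + row;
}

/* Offset of element (row, col) in a row-major nrows x ncols matrix:
   consecutive elements of one row are adjacent in memory. */
size_t idx_row_major(size_t row, size_t col, size_t nrows, size_t ncols)
{
    assert(row < nrows && col < ncols);
    return row * ncols + col;
}
```

For a buffer holding 1, 2, 3, 4, 5, 6 and a 2 x 3 matrix, the row-major reading gives the rows (1, 2, 3) and (4, 5, 6), while the column-major reading gives the rows (1, 3, 5) and (2, 4, 6). Nothing in memory is moved; only the interpretation changes.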

Translation (= transposition) is only necessary when you change the format between row-major and column-major, not when you stay in the same format.

You can reformulate typical calculations, e.g. A * B = (B^T * A^T)^T, so that a row-major operation can be calculated with a column-major library without any translation/transposition. This works because a row-major matrix, read as column-major, is simply its transpose. Only the parameters change: for the example of a matrix multiplication, you would exchange ncols and nrows and exchange A and B.
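
A rough sketch of that argument swap in plain C follows. gemm_col_major is a hypothetical stand-in for a column-major library routine (it computes C = A * B with no transposes and no alpha/beta scaling); it is not the cuBLAS API. matmul_row_major is likewise a made-up wrapper name. With real cuBLAS, the same swapped arguments would presumably be passed to cublasSgemm with both operations set to CUBLAS_OP_N, but that is not shown or tested here.

```c
#include <assert.h>

/* Hypothetical column-major routine: C (m x n) = A (m x k) * B (k x n).
   lda, ldb, ldc are the leading dimensions (column strides). */
void gemm_col_major(int m, int n, int k,
                    const float *A, int lda,
                    const float *B, int ldb,
                    float *C, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += A[p * lda + i] * B[j * ldb + p];
            C[j * ldc + i] = sum;
        }
    }
}

/* Row-major C (M x N) = A (M x K) * B (K x N), computed with the
   column-major routine and no transposition of any buffer.
   Read as column-major, the row-major buffers are A^T, B^T and C^T,
   so we ask for C^T = B^T * A^T: swap A and B, swap M and N. */
void matmul_row_major(int M, int N, int K,
                      const float *A, const float *B, float *C)
{
    gemm_col_major(N, M, K, B, N, A, K, C, N);
}
```

Usage: with row-major A = {1, 2, 3, 4, 5, 6} (2 x 3) and B = {7, 8, 9, 10, 11, 12} (3 x 2), matmul_row_major(2, 2, 3, A, B, C) fills C with 58, 64, 139, 154, which is A * B in row-major order.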