cuBLAS data layout on the GPU

The NVIDIA documentation says the cuBLAS library uses column-major storage.

But I have a matrix:

1 2 3 4 5
6 7 8 9 10
...
21 22 23 24 25

which I print with this kernel function:

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}

If the layout were column-major, it should print 1 6 11 …, but it still prints 1 2 3 …

Here is the complete code:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <iostream>
#include <algorithm>
#include <numeric>
#include <cstdio>

//single thread print matrix
__global__ void printMatrixWithIndex(int *a, int n)
{
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            printf("%d ", a[(r)*5+(c)]);
        }
        printf("\n");
    }
}
int main()
{
    //test for cublas matrix memory allocation.
    const int n = 5*5;
    // host matrix a and device matrix d_a
    int *a ;
    int *d_a;
    a=new int[n];
    std::iota(a, a + n, 1);
    for(auto r=0;r!=5;++r)
    {
        for(auto c=0;c!=5;++c)
        {
            std::cout << a[(r)*5+(c)] << " ";
        }
        std::cout << std::endl;
    }
    cudaMalloc(&d_a, n*sizeof(int));
    cublasSetMatrix(5, 5, sizeof(int), a, 5, d_a, 5);
    printMatrixWithIndex<<<1, 1>>>(d_a, n);
    // wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();

    //free resource
    cudaFree(d_a);
    delete[] a;
    return 0;
}

If it still prints 1 2 3 4 5, why does the cuBLAS documentation say the data layout is column-major?

That is correct. This ensures interoperability with other libraries in the style of BLAS or LAPACK, many (if not all) of which were originally implemented in Fortran, which uses column-major storage.

If it is column-major, then in the kernel a[1] should be element a[1][0], which is 6, but what gets printed is a[0][1], which is 2.

Sorry, I am not going to debug your code. I am the software engineer who created the original version of CUBLAS back in 2005-2007 and therefore can confirm that CUBLAS follows the column-major storage convention.

Hi foxnotflower,

The documentation of cublasSetMatrix (cuBLAS) says that both matrices are stored in column-major format. In your analysis you are assuming that the host matrix is row-major and the cuBLAS matrix is column-major.

In this simple case, where you are just copying, the result is the same as if both matrices are assumed to be column-major: the copy preserves the linear order of the elements in memory.
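To make the copy behaviour concrete, here is a minimal host-only sketch. copyColumnMajor and copyIotaBuffer are hypothetical helpers written for this post, not cuBLAS functions and not the library's implementation. They only mimic the documented idea of copying a rows x cols column-major matrix column by column with leading dimensions lda and ldb.

```cpp
#include <cstddef>
#include <cstring>
#include <numeric>
#include <vector>

// Hypothetical host-only stand-in for a column-major matrix copy.
// lda and ldb are the leading dimensions: the distance, in elements,
// between the starts of consecutive columns in source and destination.
void copyColumnMajor(int rows, int cols,
                     const int* src, int lda,
                     int* dst, int ldb)
{
    for (int c = 0; c < cols; ++c) {
        // Each column is a contiguous run of `rows` elements.
        std::memcpy(dst + static_cast<std::size_t>(c) * ldb,
                    src + static_cast<std::size_t>(c) * lda,
                    static_cast<std::size_t>(rows) * sizeof(int));
    }
}

// Fills a 5x5 buffer with 1..25 and copies it with lda == ldb == 5,
// mirroring the arguments used in the question.
std::vector<int> copyIotaBuffer()
{
    std::vector<int> src(25);
    std::iota(src.begin(), src.end(), 1);
    std::vector<int> dst(25, 0);
    copyColumnMajor(5, 5, src.data(), 5, dst.data(), 5);
    return dst;
}
```

Because rows equals both leading dimensions here, the columns sit back to back and the destination ends up with the same linear order 1, 2, 3, ..., 25 as the source. Nothing is transposed along the way.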

So column-major storage does not mean "translate from row-major to column-major and then store the result in memory"?
It is just a convention for computing the memory location of a[M][N]?

Yes. In C notation it means:

Column-major order: array[col][row] or array[col * nrows + row]
Row-major order: array[row][col] or array[row * ncols + col]
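As an illustration of the two formulas, here is a small host-only sketch that reads the same 1..25 buffer under both conventions. The helper names (rowMajorIndex, colMajorIndex, readRow, iotaBuffer) are made up for this example. Note that the kernel in the question indexes with r*5+c, which is the row-major formula, so it walks the buffer in linear order.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Row-major: the elements of one row are adjacent in memory.
inline std::size_t rowMajorIndex(std::size_t row, std::size_t col, std::size_t ncols)
{
    return row * ncols + col;
}

// Column-major: the elements of one column are adjacent in memory.
inline std::size_t colMajorIndex(std::size_t row, std::size_t col, std::size_t nrows)
{
    return col * nrows + row;
}

// Buffer holding 1..25, as in the question.
std::vector<int> iotaBuffer()
{
    std::vector<int> buf(25);
    std::iota(buf.begin(), buf.end(), 1);
    return buf;
}

// Returns logical row `row` of a 5x5 matrix stored in `buf`,
// interpreting the buffer under the chosen convention.
std::vector<int> readRow(const std::vector<int>& buf, std::size_t row, bool columnMajor)
{
    const std::size_t n = 5;
    std::vector<int> out;
    for (std::size_t col = 0; col < n; ++col) {
        const std::size_t idx = columnMajor ? colMajorIndex(row, col, n)
                                            : rowMajorIndex(row, col, n);
        out.push_back(buf[idx]);
    }
    return out;
}
```

With this buffer, readRow(iotaBuffer(), 0, false) gives 1 2 3 4 5 and readRow(iotaBuffer(), 0, true) gives 1 6 11 16 21. The memory is identical in both cases; only the index formula differs.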

Translation (i.e. transposing) is only necessary when you change the format between row-major and column-major, not when you stay in the same format.

You can reformulate typical calculations, e.g. A * B = (B^T * A^T)^T, so that a row-major operation can be computed with a column-major library without any translation/transposition. Only the parameters change: for a matrix multiplication, you would swap ncols with nrows and swap A with B.
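Here is a minimal host-only sketch of that trick. gemmColumnMajor is a plain reference loop written for this post as a stand-in for a column-major GEMM routine (it is not cuBLAS code), and multiplyRowMajor is a hypothetical wrapper. The same argument swap is what you would apply when calling a column-major GEMM such as cublasSgemm.

```cpp
#include <cstddef>
#include <vector>

// Reference column-major GEMM: C (m x n) = A (m x k) * B (k x n).
// lda, ldb, ldc are the leading dimensions of the column-major buffers.
void gemmColumnMajor(int m, int n, int k,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float* C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p) {
                sum += A[p * lda + i] * B[j * ldb + p];
            }
            C[j * ldc + i] = sum;
        }
    }
}

// Row-major C (m x n) = A (m x k) * B (k x n), using only the
// column-major routine. A row-major buffer read as column-major is the
// transposed matrix, so we compute C^T = B^T * A^T by swapping the
// operands and swapping m with n. No element is ever moved.
std::vector<float> multiplyRowMajor(int m, int n, int k,
                                    const std::vector<float>& A,
                                    const std::vector<float>& B)
{
    std::vector<float> C(static_cast<std::size_t>(m) * n, 0.0f);
    gemmColumnMajor(n, m, k,
                    B.data(), n,   // B^T is n x k, leading dimension n
                    A.data(), k,   // A^T is k x m, leading dimension k
                    C.data(), n);  // C^T is n x m, leading dimension n
    return C;
}
```

For example, with the row-major matrices A = [1 2 3; 4 5 6] and B = [7 8; 9 10; 11 12], multiplyRowMajor(2, 2, 3, A, B) returns the row-major result [58 64; 139 154].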
