CUBLAS: Very low occupancy

Hi

I need to invert a double-precision matrix of size 2144 x 2144 by LU decomposition. This is done with the cuBLAS library:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

void invertMatrix(cublasHandle_t cublasHandle, double* src, double* dst, int n)
{
    int* permutation;   // device pivot array (n entries for batch size 1)
    int* info;          // device info flag, reused by both calls

    int h_info1[1];
    int h_info2[1];

    cudaMalloc(&permutation, n * sizeof(int));
    cudaMalloc(&info, sizeof(int));

    // The batched API expects a device array of device pointers,
    // so the single matrix pointer is staged through host memory.
    double** A_h = (double **)malloc(sizeof(double *));
    double** A_d;
    cudaMalloc(&A_d, sizeof(double *));

    A_h[0] = src;

    cudaMemcpy(A_d, A_h, sizeof(double *), cudaMemcpyHostToDevice);

    // In-place LU factorization of the n x n matrix (batch size 1)
    cublasDgetrfBatched(cublasHandle, n, A_d, n, permutation, info, 1);

    cudaMemcpyAsync(h_info1, info, sizeof(int), cudaMemcpyDeviceToHost);

    double** C_h = (double **)malloc(sizeof(double *));
    double** C_d;
    cudaMalloc(&C_d, sizeof(double *));
    C_h[0] = dst;

    cudaMemcpy(C_d, C_h, sizeof(double *), cudaMemcpyHostToDevice);

    // Compute the inverse from the LU factors into dst
    cublasDgetriBatched(cublasHandle, n, (const double **)A_d, n, permutation, C_d, n, info, 1);

    cudaMemcpyAsync(h_info2, info, sizeof(int), cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated

    if (h_info1[0] != 0 || h_info2[0] != 0)
    {
        fprintf(stderr, "Factorization of matrix failed: matrix may be singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }

    cudaFree(A_d);
    free(A_h);
    cudaFree(C_d);
    free(C_h);
    cudaFree(permutation);
    cudaFree(info);
}

The code computes the correct result, but the performance is very bad: cuBLAS needs 10 seconds to compute the inverse, while MATLAB needs only 360 milliseconds. I am using a GeForce Titan X.

When I looked at the timeline using Nsight, I saw that the kernels getri_panel and getrf_panel only achieve 6.2% occupancy (56.2% theoretical) and use a block size of [128,1,1].
My program also uses a few kernels that I wrote myself; these achieve 75% occupancy. Why is cuBLAS performing so badly? Is there a way to tell cuBLAS what block size it should use?

cuBLAS getrfBatched and getriBatched are not intended to deliver high performance for a single large matrix. Their intended use is for a large number of “small” matrices, where the kernel launch overhead would be a significant contributor to the overall execution time if they were inverted one by one.

This is mentioned in the cuBLAS documentation:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-getrfbatched

“This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.”
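
To make the intended usage concrete, here is a minimal sketch that factors and inverts a whole batch of small matrices with just two kernel launches. The function name, the 16x16-ish matrix size, the contiguous batch layout, and the omission of status checks are illustrative assumptions, not a prescribed pattern:

#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical sketch: invert `batch` independent n x n matrices (n small,
// e.g. 16) in one pair of calls. d_A holds the batch contiguously on the
// device and is overwritten by the LU factors; d_C receives the inverses.
// Status/info checking is omitted for brevity.
void invertSmallBatch(cublasHandle_t handle, double* d_A, double* d_C,
                      int n, int batch)
{
    // Build the device array of per-matrix pointers the batched API expects.
    double** h_ptrs = (double**)malloc(2 * batch * sizeof(double*));
    for (int i = 0; i < batch; ++i) {
        h_ptrs[i]         = d_A + (size_t)i * n * n;   // inputs
        h_ptrs[batch + i] = d_C + (size_t)i * n * n;   // outputs
    }
    double** d_ptrs;
    cudaMalloc(&d_ptrs, 2 * batch * sizeof(double*));
    cudaMemcpy(d_ptrs, h_ptrs, 2 * batch * sizeof(double*), cudaMemcpyHostToDevice);

    int *d_pivots, *d_infos;
    cudaMalloc(&d_pivots, (size_t)batch * n * sizeof(int));
    cudaMalloc(&d_infos, batch * sizeof(int));

    // One launch factors every matrix; a second computes every inverse.
    cublasDgetrfBatched(handle, n, d_ptrs, n, d_pivots, d_infos, batch);
    cublasDgetriBatched(handle, n, (const double**)d_ptrs, n, d_pivots,
                        d_ptrs + batch, n, d_infos, batch);

    cudaFree(d_ptrs); cudaFree(d_pivots); cudaFree(d_infos);
    free(h_ptrs);
}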

You have no control over cuBLAS block sizes or other low-level configuration.

You might try LU factorization with the cuSOLVER library instead:

http://docs.nvidia.com/cuda/cusolver/index.html#abstract

Sample code is available:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cusolverdn-linear-solver-
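
For a single large matrix like yours, a rough sketch of the cuSOLVER route could look like the following. The function name is made up; it assumes a cusolverDnHandle_t already exists, factors d_A in place, and recovers the inverse by solving A * X = I. Error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical sketch: invert an n x n double matrix with cuSOLVER.
// d_A is overwritten by its LU factors; d_Inv receives the inverse.
void invertMatrixCusolver(cusolverDnHandle_t handle, double* d_A, double* d_Inv, int n)
{
    int lwork = 0;
    cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork);

    double* d_work;
    int* d_ipiv;
    int* d_info;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_ipiv, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));

    // LU factorization with partial pivoting (pivots land in d_ipiv)
    cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);

    // Set d_Inv to the identity, then solve A * X = I so that X = inv(A).
    // (A dedicated kernel would set the diagonal faster than this loop.)
    cudaMemset(d_Inv, 0, (size_t)n * n * sizeof(double));
    const double one = 1.0;
    for (int i = 0; i < n; ++i)
        cudaMemcpy(d_Inv + (size_t)i * n + i, &one, sizeof(double),
                   cudaMemcpyHostToDevice);

    cusolverDnDgetrs(handle, CUBLAS_OP_N, n, n, d_A, n, d_ipiv, d_Inv, n, d_info);

    cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
}

Solving against an identity right-hand side is the standard way to get an explicit inverse out of getrf/getrs. If you only need the inverse to solve linear systems, solving those systems directly with getrs is usually faster and more accurate than forming the inverse at all.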

To give some rough guidance as to what “small” means, it is highly advisable to look at these batched calls if your matrices are smaller than 32x32. There are surprisingly many real-life scenarios where one deals with a lot of very small matrices like that.

Since the relative performance of classical BLAS calls versus these batched calls plays out differently on different GPUs and with different matrix and batch sizes, try it both ways when in doubt and check which is faster.
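
When timing the two variants, CUDA events give a simple device-side stopwatch. A minimal sketch, assuming the invertMatrix function above and device buffers d_src and d_dst:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
invertMatrix(cublasHandle, d_src, d_dst, n);   // or the cuSOLVER variant
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the GPU to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("inversion took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);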

Thanks a lot! That’s exactly what I was looking for. I’ll try the cusolver library.