CUBLAS: Very low occupancy

Hi

I need to invert a double-precision matrix of size 2144 x 2144 by LU decomposition. This is done with the cuBLAS library:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

void invertMatrix(cublasHandle_t cublasHandle, double* src, double* dst, int n)
{
    int* permutation;   // device pivot array (n entries for batch size 1)
    int* info;          // device info flag, reused by both calls

    int h_info1[1];
    int h_info2[1];

    cudaMalloc(&permutation, n * sizeof(int));
    cudaMalloc(&info, sizeof(int));

    // The batched API expects a device array of device pointers,
    // so the single matrix pointer is staged through host memory.
    double** A_h = (double **)malloc(sizeof(double *));
    double** A_d;
    cudaMalloc(&A_d, sizeof(double *));

    A_h[0] = src;

    cudaMemcpy(A_d, A_h, sizeof(double *), cudaMemcpyHostToDevice);

    // In-place LU factorization of the n x n matrix (batch size 1)
    cublasDgetrfBatched(cublasHandle, n, A_d, n, permutation, info, 1);

    cudaMemcpyAsync(h_info1, info, sizeof(int), cudaMemcpyDeviceToHost);

    double** C_h = (double **)malloc(sizeof(double *));
    double** C_d;
    cudaMalloc(&C_d, sizeof(double *));
    C_h[0] = dst;

    cudaMemcpy(C_d, C_h, sizeof(double *), cudaMemcpyHostToDevice);

    // Compute the inverse from the LU factors into dst
    cublasDgetriBatched(cublasHandle, n, (const double **)A_d, n, permutation, C_d, n, info, 1);

    cudaMemcpyAsync(h_info2, info, sizeof(int), cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated

    if (h_info1[0] != 0 || h_info2[0] != 0)
    {
        fprintf(stderr, "Factorization of matrix failed: matrix may be singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }

    cudaFree(A_d);
    free(A_h);
    cudaFree(C_d);
    free(C_h);
    cudaFree(permutation);
    cudaFree(info);
}

The code computes the correct result, but the performance is very bad: cuBLAS needs 10 seconds to compute the inverse, while MATLAB needs only 360 milliseconds. I am using a GeForce Titan X.

When I looked at the timeline using Nsight, I saw that the kernels getri_panel and getrf_panel only achieve 6.2% occupancy (56.2% theoretical) and use a block size of [128,1,1].
My program also uses a few kernels that I wrote myself; these achieve 75% occupancy. Why is cuBLAS performing so badly? Is there a way to tell cuBLAS what block size it should use?

cuBLAS getrfBatched and getriBatched are not intended to deliver high performance for a single large matrix. Their intended use is for a large number of “small” matrices, where the kernel launch overhead would be a significant contributor to the overall execution time if they were inverted one by one.

This is mentioned in the cuBLAS documentation:

http://docs.nvidia.com/cuda/cublas/index.html#cublas-lt-t-gt-getrfbatched

“This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.”
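
To make the intended usage concrete, here is a minimal sketch that factors and inverts a whole batch of small matrices with just two kernel launches. The function name, the 16x16-ish matrix size, the contiguous batch layout, and the omission of status checks are illustrative assumptions, not a prescribed pattern:

#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical sketch: invert `batch` independent n x n matrices (n small,
// e.g. 16) in one pair of calls. d_A holds the batch contiguously on the
// device and is overwritten by the LU factors; d_C receives the inverses.
// Status/info checking is omitted for brevity.
void invertSmallBatch(cublasHandle_t handle, double* d_A, double* d_C,
                      int n, int batch)
{
    // Build the device array of per-matrix pointers the batched API expects.
    double** h_ptrs = (double**)malloc(2 * batch * sizeof(double*));
    for (int i = 0; i < batch; ++i) {
        h_ptrs[i]         = d_A + (size_t)i * n * n;   // inputs
        h_ptrs[batch + i] = d_C + (size_t)i * n * n;   // outputs
    }
    double** d_ptrs;
    cudaMalloc(&d_ptrs, 2 * batch * sizeof(double*));
    cudaMemcpy(d_ptrs, h_ptrs, 2 * batch * sizeof(double*), cudaMemcpyHostToDevice);

    int *d_pivots, *d_infos;
    cudaMalloc(&d_pivots, (size_t)batch * n * sizeof(int));
    cudaMalloc(&d_infos, batch * sizeof(int));

    // One launch factors every matrix; a second computes every inverse.
    cublasDgetrfBatched(handle, n, d_ptrs, n, d_pivots, d_infos, batch);
    cublasDgetriBatched(handle, n, (const double**)d_ptrs, n, d_pivots,
                        d_ptrs + batch, n, d_infos, batch);

    cudaFree(d_ptrs); cudaFree(d_pivots); cudaFree(d_infos);
    free(h_ptrs);
}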

You have no control over cuBLAS block sizes or other low-level configuration.

You might try LU factorization with the cuSOLVER library instead:

http://docs.nvidia.com/cuda/cusolver/index.html#abstract

Sample code is available:

http://docs.nvidia.com/cuda/cuda-samples/index.html#cusolverdn-linear-solver-
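
For a single large matrix like yours, a rough sketch of the cuSOLVER route could look like the following. The function name is made up; it assumes a cusolverDnHandle_t already exists, factors d_A in place, and recovers the inverse by solving A * X = I. Error checking is omitted for brevity:

#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical sketch: invert an n x n double matrix with cuSOLVER.
// d_A is overwritten by its LU factors; d_Inv receives the inverse.
void invertMatrixCusolver(cusolverDnHandle_t handle, double* d_A, double* d_Inv, int n)
{
    int lwork = 0;
    cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork);

    double* d_work;
    int* d_ipiv;
    int* d_info;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_ipiv, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));

    // LU factorization with partial pivoting (pivots land in d_ipiv)
    cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);

    // Set d_Inv to the identity, then solve A * X = I so that X = inv(A).
    // (A dedicated kernel would set the diagonal faster than this loop.)
    cudaMemset(d_Inv, 0, (size_t)n * n * sizeof(double));
    const double one = 1.0;
    for (int i = 0; i < n; ++i)
        cudaMemcpy(d_Inv + (size_t)i * n + i, &one, sizeof(double),
                   cudaMemcpyHostToDevice);

    cusolverDnDgetrs(handle, CUBLAS_OP_N, n, n, d_A, n, d_ipiv, d_Inv, n, d_info);

    cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
}

Solving against an identity right-hand side is the standard way to get an explicit inverse out of getrf/getrs. If you only need the inverse to solve linear systems, solving those systems directly with getrs is usually faster and more accurate than forming the inverse at all.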

To give some rough guidance as to what “small” means, it is highly advisable to look at these batched calls if your matrices are smaller than 32x32. There are surprisingly many real-life scenarios where one deals with a lot of very small matrices like that.

Since the relative performance of classical BLAS calls versus these batched calls plays out differently on different GPUs and with different matrix and batch sizes, try it both ways when in doubt and check which is faster.
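
When timing the two variants, CUDA events give a simple device-side stopwatch. A minimal sketch, assuming the invertMatrix function above and device buffers d_src and d_dst:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
invertMatrix(cublasHandle, d_src, d_dst, n);   // or the cuSOLVER variant
cudaEventRecord(stop);
cudaEventSynchronize(stop);                    // wait for the GPU to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("inversion took %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);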

Thanks a lot! That’s exactly what I was looking for. I’ll try the cusolver library.