Matrix Inversion with cublasSgetri

Hello, I’m trying to invert a matrix using cublasSgetriBatched and I’m getting memory errors. The code fails when I call cublasSgetrfBatched. I suspect a problem with the formats (in particular with the array of pointers), since I consistently get a memory error. Any help would be greatly appreciated. Here is the code (I’m compiling it on Windows):

extern "C" __declspec(dllexport) void LUdec(float *a, float *b, int *ra, int *PP, int *inf);
void LUdec(float *a, float *b, int *ra, int *PP, int *inf) {

// Start CUBLAS
cublasHandle_t handle;
cublasCreate_v2(&handle);

int nra = ra[0];
int batchsize = 1;
	
int *INFO;
int *P;
cudaMalloc<int>(&P, batchsize * nra * sizeof(int));
cudaMalloc<int>(&INFO, batchsize * sizeof(int));
	
int lda = nra;

float *aa[1];
aa[0] = a;
float **d_a;

// Allocate Memory on the Device and Copy Matrix
cudaMalloc<float*>(&d_a, sizeof(aa));
cudaMemcpy(d_a, aa, sizeof(aa), cudaMemcpyHostToDevice);

// LU Decomposition
cublasSgetrfBatched(handle, nra, d_a, lda, P, INFO, batchsize);

You might be interested in this recent question on the same subject:

[url]https://devtalk.nvidia.com/default/topic/761729/gpu-accelerated-libraries/cublasdgetrfbatched-and-cublasdgetribatched/[/url]

Thank you very much for the answer. The code works great for one matrix. However, I can’t get it working for multiple matrices. It might be an issue with row-major vs. column-major storage, but I can’t figure it out. My code is pasted below. Any help would be, again, greatly appreciated.

#define batchsize 2

void invert_device(double (*d_a)[batchsize], double (*d_b)[batchsize], int *nra)
{
cublasHandle_t handle;
cublasCreate_v2(&handle);

int *P;
int *INFO;

cudaMalloc<int>(&P, nra[0] * batchsize * sizeof(int));
cudaMalloc<int>(&INFO, batchsize * sizeof(int));

int lda = nra[0];

double *aa[batchsize];
aa[0] = d_a[0];
aa[1] = d_a[1];

double** d_aa;

// Copy Matrix to Device
cudaMalloc<double *>(&d_aa, batchsize * nra[0] * nra[0] * sizeof(double));
cudaMemcpy(d_aa, aa, batchsize * nra[0] * nra[0] * sizeof(double),cudaMemcpyHostToDevice);

// Perform LU Decomposition
cublasDgetrfBatched(handle, nra[0], d_aa, lda, P, INFO, batchsize);

double *bb[batchsize];
bb[0] = d_b[0];
bb[1] = d_b[1];

double** d_bb;

// Allocate Space for Inverse Matrix
cudaMalloc<double *>(&d_bb, batchsize * nra[0] * nra[0] * sizeof(double));
cudaMemcpy(d_bb, bb, batchsize * nra[0] * nra[0] * sizeof(double), cudaMemcpyHostToDevice);

// Compute Inverse
cublasDgetriBatched(handle, nra[0], d_aa, lda, P, d_bb, nra[0], INFO, batchsize);

cudaFree(P);
cudaFree(INFO);
cublasDestroy_v2(handle);

}

If you want to provide a complete, compilable code that supplies test values and checks the accuracy of the results, I’ll take a look. Also, please define “I can’t get it working”. I don’t see any proper CUDA or CUBLAS error checking. Does the code produce an error if you run it with cuda-memcheck? Or is it just that the numerical results are not what you expected?

Here’s a fully worked example showing inversion of four 3x3 matrices:

$ cat t540.cu
#include <stdio.h>
#include <stdlib.h>
#include <cublas_v2.h>

#define cudacall(call)                                                                                                          \
    do                                                                                                                          \
    {                                                                                                                           \
        cudaError_t err = (call);                                                                                               \
        if(cudaSuccess != err)                                                                                                  \
        {                                                                                                                       \
            fprintf(stderr,"CUDA Error:\nFile = %s\nLine = %d\nReason = %s\n", __FILE__, __LINE__, cudaGetErrorString(err));    \
            cudaDeviceReset();                                                                                                  \
            exit(EXIT_FAILURE);                                                                                                 \
        }                                                                                                                       \
    }                                                                                                                           \
    while (0)

#define cublascall(call)                                                                                        \
    do                                                                                                          \
    {                                                                                                           \
        cublasStatus_t status = (call);                                                                         \
        if(CUBLAS_STATUS_SUCCESS != status)                                                                     \
        {                                                                                                       \
            fprintf(stderr,"CUBLAS Error:\nFile = %s\nLine = %d\nCode = %d\n", __FILE__, __LINE__, status);     \
            cudaDeviceReset();                                                                                  \
            exit(EXIT_FAILURE);                                                                                 \
        }                                                                                                       \
                                                                                                                \
    }                                                                                                           \
    while(0)


void invert(float** src, float** dst, int n, int batchSize)
{
    cublasHandle_t handle;
    cublascall(cublasCreate_v2(&handle));

    int *P, *INFO;

    cudacall(cudaMalloc(&P, n * batchSize * sizeof(int)));
    cudacall(cudaMalloc(&INFO,  batchSize * sizeof(int)));

    int lda = n;

    float **A = (float **)malloc(batchSize*sizeof(float *));
    float **A_d, *A_dflat;
    cudacall(cudaMalloc(&A_d,batchSize*sizeof(float *)));
    cudacall(cudaMalloc(&A_dflat, n*n*batchSize*sizeof(float)));
    A[0] = A_dflat;
    for (int i = 1; i < batchSize; i++)
      A[i] = A[i-1]+(n*n);
    cudacall(cudaMemcpy(A_d,A,batchSize*sizeof(float *),cudaMemcpyHostToDevice));
    for (int i = 0; i < batchSize; i++)
      cudacall(cudaMemcpy(A_dflat+(i*n*n), src[i], n*n*sizeof(float), cudaMemcpyHostToDevice));

    cublascall(cublasSgetrfBatched(handle,n,A_d,lda,P,INFO,batchSize));

    int INFOh[batchSize];
    cudacall(cudaMemcpy(INFOh,INFO,batchSize*sizeof(int),cudaMemcpyDeviceToHost));

    for (int i = 0; i < batchSize; i++)
      if(INFOh[i]  != 0)
      {
        fprintf(stderr, "Factorization of matrix %d Failed: Matrix may be singular\n", i);
        cudaDeviceReset();
        exit(EXIT_FAILURE);
      }

    float **C = (float **)malloc(batchSize*sizeof(float *));
    float **C_d, *C_dflat;
    cudacall(cudaMalloc(&C_d,batchSize*sizeof(float *)));
    cudacall(cudaMalloc(&C_dflat, n*n*batchSize*sizeof(float)));
    C[0] = C_dflat;
    for (int i = 1; i < batchSize; i++)
      C[i] = C[i-1] + (n*n);
    cudacall(cudaMemcpy(C_d,C,batchSize*sizeof(float *),cudaMemcpyHostToDevice));
    cublascall(cublasSgetriBatched(handle,n,(const float **)A_d,lda,P,C_d,lda,INFO,batchSize));

    cudacall(cudaMemcpy(INFOh,INFO,batchSize*sizeof(int),cudaMemcpyDeviceToHost));

    for (int i = 0; i < batchSize; i++)
      if(INFOh[i] != 0)
      {
        fprintf(stderr, "Inversion of matrix %d Failed: Matrix may be singular\n", i);
        cudaDeviceReset();
        exit(EXIT_FAILURE);
      }
    for (int i = 0; i < batchSize; i++)
      cudacall(cudaMemcpy(dst[i], C_dflat + (i*n*n), n*n*sizeof(float), cudaMemcpyDeviceToHost));
    cudaFree(A_d); cudaFree(A_dflat); free(A);
    cudaFree(C_d); cudaFree(C_dflat); free(C);
    cudaFree(P); cudaFree(INFO); cublasDestroy_v2(handle);
}


void test_invert()
{
    const int n = 3;
    const int mybatch = 4;

    //Random matrix with full pivots
    float full_pivot[n*n] = { 0.5, 3, 4,
                                1, 3, 10,
                                4 , 9, 16 };

    //Almost same as above matrix with first pivot zero
    float zero_pivot[n*n] = { 0, 3, 4,
                              1, 3, 10,
                              4 , 9, 16 };

    float another_zero_pivot[n*n] = { 0, 3, 4,
                                      1, 5, 6,
                                      9, 8, 2 };

    float another_full_pivot[n * n] = { 22, 3, 4,
                                        1, 5, 6,
                                        9, 8, 2 };

    float *result_flat = (float *)malloc(mybatch*n*n*sizeof(float));
    float **results = (float **)malloc(mybatch*sizeof(float *));
    for (int i = 0; i < mybatch; i++)
      results[i] = result_flat + (i*n*n);
    float **inputs = (float **)malloc(mybatch*sizeof(float *));
    inputs[0]  = zero_pivot;
    inputs[1]  = full_pivot;
    inputs[2]  = another_zero_pivot;
    inputs[3]  = another_full_pivot;

    for (int qq = 0; qq < mybatch; qq++){
      fprintf(stdout, "Input %d:\n\n", qq);
      for(int i=0; i<n; i++)
      {
        for(int j=0; j<n; j++)
            fprintf(stdout,"%f\t",inputs[qq][i*n+j]);
        fprintf(stdout,"\n");
      }
    }
    fprintf(stdout,"\n\n");

    invert(inputs, results, n, mybatch);

    for (int qq = 0; qq < mybatch; qq++){
      fprintf(stdout, "Inverse %d:\n\n", qq);
      for(int i=0; i<n; i++)
      {
        for(int j=0; j<n; j++)
            fprintf(stdout,"%f\t",results[qq][i*n+j]);
        fprintf(stdout,"\n");
      }
    }
}

int main()
{
    test_invert();

    return 0;
}
$ nvcc -arch=sm_20 -o t540 t540.cu -lcublas
$ ./t540
Input 0:

0.000000        3.000000        4.000000
1.000000        3.000000        10.000000
4.000000        9.000000        16.000000
Input 1:

0.500000        3.000000        4.000000
1.000000        3.000000        10.000000
4.000000        9.000000        16.000000
Input 2:

0.000000        3.000000        4.000000
1.000000        5.000000        6.000000
9.000000        8.000000        2.000000
Input 3:

22.000000       3.000000        4.000000
1.000000        5.000000        6.000000
9.000000        8.000000        2.000000


Inverse 0:

-0.700000       -0.200000       0.300000
0.400000        -0.266667       0.066667
-0.050000       0.200000        -0.050000
Inverse 1:

-1.076923       -0.307692       0.461538
0.615385        -0.205128       -0.025641
-0.076923       0.192308        -0.038462
Inverse 2:

-4.750000       3.250000        -0.250000
6.500000        -4.500000       0.500000
-4.625000       3.375000        -0.375000
Inverse 3:

0.045894        -0.031401       0.002415
-0.062802       -0.009662       0.154589
0.044686        0.179952        -0.129227
$

I checked the results with Wolfram Alpha:

http://www.wolframalpha.com/widgets/view.jsp?id=35f68681262e42ea89b0834caa51635b

They seem to be correct.

Thank you very much !

If you are doing this repetitively, this can be further optimized by managing the allocations separately.

Do you mean looping over the batch size? When the matrices are big, I have found it faster to run the same procedure with smaller batches.

No, I mean if you repeatedly have to invert 4 matrices, for example, then you might not want to use the function exactly as I have written it. You might want to leave allocations like A_dflat and C_dflat, and perhaps others, permanently in place, and just re-use them.
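That reuse idea could be sketched roughly like this. This is a minimal, unchecked sketch (error checking omitted for brevity), assuming a fixed n and batchSize; the `InvertPlan` struct and function names are made up for illustration, not part of any CUDA API:

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper that keeps the device allocations from invert()
// alive across calls; only the matrix contents move on each call.
struct InvertPlan {
    cublasHandle_t handle;
    int n, batchSize;
    float **A_d, *A_dflat;   // device pointer array + flat matrix storage
    float **C_d, *C_dflat;
    int *P, *INFO;
};

void plan_create(InvertPlan *p, int n, int batchSize)
{
    p->n = n; p->batchSize = batchSize;
    cublasCreate_v2(&p->handle);
    cudaMalloc(&p->A_dflat, n * n * batchSize * sizeof(float));
    cudaMalloc(&p->C_dflat, n * n * batchSize * sizeof(float));
    cudaMalloc(&p->A_d, batchSize * sizeof(float *));
    cudaMalloc(&p->C_d, batchSize * sizeof(float *));
    cudaMalloc(&p->P, n * batchSize * sizeof(int));
    cudaMalloc(&p->INFO, batchSize * sizeof(int));
    // The pointer arrays never change, so set them up once
    float **A = (float **)malloc(batchSize * sizeof(float *));
    float **C = (float **)malloc(batchSize * sizeof(float *));
    for (int i = 0; i < batchSize; i++) {
        A[i] = p->A_dflat + i * n * n;
        C[i] = p->C_dflat + i * n * n;
    }
    cudaMemcpy(p->A_d, A, batchSize * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(p->C_d, C, batchSize * sizeof(float *), cudaMemcpyHostToDevice);
    free(A); free(C);
}

// Per-call work: copy inputs in, factorize, invert, copy results out
void plan_invert(InvertPlan *p, float **src, float **dst)
{
    int n = p->n;
    for (int i = 0; i < p->batchSize; i++)
        cudaMemcpy(p->A_dflat + i * n * n, src[i], n * n * sizeof(float), cudaMemcpyHostToDevice);
    cublasSgetrfBatched(p->handle, n, p->A_d, n, p->P, p->INFO, p->batchSize);
    cublasSgetriBatched(p->handle, n, (const float **)p->A_d, n, p->P, p->C_d, n, p->INFO, p->batchSize);
    for (int i = 0; i < p->batchSize; i++)
        cudaMemcpy(dst[i], p->C_dflat + i * n * n, n * n * sizeof(float), cudaMemcpyDeviceToHost);
}
```

A real version would also check the INFO array after each cublas call, as in the full example above.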

I see. Thanks again for your help

Is there any way to invert a single matrix of size 600x600 or larger?
I have searched various forums, and everything I find is either the batched inversion functions or a maximum inversion size of 32x32.

So, is there a way I can invert a single 600x600 matrix efficiently using the CUDA libraries?
Any help is highly appreciated.

Thanks

this document:

[url]https://developer.nvidia.com/sites/default/files/akamai/cuda/files/Misc/mygpu.pdf[/url]

describes how to do matrix inversion with MAGMA

Another approach would be to solve a system:

AX=I

where I is the identity matrix, using cusolver, such as the approaches outlined here:

[url]gpu - Solving dense linear systems AX = B with CUDA - Stack Overflow
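That AX=I approach might look roughly like this with cuSOLVER’s dense LU routines. A minimal sketch with no error checking, assuming column-major storage, and that d_A and d_X are existing device allocations of n*n doubles (d_A gets overwritten by its LU factors):

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Invert a single n x n matrix by LU-factorizing A on the device and
// then solving A*X = I for all n right-hand sides at once.
void invert_via_cusolver(double *d_A, double *d_X, int n)
{
    cusolverDnHandle_t cs;
    cusolverDnCreate(&cs);

    int lwork = 0, *d_ipiv, *d_info;
    cusolverDnDgetrf_bufferSize(cs, n, n, d_A, n, &lwork);

    double *d_work;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_ipiv, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));

    // Build the identity matrix on the host and copy it into d_X
    double *I = (double *)calloc(n * n, sizeof(double));
    for (int i = 0; i < n; i++) I[i * n + i] = 1.0;
    cudaMemcpy(d_X, I, n * n * sizeof(double), cudaMemcpyHostToDevice);
    free(I);

    // LU factorization in place, then one solve with n right-hand sides
    cusolverDnDgetrf(cs, n, n, d_A, n, d_work, d_ipiv, d_info);
    cusolverDnDgetrs(cs, CUBLAS_OP_N, n, n, d_A, n, d_ipiv, d_X, n, d_info);

    cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
    cusolverDnDestroy(cs);
}
```

In practice you would copy d_info back and check it after each call, since getrf reports singularity there.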

I think if you just google:

cuda matrix inverse

or

cusolver matrix inverse

You’ll find other examples/ideas

Thank you very much, txbob.
I saw those methods, but I had some issues with the MAGMA integration, and I was a little skeptical about the convergence of the iterative solver.
However, our study group found a method to compute the inverse using the SVD.
Since the SVD is already available in the CUDA libraries, we could compute the inverse using it plus some matrix multiplications.
We hesitated at first, though, since the SVD might take some time.
Thank you once again. I will try to implement all the methods mentioned and post the times taken.
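For reference, the SVD route described here rests on a standard textbook identity (not code from this thread): for an invertible A,

```latex
A = U \Sigma V^{\mathsf{T}}
\quad \Longrightarrow \quad
A^{-1} = V \, \Sigma^{-1} U^{\mathsf{T}}
```

Since Σ is diagonal, Σ⁻¹ just inverts the singular values, so after the SVD only the two matrix multiplications remain; the SVD itself dominates the cost, which is why it is usually slower than the LU-based approaches.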

Sorry, I am new to CUDA and I don’t understand why you don’t just transfer the data directly from A to A_d, but instead go through A_dflat (lines 53-55). I would appreciate it very much if you could help me with this.

Because the work is being done on a batch of matrices.

The API expects an array of pointers to the matrices. Therefore the copy is a two-step process. You’ll notice that what is being copied in line 53 is an array of pointers, not the matrix data itself.

In line 55 the matrix data is copied.

This is the usual two-step “deep copy” process in CUDA, and it is covered in other places:

[url]c++ - CUDA, Using 2D and 3D Arrays - Stack Overflow
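The two steps described above could be boiled down to a small helper like this; a sketch only, with no error checking, and the function name is made up for illustration:

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

// Two-step "deep copy" for a batch API: the pointer array and the matrix
// data are separate device allocations and are copied separately.
// Returns the device pointer array to hand to cublas<t>getrfBatched;
// *data_out receives the flat device storage backing it.
float **deep_copy_batch(float **host_matrices, int n, int batchSize, float **data_out)
{
    // Flat device storage for all the matrices
    float *data_d;
    cudaMalloc(&data_d, batchSize * n * n * sizeof(float));

    // Step 1: build the array of device pointers on the host and copy it
    // to the device -- note this copy is batchSize * sizeof(float *) bytes
    float **ptrs_h = (float **)malloc(batchSize * sizeof(float *));
    for (int i = 0; i < batchSize; i++)
        ptrs_h[i] = data_d + i * n * n;          // each entry points into data_d
    float **ptrs_d;
    cudaMalloc(&ptrs_d, batchSize * sizeof(float *));
    cudaMemcpy(ptrs_d, ptrs_h, batchSize * sizeof(float *), cudaMemcpyHostToDevice);
    free(ptrs_h);

    // Step 2: copy the actual matrix elements into the flat storage
    for (int i = 0; i < batchSize; i++)
        cudaMemcpy(data_d + i * n * n, host_matrices[i], n * n * sizeof(float), cudaMemcpyHostToDevice);

    *data_out = data_d;
    return ptrs_d;
}
```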

Thanks for the reply. So if inverting just one matrix, is the array of pointers not needed, and can I just transfer src to A_d?

No, that is not correct.

If you use the batch API, you must provide a pointer-to-pointer; that is the parameter type in the function prototype.
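Concretely (a tiny sketch, assuming n, P, INFO, and handle are already set up as in the example above): even with batchSize = 1, you still build a one-element device pointer array.

```cuda
// Even for a single n x n matrix, the batch API wants a device array
// containing one device pointer (batchSize = 1):
float *A_dflat;                        // storage for the one matrix
cudaMalloc(&A_dflat, n * n * sizeof(float));
float *one[1] = { A_dflat };           // host array holding a single device pointer
float **A_d;
cudaMalloc(&A_d, sizeof(float *));
cudaMemcpy(A_d, one, sizeof(float *), cudaMemcpyHostToDevice);
// ...copy the matrix into A_dflat, then:
// cublasSgetrfBatched(handle, n, A_d, n, P, INFO, 1);
```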

Yes, I read the documentation, and it does need an array of pointers.
So the two-step copy process is necessary. I need to understand it more deeply.
Thanks so much for the reply.