Can someone help me optimize some 3x3 matrix inverse code?

Hey guys,

I’m still kind of new to CUDA, but basically I need the 3x3 inverse of a matrix of floats. I figured the easiest and fastest way to do this would be to brute force it via cofactors. This way I’m not doing any LU stuff (lower–upper decomposition; I’d forgotten the real name of it). Either way, that approach only pays off for large matrices, and what I’m doing is a million 3x3 inverses instead of one inverse of a million-by-million array.

So, points are stored like this:

It is also worth noting that all points are stored in one contiguous array, allocated with cudaMalloc(). (A rough sketch of the allocation follows the structs below.)

struct point
{
   union
   {
      float p[3];
      struct { float x, y, z; };
   };
};

And these are held in a tetrahedron by:

struct tetra
{
    int p[4]; // vertices

    /* ... */
};
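In case it matters, the allocation is essentially this (a rough sketch; n_points and n_tetra are placeholder names, not my real variables):

void alloc_mesh(point **points, tetra **mesh, int n_points, int n_tetra)
{
    // one contiguous device array for each
    cudaMalloc((void **) points, n_points * sizeof(point));
    cudaMalloc((void **) mesh,   n_tetra  * sizeof(tetra));
}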

And here’s how I brute-force the 3x3 inverse:

__device__
inline float two_by_two_determinant(float *m )
{
    return m[0] * m[3] - m[1] * m[2];
}

__device__
inline float three_by_three_determinant(float *m)
{
    // 2x2 minors for the cofactor expansion along the first row
    float u[4] = { m[4], m[5], m[7], m[8] };
    float v[4] = { m[3], m[5], m[6], m[8] };
    float w[4] = { m[3], m[4], m[6], m[7] };

    return m[0] * two_by_two_determinant(u) -
           m[1] * two_by_two_determinant(v) +
           m[2] * two_by_two_determinant(w);
}

__device__
void three_by_three_inverse(float *m)
{
    // entries of the adjugate (transposed cofactor matrix), as 2x2 minors
    float a0[4] = { m[4], m[5], m[7], m[8] };
    float a1[4] = { m[2], m[1], m[8], m[7] };
    float a2[4] = { m[1], m[2], m[4], m[5] };
    float a3[4] = { m[5], m[3], m[8], m[6] };
    float a4[4] = { m[0], m[2], m[6], m[8] };
    float a5[4] = { m[2], m[0], m[5], m[3] };
    float a6[4] = { m[3], m[4], m[6], m[7] };
    float a7[4] = { m[1], m[0], m[7], m[6] };
    float a8[4] = { m[0], m[1], m[3], m[4] };

    // reciprocal of the determinant; note there is no guard here against a
    // singular (degenerate) matrix, where the determinant is zero
    float divisor = 1.0f / three_by_three_determinant(m);

    float n[9] = { two_by_two_determinant(a0),
                   two_by_two_determinant(a1),
                   two_by_two_determinant(a2),
                   two_by_two_determinant(a3),
                   two_by_two_determinant(a4),
                   two_by_two_determinant(a5),
                   two_by_two_determinant(a6),
                   two_by_two_determinant(a7),
                   two_by_two_determinant(a8)
                 };

    #pragma unroll
    for (int i = 0; i < 9; ++i) 
    {
        m[i] = n[i] * divisor;
    }
}
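By the way, a quick sanity check for the math is a throwaway kernel like this (not part of my real code); inverting diag(2, 4, 5) should give diag(0.5, 0.25, 0.2):

#include <cstdio>   // for device-side printf

__global__ void test_inverse()
{
    float m[9] = { 2.0f, 0.0f, 0.0f,
                   0.0f, 4.0f, 0.0f,
                   0.0f, 0.0f, 5.0f };

    three_by_three_inverse(m);

    // expect: 0.500000 0.250000 0.200000
    printf("%f %f %f\n", m[0], m[4], m[8]);
}

I launch it as test_inverse<<<1, 1>>>() from the host, followed by cudaDeviceSynchronize() so the printf actually flushes.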

The inverse function gets called from something like this:

__device__
int contain_test(tetra *t, point *points, int pid)
{
    float *a = (points + t->p[0])->p;
    float *b = (points + t->p[1])->p;
    float *c = (points + t->p[2])->p;
    float *d = (points + t->p[3])->p;

    float *e = (points + pid)->p;    

    float A[9] = { 0 };

    float u[3] = { b[0] - a[0], b[1] - a[1], b[2] - a[2] };
    float v[3] = { c[0] - a[0], c[1] - a[1], c[2] - a[2] };
    float w[3] = { d[0] - a[0], d[1] - a[1], d[2] - a[2] };

    // row-major fill: row j of A is (u[j], v[j], w[j]), so u, v, w end up
    // as the columns of A
    for (int j = 0; j < 3; ++j)
    {
        int offset = 3 * j;

        A[offset + 0] = u[j];
        A[offset + 1] = v[j];
        A[offset + 2] = w[j];
    }

    three_by_three_inverse(A);

    /* .. */
}
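The part I elided uses the inverse for a barycentric-style containment test. Roughly, it amounts to this (a loose sketch of the idea, not my exact code): after three_by_three_inverse(A), A holds the inverse, so s = A^-1 * (e - a) gives the coordinates, and the point is inside when every s[i] is non-negative and they sum to at most 1.

    float r[3] = { e[0] - a[0], e[1] - a[1], e[2] - a[2] };
    float s[3];

    // s = A^-1 * r, with A already inverted in place (row-major)
    for (int j = 0; j < 3; ++j)
        s[j] = A[3 * j + 0] * r[0] + A[3 * j + 1] * r[1] + A[3 * j + 2] * r[2];

    return s[0] >= 0.0f && s[1] >= 0.0f && s[2] >= 0.0f &&
           s[0] + s[1] + s[2] <= 1.0f;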

contain_test is called from the following global function:

__global__
void compute_distance(point *points, tetra *mesh, int *pa, int *ta, float *dist, int n, PredicateInfo *preds)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    
    if (i >= n)
        return;

    if (pa[i] < 0 )
        return;

    tetra *t = mesh + ta[i];
    //print_tetra(t);

    if (contain_test(t, points, pa[i]))
    {
       /* ... */
    }
}

I think everything is stored in row-major order, but I’m not entirely sure; it seems to work.

But I’m just curious: how do I approach optimizing something like this? I hope what I posted wasn’t too long, but I think this might be a serious bottleneck for me just due to how many reads I have.

The thing is, the memory accesses can be totally random, so I’m not sure I can coalesce anything except by sheer dumb luck.

I think shared memory might be a possibility, but I’m not sure how to fill the block. I know the split between shared memory and L1 cache is configurable per kernel (cudaFuncSetCacheConfig), so I might get a good performance boost there. If I did use shared memory, I’d most likely have to write some kind of attached memory manager, whereas I could very easily lean on the built-in cache instead just by swapping the split to prefer L1.

I’ve heard textures are a thing. Is this one of those cases where they might come in handy?
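From what I’ve read, on sm_35 you can also hit the read-only (texture) cache without binding textures at all, by marking pointers const __restrict__ or by using __ldg(). Is that the sort of thing you mean? Here’s an untested sketch of how I imagine the loads in contain_test would change (the __ldg calls are my guess, not code I’ve run):

__device__
int contain_test_ldg(const tetra *__restrict__ t,
                     const point *__restrict__ points, int pid)
{
    // pull the vertex indices through the read-only data cache explicitly;
    // with const __restrict__ the compiler will often do this on its own
    const float *a = points[__ldg(&t->p[0])].p;
    const float *b = points[__ldg(&t->p[1])].p;
    const float *c = points[__ldg(&t->p[2])].p;
    const float *d = points[__ldg(&t->p[3])].p;

    // the coordinate loads can go through __ldg() the same way, e.g.:
    float u[3] = { __ldg(&b[0]) - __ldg(&a[0]),
                   __ldg(&b[1]) - __ldg(&a[1]),
                   __ldg(&b[2]) - __ldg(&a[2]) };

    /* ... rest of the test unchanged ... */
    return 0;   // placeholder
}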

Do you want to try the CUBLAS cublasSgetrfBatched/cublasSgetriBatched method? It’s designed for this type of thing (inverting a large batch of small matrices). A batch of 3x3 matrices might be small even for that; however, it might be a useful starting point to benchmark your code against.
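If it helps, the host-side plumbing for the batched routines looks roughly like this; they want a device array of per-matrix device pointers, and cuBLAS assumes column-major storage (which doesn’t matter for the factor/invert round trip itself, but does for how you read the result). A minimal sketch, all names mine and untested against your setup:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 3, batch = 2;

    // two 3x3 matrices packed back to back: 2*I and 4*I
    float hA[batch * n * n] = { 2, 0, 0, 0, 2, 0, 0, 0, 2,
                                4, 0, 0, 0, 4, 0, 0, 0, 4 };

    float *dA, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dC, sizeof(hA));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    // device arrays of per-matrix pointers, one for input, one for output
    float *hAp[batch], *hCp[batch];
    for (int i = 0; i < batch; ++i) {
        hAp[i] = dA + i * n * n;
        hCp[i] = dC + i * n * n;
    }
    float **dAp, **dCp;
    cudaMalloc(&dAp, sizeof(hAp));
    cudaMalloc(&dCp, sizeof(hCp));
    cudaMemcpy(dAp, hAp, sizeof(hAp), cudaMemcpyHostToDevice);
    cudaMemcpy(dCp, hCp, sizeof(hCp), cudaMemcpyHostToDevice);

    // pivots (n per matrix) and per-matrix status flags
    int *dPiv, *dInfo;
    cudaMalloc(&dPiv,  batch * n * sizeof(int));
    cudaMalloc(&dInfo, batch * sizeof(int));

    cublasHandle_t h;
    cublasCreate(&h);

    // LU-factor each matrix in place, then form the inverses in dC
    cublasSgetrfBatched(h, n, dAp, n, dPiv, dInfo, batch);
    cublasSgetriBatched(h, n, (const float **) dAp, n, dPiv, dCp, n, dInfo, batch);

    float hC[batch * n * n];
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("%f %f\n", hC[0], hC[n * n]);   // expect 0.5 and 0.25

    cublasDestroy(h);
    return 0;   // (cudaFree cleanup omitted for brevity)
}

That one is plain host code, so it should build with just nvcc t.cu -lcublas, no relocatable device code needed.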

Ooh, that’s a really good idea. I actually haven’t really used CUBLAS.

Don’t tell anyone, but for some reason I’ve had issues using those routines in the past. Do you happen to have a small code example of something that just works, as in, just compiles? I’m not sure how to link the CUBLAS stuff into my code.

This thread links to one example:

https://devtalk.nvidia.com/default/topic/761729/gpu-accelerated-libraries/cublasdgetrfbatched-and-cublasdgetribatched/

Thank you!!!

Alright, so I keep getting errors about undefined references to the cuBLAS functions I’m calling.

Here’s how I’m compiling:

nvcc -O3 -lstdc++ -lcublas -lcudart -lcublas_device.a -arch sm_35 --relocatable-device-code true

And here’s the code:

/*
    Returns whether a point is contained
    inside a particular tetrahedron
*/

__device__
int contain_test(tetra *t, point *points, int pid)
{
    float *a = (points + t->p[0])->p;
    float *b = (points + t->p[1])->p;
    float *c = (points + t->p[2])->p;
    float *d = (points + t->p[3])->p;

    float *e = (points + pid)->p;    


    float A[9] = { 0 };

    float u[3] = { b[0] - a[0], b[1] - a[1], b[2] - a[2] };
    float v[3] = { c[0] - a[0], c[1] - a[1], c[2] - a[2] };
    float w[3] = { d[0] - a[0], d[1] - a[1], d[2] - a[2] };

    for (int j = 0; j < 3; ++j)
    {
        int offset = 3 * j;

        A[offset + 0] = u[j];
        A[offset + 1] = v[j];
        A[offset + 2] = w[j];
    }
    cublasHandle_t handle;
    cublasCreate(&handle);

    int p[3], info[1];
    float *test[] = { A };

    cublasSgetrfBatched(handle, 3, test, 3, p, info, 1);

    /* ... */
}

Suggestions to write effective questions asking for help:

  1. Please provide a short, complete code that demonstrates the issue, not a snippet. I’m not asking for pages and pages of code; you should be able to create a complete reproducer of this compile error using not many more lines than what you have shown already. But leaving off things like #include files means we have to have this back-and-forth dialog, which is inefficient and time-consuming. (A minimal example follows this list.)

  2. Also, when you have trouble like this, don’t describe the error; paste the actual error text into the question. In fact, provide a complete session transcript just like the one shown in the example I linked (it started with cat t340.cu, i.e. a listing of the source file, then the compile command, then the output). This isn’t hard to do, and it will make you a much better writer of questions asking for help.
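For instance, a complete reproducer for a device-side cuBLAS link problem can be as small as this (a hypothetical t_repro.cu, not your actual code):

#include <cstdio>
#include <cublas_v2.h>

__global__ void k()
{
    // calling any cuBLAS device routine is enough to exercise the link step
    cublasHandle_t h;
    cublasStatus_t s = cublasCreate(&h);
    printf("status: %d\n", (int) s);
}

int main()
{
    k<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Post that, the exact compile command, and the exact output, and the problem is usually obvious to whoever reads it.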

Do you have

#include <cublas_v2.h>

somewhere ?

Lol, I realized I forgot to post that. Yes, I do have #include <cublas_v2.h> in there.

Here’s the actual error:
nvlink error : Undefined reference to 'cublasCreate_v2' in 'tetra.o'
nvlink error : Undefined reference to 'cublasSgetrfBatched' in 'tetra.o'

I’m not sure how much more info you need, but is it because I’m trying to call cuBLAS routines from the device?

OK so it’s an error at link time, not compile time.

Are you on a Mac?

This doesn’t look quite right:

nvcc -O3 -lstdc++ -lcublas -lcudart -lcublas_device.a -arch sm_35 --relocatable-device-code true

I think you should specify

-lcublas_device

not

-lcublas_device.a

for the device library. And I think you are missing:

-lcudadevrt

which is also needed for dynamic parallelism.

-lcudart

should be unnecessary; nvcc takes care of that by default.
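Putting that together, I’d expect a command along these lines to link (tetra.cu stands in for whatever your source file is actually called):

nvcc -O3 -arch=sm_35 -rdc=true tetra.cu -o tetra -lcublas_device -lcudadevrt -lcublas

(-rdc=true is just shorthand for --relocatable-device-code true.)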

This CUDA sample code (and its makefile) shows how to build an application that uses the cuBLAS library from device code:

CUDA Samples :: CUDA Toolkit Documentation

Hey, thanks for all the help.

I finally got it all to compile.