Can someone help me optimize some 3x3 matrix inverse code?

Hey guys,

I’m still kind of new to CUDA, but basically I need the 3x3 inverse of a matrix of floats. I figured the easiest and fastest way to do this would be to brute force it via cofactors. This way I’m not doing any LU stuff (lower–upper decomposition; I’d forgotten the real name of it). Either way, that approach only pays off for large matrices, and what I’m doing is a million 3x3 inverses instead of one inverse of a million-by-million array.

So, points are stored like this:

It is also worth noting that all points are stored in one contiguous array, allocated with cudaMalloc(). (A rough sketch of the allocation follows the structs below.)

struct point
{
   union
   {
      float p[3];
      struct { float x, y, z; };
   };
};

And these are held in a tetrahedron by:

struct tetra
{
    int p[4]; // vertices

    /* ... */
};
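In case it matters, the allocation is essentially this (a rough sketch; n_points and n_tetra are placeholder names, not my real variables):

void alloc_mesh(point **points, tetra **mesh, int n_points, int n_tetra)
{
    // one contiguous device array for each
    cudaMalloc((void **) points, n_points * sizeof(point));
    cudaMalloc((void **) mesh,   n_tetra  * sizeof(tetra));
}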

And here’s how I brute-force the 3x3 inverse:

__device__
inline float two_by_two_determinant(float *m )
{
    return m[0] * m[3] - m[1] * m[2];
}

__device__
inline float three_by_three_determinant(float *m)
{
    // 2x2 minors for the cofactor expansion along the first row
    float u[4] = { m[4], m[5], m[7], m[8] };
    float v[4] = { m[3], m[5], m[6], m[8] };
    float w[4] = { m[3], m[4], m[6], m[7] };

    return m[0] * two_by_two_determinant(u) -
           m[1] * two_by_two_determinant(v) +
           m[2] * two_by_two_determinant(w);
}

__device__
void three_by_three_inverse(float *m)
{
    // entries of the adjugate (transposed cofactor matrix), as 2x2 minors
    float a0[4] = { m[4], m[5], m[7], m[8] };
    float a1[4] = { m[2], m[1], m[8], m[7] };
    float a2[4] = { m[1], m[2], m[4], m[5] };
    float a3[4] = { m[5], m[3], m[8], m[6] };
    float a4[4] = { m[0], m[2], m[6], m[8] };
    float a5[4] = { m[2], m[0], m[5], m[3] };
    float a6[4] = { m[3], m[4], m[6], m[7] };
    float a7[4] = { m[1], m[0], m[7], m[6] };
    float a8[4] = { m[0], m[1], m[3], m[4] };

    // reciprocal of the determinant; note there is no guard here against a
    // singular (degenerate) matrix, where the determinant is zero
    float divisor = 1.0f / three_by_three_determinant(m);

    float n[9] = { two_by_two_determinant(a0),
                   two_by_two_determinant(a1),
                   two_by_two_determinant(a2),
                   two_by_two_determinant(a3),
                   two_by_two_determinant(a4),
                   two_by_two_determinant(a5),
                   two_by_two_determinant(a6),
                   two_by_two_determinant(a7),
                   two_by_two_determinant(a8)
                 };

    #pragma unroll
    for (int i = 0; i < 9; ++i) 
    {
        m[i] = n[i] * divisor;
    }
}
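By the way, a quick sanity check for the math is a throwaway kernel like this (not part of my real code); inverting diag(2, 4, 5) should give diag(0.5, 0.25, 0.2):

#include <cstdio>   // for device-side printf

__global__ void test_inverse()
{
    float m[9] = { 2.0f, 0.0f, 0.0f,
                   0.0f, 4.0f, 0.0f,
                   0.0f, 0.0f, 5.0f };

    three_by_three_inverse(m);

    // expect: 0.500000 0.250000 0.200000
    printf("%f %f %f\n", m[0], m[4], m[8]);
}

I launch it as test_inverse<<<1, 1>>>() from the host, followed by cudaDeviceSynchronize() so the printf actually flushes.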

The inverse function gets called from something like this:

__device__
int contain_test(tetra *t, point *points, int pid)
{
    float *a = (points + t->p[0])->p;
    float *b = (points + t->p[1])->p;
    float *c = (points + t->p[2])->p;
    float *d = (points + t->p[3])->p;

    float *e = (points + pid)->p;    

    float A[9] = { 0 };

    float u[3] = { b[0] - a[0], b[1] - a[1], b[2] - a[2] };
    float v[3] = { c[0] - a[0], c[1] - a[1], c[2] - a[2] };
    float w[3] = { d[0] - a[0], d[1] - a[1], d[2] - a[2] };

    // row-major fill: row j of A is (u[j], v[j], w[j]), so u, v, w end up
    // as the columns of A
    for (int j = 0; j < 3; ++j)
    {
        int offset = 3 * j;

        A[offset + 0] = u[j];
        A[offset + 1] = v[j];
        A[offset + 2] = w[j];
    }

    three_by_three_inverse(A);

    /* .. */
}
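The part I elided uses the inverse for a barycentric-style containment test. Roughly, it amounts to this (a loose sketch of the idea, not my exact code): after three_by_three_inverse(A), A holds the inverse, so s = A^-1 * (e - a) gives the coordinates, and the point is inside when every s[i] is non-negative and they sum to at most 1.

    float r[3] = { e[0] - a[0], e[1] - a[1], e[2] - a[2] };
    float s[3];

    // s = A^-1 * r, with A already inverted in place (row-major)
    for (int j = 0; j < 3; ++j)
        s[j] = A[3 * j + 0] * r[0] + A[3 * j + 1] * r[1] + A[3 * j + 2] * r[2];

    return s[0] >= 0.0f && s[1] >= 0.0f && s[2] >= 0.0f &&
           s[0] + s[1] + s[2] <= 1.0f;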

contain_test is called from the following global function:

__global__
void compute_distance(point *points, tetra *mesh, int *pa, int *ta, float *dist, int n, PredicateInfo *preds)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    
    if (i >= n)
        return;

    if (pa[i] < 0 )
        return;

    tetra *t = mesh + ta[i];
    //print_tetra(t);

    if (contain_test(t, points, pa[i]))
    {
       /* ... */
    }
}

I think everything is stored in row-major order, but I’m not entirely sure; it seems to work.

But I’m just curious: how do I approach optimizing something like this? I hope what I posted wasn’t too long, but I think this might be a serious bottleneck for me just due to how many reads I have.

The thing is, the memory accesses can be totally random, so I’m not sure I can coalesce anything except by sheer dumb luck.

I think shared memory might be a possibility, but I’m not sure how to fill the block. I know the split between shared memory and L1 cache is configurable per kernel (cudaFuncSetCacheConfig), so I might get a good performance boost there. If I did use shared memory, I’d most likely have to write some kind of attached memory manager, whereas I could very easily lean on the built-in cache instead just by swapping the split to prefer L1.

I’ve heard textures are a thing. Is this one of those cases where they might come in handy?
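From what I’ve read, on sm_35 you can also hit the read-only (texture) cache without binding textures at all, by marking pointers const __restrict__ or by using __ldg(). Is that the sort of thing you mean? Here’s an untested sketch of how I imagine the loads in contain_test would change (the __ldg calls are my guess, not code I’ve run):

__device__
int contain_test_ldg(const tetra *__restrict__ t,
                     const point *__restrict__ points, int pid)
{
    // pull the vertex indices through the read-only data cache explicitly;
    // with const __restrict__ the compiler will often do this on its own
    const float *a = points[__ldg(&t->p[0])].p;
    const float *b = points[__ldg(&t->p[1])].p;
    const float *c = points[__ldg(&t->p[2])].p;
    const float *d = points[__ldg(&t->p[3])].p;

    // the coordinate loads can go through __ldg() the same way, e.g.:
    float u[3] = { __ldg(&b[0]) - __ldg(&a[0]),
                   __ldg(&b[1]) - __ldg(&a[1]),
                   __ldg(&b[2]) - __ldg(&a[2]) };

    /* ... rest of the test unchanged ... */
    return 0;   // placeholder
}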

Do you want to try the CUBLAS cublasSgetrfBatched/cublasSgetriBatched method? It’s designed for this type of thing (inverting a large batch of small matrices). A batch of 3x3 matrices might be small even for that; however, it might be a useful starting point to benchmark your code against.
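If it helps, the host-side plumbing for the batched routines looks roughly like this; they want a device array of per-matrix device pointers, and cuBLAS assumes column-major storage (which doesn’t matter for the factor/invert round trip itself, but does for how you read the result). A minimal sketch, all names mine and untested against your setup:

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main()
{
    const int n = 3, batch = 2;

    // two 3x3 matrices packed back to back: 2*I and 4*I
    float hA[batch * n * n] = { 2, 0, 0, 0, 2, 0, 0, 0, 2,
                                4, 0, 0, 0, 4, 0, 0, 0, 4 };

    float *dA, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dC, sizeof(hA));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);

    // device arrays of per-matrix pointers, one for input, one for output
    float *hAp[batch], *hCp[batch];
    for (int i = 0; i < batch; ++i) {
        hAp[i] = dA + i * n * n;
        hCp[i] = dC + i * n * n;
    }
    float **dAp, **dCp;
    cudaMalloc(&dAp, sizeof(hAp));
    cudaMalloc(&dCp, sizeof(hCp));
    cudaMemcpy(dAp, hAp, sizeof(hAp), cudaMemcpyHostToDevice);
    cudaMemcpy(dCp, hCp, sizeof(hCp), cudaMemcpyHostToDevice);

    // pivots (n per matrix) and per-matrix status flags
    int *dPiv, *dInfo;
    cudaMalloc(&dPiv,  batch * n * sizeof(int));
    cudaMalloc(&dInfo, batch * sizeof(int));

    cublasHandle_t h;
    cublasCreate(&h);

    // LU-factor each matrix in place, then form the inverses in dC
    cublasSgetrfBatched(h, n, dAp, n, dPiv, dInfo, batch);
    cublasSgetriBatched(h, n, (const float **) dAp, n, dPiv, dCp, n, dInfo, batch);

    float hC[batch * n * n];
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("%f %f\n", hC[0], hC[n * n]);   // expect 0.5 and 0.25

    cublasDestroy(h);
    return 0;   // (cudaFree cleanup omitted for brevity)
}

That one is plain host code, so it should build with just nvcc t.cu -lcublas, no relocatable device code needed.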

Ooh, that’s a really good idea. I actually haven’t really used CUBLAS.

Don’t tell anyone, but for some reason I’ve had issues using those routines in the past. Do you happen to have a small code example of something that just works, as in, just compiles? I’m not sure how to link the CUBLAS stuff into my code.

This thread links to one example:

https://devtalk.nvidia.com/default/topic/761729/gpu-accelerated-libraries/cublasdgetrfbatched-and-cublasdgetribatched/

Thank you!!!

Alright, so I keep getting errors about undefined references to the cuBLAS functions I’m calling.

Here’s how I’m compiling:

nvcc -O3 -lstdc++ -lcublas -lcudart -lcublas_device.a -arch sm_35 --relocatable-device-code true

And here’s the code:

/*
    Returns whether a point is contained
    inside a particular tetrahedron
*/

__device__
int contain_test(tetra *t, point *points, int pid)
{
    float *a = (points + t->p[0])->p;
    float *b = (points + t->p[1])->p;
    float *c = (points + t->p[2])->p;
    float *d = (points + t->p[3])->p;

    float *e = (points + pid)->p;    


    float A[9] = { 0 };

    float u[3] = { b[0] - a[0], b[1] - a[1], b[2] - a[2] };
    float v[3] = { c[0] - a[0], c[1] - a[1], c[2] - a[2] };
    float w[3] = { d[0] - a[0], d[1] - a[1], d[2] - a[2] };

    for (int j = 0; j < 3; ++j)
    {
        int offset = 3 * j;

        A[offset + 0] = u[j];
        A[offset + 1] = v[j];
        A[offset + 2] = w[j];
    }
    cublasHandle_t handle;
    cublasCreate(&handle);

    int p[3], info[1];
    float *test[] = { A };

    cublasSgetrfBatched(handle, 3, test, 3, p, info, 1);

    /* ... */
}

Suggestions to write effective questions asking for help:

  1. Please provide a short, complete code that demonstrates the issue, not a snippet. I’m not asking for pages and pages of code; you should be able to create a complete reproducer of this compile error using not many more lines than what you have shown already. But leaving off things like #include files means we have to have this back-and-forth dialog, which is inefficient and time-consuming. (A minimal example follows this list.)

  2. Also, when you have trouble like this, don’t describe the error; paste the actual error text into the question. In fact, provide a complete session transcript just like the one shown in the example I linked (it started with cat t340.cu, i.e. a listing of the source file, then the compile command, then the output). This isn’t hard to do, and it will make you a much better writer of questions asking for help.
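For instance, a complete reproducer for a device-side cuBLAS link problem can be as small as this (a hypothetical t_repro.cu, not your actual code):

#include <cstdio>
#include <cublas_v2.h>

__global__ void k()
{
    // calling any cuBLAS device routine is enough to exercise the link step
    cublasHandle_t h;
    cublasStatus_t s = cublasCreate(&h);
    printf("status: %d\n", (int) s);
}

int main()
{
    k<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Post that, the exact compile command, and the exact output, and the problem is usually obvious to whoever reads it.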

Do you have

#include <cublas_v2.h>

somewhere ?

Lol, I realized I forgot to post that. Yes, I do have #include <cublas_v2.h> in there.

Here’s the actual error:
nvlink error : Undefined reference to 'cublasCreate_v2' in 'tetra.o'
nvlink error : Undefined reference to 'cublasSgetrfBatched' in 'tetra.o'

I’m not sure how much more info you need, but is it because I’m trying to call cuBLAS routines from the device?

OK so it’s an error at link time, not compile time.

Are you on a Mac?

This doesn’t look quite right:

nvcc -O3 -lstdc++ -lcublas -lcudart -lcublas_device.a -arch sm_35 --relocatable-device-code true

I think you should specify

-lcublas_device

not

-lcublas_device.a

for the device library. And I think you are missing:

-lcudadevrt

which is also needed for dynamic parallelism.

-lcudart

should be unnecessary; nvcc takes care of that by default.
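Putting that together, I’d expect a command along these lines to link (tetra.cu stands in for whatever your source file is actually called):

nvcc -O3 -arch=sm_35 -rdc=true tetra.cu -o tetra -lcublas_device -lcudadevrt -lcublas

(-rdc=true is just shorthand for --relocatable-device-code true.)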

This CUDA sample code (and its makefile) shows how to build an application that uses the cuBLAS library from device code:

CUDA Samples :: CUDA Toolkit Documentation

Hey, thanks for all the help.

I finally got it all to compile.