Hey guys,

I’m still kind of new to CUDA. But basically, I need the 3x3 inverse of a matrix of floats. I figured the easiest and fastest way to do this would be to brute force it. This way I’m not doing any LU stuff. I actually forgot the real name of it. Is it a left upper diagonalization? Either way, that is only good for large matrices when what I’m doing is a million 3x3 inverses instead of one inverse on a million by million array.

So, points are stored like this :

It is also worth noting that all points are stored in one contiguous array, created by CudaMalloc().

```
struct point
{
union
{
float p[3];
struct { float x, y, z; };
};
};
```

And these are held in a tetrahedron by :

```
struct tetra
{
int p[4]; // vertices
/* ... */
};
```

And here’s how I brute-force the 3x3 inverse :

```
__device__
inline float two_by_two_determinant(float *m )
{
return m[0] * m[3] - m[1] * m[2];
}
__device__
inline float three_by_three_determinant(float *m)
{
float u[4] = { m[4], m[5], m[7], m[8] };
float v[4] = { m[3], m[5], m[6], m[8] };
float w[4] = { m[3], m[4], m[6], m[7] };
return m[0] * two_by_two_determinant(u) -
m[1] * two_by_two_determinant(v) +
m[2] * two_by_two_determinant(w);
}
__device__
void three_by_three_inverse(float *m)
{
float a0[4] = { m[4], m[5], m[7], m[8] };
float a1[4] = { m[2], m[1], m[8], m[7] };
float a2[4] = { m[1], m[2], m[4], m[5] };
float a3[4] = { m[5], m[3], m[8], m[6] };
float a4[4] = { m[0], m[2], m[6], m[8] };
float a5[4] = { m[2], m[0], m[5], m[3] };
float a6[4] = { m[3], m[4], m[6], m[7] };
float a7[4] = { m[1], m[0], m[7], m[6] };
float a8[4] = { m[0], m[1], m[3], m[4] };
float divisor = 1 / three_by_three_determinant(m);
float n[9] = { two_by_two_determinant(a0),
two_by_two_determinant(a1),
two_by_two_determinant(a2),
two_by_two_determinant(a3),
two_by_two_determinant(a4),
two_by_two_determinant(a5),
two_by_two_determinant(a6),
two_by_two_determinant(a7),
two_by_two_determinant(a8)
};
#pragma unroll
for (int i = 0; i < 9; ++i)
{
m[i] = n[i] * divisor;
}
}
```

This function is called with something like this :

```
__device__
int contain_test(tetra *t, point *points, int pid)
{
float *a = (points + t->p[0])->p;
float *b = (points + t->p[1])->p;
float *c = (points + t->p[2])->p;
float *d = (points + t->p[3])->p;
float *e = (points + pid)->p;
float A[9] = { 0 };
float u[3] = { b[0] - a[0], b[1] - a[1], b[2] - a[2] };
float v[3] = { c[0] - a[0], c[1] - a[1], c[2] - a[2] };
float w[3] = { d[0] - a[0], d[1] - a[1], d[2] - a[2] };
for (int j = 0; j < 3; ++j)
{
int offset = 3 * j;
A[offset + 0] = u[j];
A[offset + 1] = v[j];
A[offset + 2] = w[j];
}
three_by_three_inverse(A);
/* .. */
}
This is called by the global function :
[code]
__global__
void compute_distance(point *points, tetra *mesh, int *pa, int *ta, float *dist, int n, PredicateInfo *preds)
{
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i >= n)
return;
if (pa[i] < 0 )
return;
tetra *t = mesh + ta[i];
//print_tetra(t);
if (contain_test(t, points, pa[i]))
{
/* ... */
```

I think everything was stored successfully in a row-major order but I’m not entirely sure. Seems like it worked.

But I’m just curious, how do I approach optimizing something like this? I hope what I posted wasn’t too long. But I think this might be a serious bottleneck for me just due to how man reads I have.

The thing is, the memory accesses can be totally random so I’m not sure if I can coalesce anything unless by sheer dumb luck.

I think shared memory might be a possibility but I’m not sure how to fill the block. I know I can swap shared block size and on-board cache memory so I think I might get a good performance boost there. If I did use shared memory, I’d most likely have to write an attached memory manager and I could very easily use the built-in one by swapping the size of the spaces.

I’ve heard textures were a thing. Is this one of those things where it might come in handy?