# Fast 3x3 matrix inversion

I have some matrix code that unwraps fringe-image phase and then does a matrix inversion per point. This needs to be as fast as possible: there are 76800 matrices (one per point of a 320 × 240 point cloud) to invert and then multiply by a vector.

The following code does this; the style is plain C array math.
Would converting this to CUDA intrinsics make it faster, or is the compiler already generating SIMD instructions?

```cpp
// Mc and Mp are each 3 rows by 4 columns float.
// A is 3x3 float, B is 3x1 float.
// uc and vc are float
A[0][0] = dt->Mc[2][0] * uc - dt->Mc[0][0];
A[0][1] = dt->Mc[2][1] * uc - dt->Mc[0][1];
A[0][2] = dt->Mc[2][2] * uc - dt->Mc[0][2];
A[1][0] = dt->Mc[2][0] * vc - dt->Mc[1][0];
A[1][1] = dt->Mc[2][1] * vc - dt->Mc[1][1];
A[1][2] = dt->Mc[2][2] * vc - dt->Mc[1][2];
A[2][0] = dt->Mp[2][0] * uphase - dt->Mp[0][0];
A[2][1] = dt->Mp[2][1] * uphase - dt->Mp[0][1];
A[2][2] = dt->Mp[2][2] * uphase - dt->Mp[0][2];

B[0] = dt->Mc[0][3] - dt->Mc[2][3] * uc;
B[1] = dt->Mc[1][3] - dt->Mc[2][3] * vc;
B[2] = dt->Mp[0][3] - dt->Mp[2][3] * uphase;

// A is 3 x 3, B is 3 x 1, XYZ is 3 x 1
// XYZ = A.inv() * B;
// Invert A by cofactor expansion (adjugate / determinant)
const float det =
    A[0][0] * (A[1][1] * A[2][2] - A[2][1] * A[1][2]) -
    A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0]) +
    A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]);
const float invdet = 1.0f / det;
Ainv[0][0] = (A[1][1] * A[2][2] - A[2][1] * A[1][2]) * invdet;
Ainv[0][1] = (A[0][2] * A[2][1] - A[0][1] * A[2][2]) * invdet;
Ainv[0][2] = (A[0][1] * A[1][2] - A[0][2] * A[1][1]) * invdet;
Ainv[1][0] = (A[1][2] * A[2][0] - A[1][0] * A[2][2]) * invdet;
Ainv[1][1] = (A[0][0] * A[2][2] - A[0][2] * A[2][0]) * invdet;
Ainv[1][2] = (A[1][0] * A[0][2] - A[0][0] * A[1][2]) * invdet;
Ainv[2][0] = (A[1][0] * A[2][1] - A[2][0] * A[1][1]) * invdet;
Ainv[2][1] = (A[2][0] * A[0][1] - A[0][0] * A[2][1]) * invdet;
Ainv[2][2] = (A[0][0] * A[1][1] - A[1][0] * A[0][1]) * invdet;

const float X = Ainv[0][0] * B[0] + Ainv[0][1] * B[1] + Ainv[0][2] * B[2];
const float Y = Ainv[1][0] * B[0] + Ainv[1][1] * B[1] + Ainv[1][2] * B[2];
const float Z = Ainv[2][0] * B[0] + Ainv[2][1] * B[1] + Ainv[2][2] * B[2];

dt->xyzDataOut[idx].Pos[0] = (xflip ? -X : X) + xo;
dt->xyzDataOut[idx].Pos[1] = (yflip ? -Y : Y) + yo;
dt->xyzDataOut[idx].Pos[2] = (zflip ? -Z : Z) + zo;
```

The kernel launch looks like:

```cpp
phase2XYZ_global<<<2, 512>>>(uwdp);
```

where `uwdp` is the unwrap-parameters structure.
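As an aside (my own observation, not from the thread): `<<<2, 512>>>` launches only 1024 threads, while there are 76800 points to process. A common pattern is to size the grid from the data count so every point gets a thread. A minimal sketch of that arithmetic, with illustrative names (`blocksFor` is not from the original code):

```cpp
#include <cassert>  // for the self-check below

// Hypothetical sizing helper: one thread per point, grid rounded up
// so the whole data set is covered.
static int blocksFor(int numPoints, int threadsPerBlock) {
    return (numPoints + threadsPerBlock - 1) / threadsPerBlock;
}

// Illustrative launch (256 is a typical block size; tune per GPU):
//   phase2XYZ_global<<<blocksFor(320 * 240, 256), 256>>>(uwdp);
```

Threads whose index falls past the last point would simply return early inside the kernel.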

You mean this is already part of some CUDA kernel code you have written? Or that you are thinking of writing some CUDA kernel code like this? Or something else?

That is a confusing question to me. Which intrinsics were you thinking of, specifically? There aren’t any CUDA SIMD instructions that work on `float` data. The majority of CUDA mathematical intrinsics are documented here.

For batch 3x3 matrix inversion you can use cuBLAS (recommended), or, if you prefer to write your own CUDA kernel, this may be of interest.
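If you go the hand-written-kernel route, the core of each thread's work is the cofactor-expansion (adjugate over determinant) inverse. A minimal sketch of that math as a plain C++ function, so it can be tested on the CPU; marking it `__host__ __device__` would let the same code run inside a kernel. Function and parameter names are my own, not from the poster's code:

```cpp
#include <cassert>  // for the self-check below

// Invert a 3x3 matrix by cofactor expansion: Ainv = adj(A) / det(A).
// No singularity check, matching the style of the original code.
static void invert3x3(const float A[3][3], float Ainv[3][3]) {
    const float det =
        A[0][0] * (A[1][1] * A[2][2] - A[2][1] * A[1][2]) -
        A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0]) +
        A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]);
    const float invdet = 1.0f / det;
    Ainv[0][0] = (A[1][1] * A[2][2] - A[2][1] * A[1][2]) * invdet;
    Ainv[0][1] = (A[0][2] * A[2][1] - A[0][1] * A[2][2]) * invdet;
    Ainv[0][2] = (A[0][1] * A[1][2] - A[0][2] * A[1][1]) * invdet;
    Ainv[1][0] = (A[1][2] * A[2][0] - A[1][0] * A[2][2]) * invdet;
    Ainv[1][1] = (A[0][0] * A[2][2] - A[0][2] * A[2][0]) * invdet;
    Ainv[1][2] = (A[1][0] * A[0][2] - A[0][0] * A[1][2]) * invdet;
    Ainv[2][0] = (A[1][0] * A[2][1] - A[2][0] * A[1][1]) * invdet;
    Ainv[2][1] = (A[2][0] * A[0][1] - A[0][0] * A[2][1]) * invdet;
    Ainv[2][2] = (A[0][0] * A[1][1] - A[1][0] * A[0][1]) * invdet;
}
```

One thread per matrix, with each thread running this on values held in registers, is the usual approach for matrices this small.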


This is code that I have written, and it works. I am just trying to get the time down from 42 milliseconds to below 20 milliseconds.

I have written high-performance CUDA code for batch inversion of tiny `float` matrices before. Among other things:
(3) I assigned all matrix elements to `float` scalars at the start and assigned back to the matrix at the end.
(4) I maximized the use of `fmaf()` and coded it explicitly.