Fast 3x3 matrix inversion etc.

I have some code that unwraps fringe-image phase and then does a matrix inversion per point. This needs to be as fast as possible: there are 76,800 matrices (one per point of a 320 x 240 point cloud) to invert and then multiply by a vector.

The following code does this, written as plain C array math.
Would converting this to CUDA intrinsics make it faster, or is the compiler already generating SIMD instructions?

// Mc and Mp are each 3 x 4 float matrices, stored row-major.
// A is 3 x 3 float, B is 3 x 1 float.
// uc, vc, and uphase are floats.
A[0] = dt->Mc[8] * uc - dt->Mc[0];
A[1] = dt->Mc[9] * uc - dt->Mc[1];
A[2] = dt->Mc[10] * uc - dt->Mc[2];
A[3] = dt->Mc[8] * vc - dt->Mc[4];
A[4] = dt->Mc[9] * vc - dt->Mc[5];
A[5] = dt->Mc[10] * vc - dt->Mc[6];
A[6] = dt->Mp[8] * uphase - dt->Mp[0];
A[7] = dt->Mp[9] * uphase - dt->Mp[1];
A[8] = dt->Mp[10] * uphase - dt->Mp[2];

B[0] = dt->Mc[3] - dt->Mc[11] * uc;
B[1] = dt->Mc[7] - dt->Mc[11] * vc;
B[2] = dt->Mp[3] - dt->Mp[11] * uphase;

// A is 3 x 3, B is 3 x 1, XYZ is 3 x 1
// XYZ = A.inv() * B;
// Invert A via the cofactor (adjugate) method; note there is no guard against det == 0.
const float det =
	A[0] * (A[4] * A[8] - A[7] * A[5]) -
	A[1] * (A[3] * A[8] - A[5] * A[6]) +
	A[2] * (A[3] * A[7] - A[4] * A[6]);
const float invdet = 1.0f / det;
Ainv[0] = (A[4] * A[8] - A[7] * A[5]) * invdet;
Ainv[1] = (A[2] * A[7] - A[1] * A[8]) * invdet;
Ainv[2] = (A[1] * A[5] - A[2] * A[4]) * invdet;
Ainv[3] = (A[5] * A[6] - A[3] * A[8]) * invdet;
Ainv[4] = (A[0] * A[8] - A[2] * A[6]) * invdet;
Ainv[5] = (A[3] * A[2] - A[0] * A[5]) * invdet;
Ainv[6] = (A[3] * A[7] - A[6] * A[4]) * invdet;
Ainv[7] = (A[6] * A[1] - A[0] * A[7]) * invdet;
Ainv[8] = (A[0] * A[4] - A[3] * A[1]) * invdet;

const float X = Ainv[0] * B[0] + Ainv[1] * B[1] + Ainv[2] * B[2];
const float Y = Ainv[3] * B[0] + Ainv[4] * B[1] + Ainv[5] * B[2];
const float Z = Ainv[6] * B[0] + Ainv[7] * B[1] + Ainv[8] * B[2];

dt->xyzDataOut[idx].Pos[0] = (xflip ? -X : X) + xo;
dt->xyzDataOut[idx].Pos[1] = (yflip ? -Y : Y) + yo;
dt->xyzDataOut[idx].Pos[2] = (zflip ? -Z : Z) + zo;

The kernel launch looks like

phase2XYZ_global <<<2, 512 >>> (uwdp);

where uwdp is the unwrap-parameters structure.
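For what it's worth, 2 x 512 is only 1,024 threads, so each thread has to cover about 75 of the 76,800 points. A simplified sketch of how such a loop is commonly structured (stand-in names, not my exact code; UnwrapParams, NUM_POINTS, and processPoint() are assumptions):

// Simplified sketch, not the actual kernel: UnwrapParams, NUM_POINTS,
// and processPoint() are stand-in names.
#define NUM_POINTS (320 * 240)

__global__ void phase2XYZ_global(UnwrapParams *dt)
{
    // Grid-stride loop so any launch configuration covers every point
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < NUM_POINTS;
         idx += gridDim.x * blockDim.x)
    {
        processPoint(dt, idx);   // the per-point math shown above
    }
}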

Do you mean this is already part of some CUDA kernel code you have written? Or that you are thinking of writing a CUDA kernel like this? Or something else?

That is a confusing question to me. Which intrinsics were you thinking of, specifically? There aren't any CUDA SIMD instructions that work on float data (the SIMD intrinsics operate on packed 8-bit and 16-bit integers). The majority of CUDA mathematical intrinsics are documented here.

For batched 3x3 matrix inversion, you can use cuBLAS (recommended), or, if you wish to write your own CUDA kernel, this may be of interest.
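For concreteness, the relevant cuBLAS routine is cublasSmatinvBatched(), which directly inverts batches of small matrices (n <= 32). A minimal sketch, assuming your matrices are already packed in device memory and that d_Aptrs, d_Ainvptrs, and d_info are device-side arrays you allocate:

#include <cublas_v2.h>

// Sketch only, error checking omitted. cuBLAS is column-major; since the
// inverse of the transpose is the transpose of the inverse, row-major
// callers can transpose the result (or adjust their indexing).
void invert3x3Batch(cublasHandle_t handle,
                    const float *const *d_Aptrs,  // device array: pointer to each A
                    float *const *d_Ainvptrs,     // device array: pointer to each output
                    int *d_info,                  // device array: one status int per matrix
                    int batchSize)                // 76800 in this case
{
    // n = 3, lda = lda_inv = 3; valid because cublasSmatinvBatched requires n <= 32
    cublasSmatinvBatched(handle, 3, d_Aptrs, 3, d_Ainvptrs, 3, d_info, batchSize);
}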


This is code that I have written, and it works. I am just trying to get from 42 milliseconds down to below 20 milliseconds.

Thanks for the links.

I have written high-performance CUDA code for batch inversion of tiny float matrices before.

(1) I assigned one matrix to each thread.
(2) I used Gauss-Jordan elimination for the inversion, with partial pivoting as an optional feature.
(3) I assigned all matrix elements to float scalars at the start, and assigned back to the matrix at the end.
(4) I maximized the use of fmaf() by coding it explicitly.

I seem to recall that this scheme worked well up to about 6x6 matrices, at which point register pressure became an issue. With 76K matrices to invert in your use case, handling one matrix per thread seems very much a suitable scheme; a sketch of the pattern is below.
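Not your code, of course, but a minimal sketch of what points (1), (3), and (4) look like for the 3x3 case, using the cofactor formulas from your post rather than Gauss-Jordan:

// Sketch only: one 3x3 matrix per thread, all elements held in scalar
// registers, explicit fmaf(). A and Ainv each hold batchSize row-major
// 3x3 matrices, 9 floats apiece.
__global__ void inv3x3_batched(const float *A, float *Ainv, int batchSize)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batchSize) return;
    const float *a = A + 9 * t;
    // (3): load every element into a scalar up front
    float a0 = a[0], a1 = a[1], a2 = a[2];
    float a3 = a[3], a4 = a[4], a5 = a[5];
    float a6 = a[6], a7 = a[7], a8 = a[8];
    // First-column cofactors, reused for both the determinant and the inverse
    float c0 = fmaf(a4, a8, -a7 * a5);
    float c3 = fmaf(a5, a6, -a3 * a8);
    float c6 = fmaf(a3, a7, -a6 * a4);
    float det = fmaf(a0, c0, fmaf(a1, c3, a2 * c6));
    float r = 1.0f / det;   // no singularity guard, as in the original code
    float *o = Ainv + 9 * t;
    o[0] = c0 * r;
    o[1] = fmaf(a2, a7, -a1 * a8) * r;
    o[2] = fmaf(a1, a5, -a2 * a4) * r;
    o[3] = c3 * r;
    o[4] = fmaf(a0, a8, -a2 * a6) * r;
    o[5] = fmaf(a2, a3, -a0 * a5) * r;
    o[6] = c6 * r;
    o[7] = fmaf(a1, a6, -a0 * a7) * r;
    o[8] = fmaf(a0, a4, -a1 * a3) * r;
}

A launch covering all of your matrices would then look like inv3x3_batched<<<(76800 + 255) / 256, 256>>>(d_A, d_Ainv, 76800);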


Thanks