Matrix Inversion Code

Hello, I’m trying to deploy a matrix inversion code, however I’m having a problem I do not know how to solve it.The article I’m using is called “Gauss-Jordan Matrix Inversion Speed-Up using GPUswith the Consideration of Power Consumption”
It consists of taking an n × n matrix, extending it to n x 2n, where the left half is the input matrix and the right side is the identity matrix.
I was able to implement code that works for matrix with n <= 1022, but for larger matrix the process does not work.

The code below shows how I’m allocating the threads and calling the kernel function.

int N = (n / 1024) + 1;
dim3 dimGrid((n / 1024) + 1, n, 1);
for(int i = 0; i < n; i++){
	fixAll<< <dimGrid, (n+2)/N>> >(d_A, n, i);

The code below shows the kernel function.

global void fixAll (float *Matrix, int n, int collId){
int T = blockIdx.x * blockDim.x + threadIdx.x;

if (T + collId < n*2){
	int TIndex = threadIdx.x;
	int B = blockIdx.y;
	float D = Matrix[collId*(n*2)+collId];
	__shared__ float SharedRow[1024];
	SharedRow[TIndex] = Matrix[collId*(n*2) + collId + T];
	 float C;
	C = Matrix[B * (n*2) + collId] / D;
	if(B == collId){
		Matrix[B * (n*2) + T + collId] /= D;
		Matrix[B * (n*2) + T + collId] -= C * SharedRow[TIndex];


What happens is that the kernel only returns “nan” as a response, not the inverse matrix.Doing some tests I realized until the line “Matrix[B * (n*2) + T + collId] /= D;” the code works, but, in the line “Matrix[B * (n*2) + T + collId] -= C * SharedRow[TIndex];” is where the mistake happens.

Doing some more tests I think the problem is in variable C, but I do not understand what might be wrong when affecting the result, since the same code works with dimension matrix n <= 1022.

Please, can anyone help me understand what’s going on? Thank you very much in advance.

Sorry for the bad English, I used google translate.

This is the link if you want to download the article

Add proper error checking to your code, and the problem will be obvious. If n == 1023, then

int N = (n / 1024) + 1;   // N=1
fixAll<<<dimGrid, 1025>>>(d_A, 1023, i);

You can’t have a thread block with 1025 threads (see CUDA documentation). The limit is 1024 threads, which is why your code works for n <= 1022.

Thanks for the answer. I had not thought of this possibility that you indicated because the array I am using has n = 2000, so the total threads per block is less than 1024 for this case. Anyway I will try to review this question. Thank you!

Reading the paper you are referencing:

the authors call out a “sync all threads” in their figures 1 and 2, and furthermore reference the use of syncthreads in the text.

I don’t see that in your code. Apart from that, I have concerns about the paper in general, as it appears that the authors believe they can synchronize threads across threadblocks with syncthreads, which is not the case.

The original paper:

had no such assumptions, and broke the work down into a sequence of 2 kernels that were called in a loop. The kernel launches in this case provide the necessary matrix/device-wide synchronization. I haven’t studied either paper in detail.

Thank you very much for the information. I’m going to study them and try to solve the problems.

Hello! I did not get into the code for a while, but I re-implemented it and I have more issues.

I decided to use this second papper that you indicated. Of the two kernels inidicados, the first works perfectly.
But the second only works with arrays of type 10x10, 20x20, if I have with larger arrays, like 100x100 the final result is incorrect, with some wrong values.

This is the kernel that has problems:

__global__ void fix_Column (double *Matrix, double *Idn, int n, int colID){
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
		if((i < n) && (j < n)){
		int TIndex = threadIdx.x;
		__shared__ double AColIdj;
		__shared__ double BColIdj;
		__shared__ double col_M[32];
		__shared__ double colj_M[32];
		__shared__ double colj_Idn[32];
		col_M[TIndex] = Matrix[i * n + colID]; // The colID column of the originaly matrix
			if (col_M[TIndex] != 0){
				colj_M[TIndex] = Matrix[i * n + j]; // The jth column of the originaly matrix
				colj_Idn[TIndex] = Idn[i * n + j];  // The jth column of the identity matrix
				AColIdj = Matrix[colID * n + j];	// The jth element of the colID row in originaly matrix
				BColIdj = Idn[colID * n + j];		// The jth element of the colID row in identity matrix
				if (i != colID){
					colj_M[TIndex] -= AColIdj * col_M[TIndex];
					colj_Idn[TIndex] -= BColIdj * col_M[TIndex]; 	
				Matrix[i * n + j] = colj_M[TIndex];
				Idn[i * n + j] = colj_Idn[TIndex];

The threads are being allocated as follows:

dim3 dimGrid_Col((n - 1) / 32 + 1, (n - 1) / 32 + 1, 1);
	dim3 dimBlock_Col(32, 32, 1);

for (int i = 0; i < n; i++){	
		fix_linha << <dimGrid, dimBlock>> >(d_A, dIdent, n, i); //kernel running
		fix_Column << <dimGrid_Col, dimBlock_Col>> >(d_A, dIdent, n, i);

“d_A” is original matrix
“dIdent” is the identity matrix

Anyone have any idea what might be happening? I have tried to set up the code in different ways, but since I am a newbie in CUDA, and where I am studying nobody has any knowledge of this technology, I am a little lost hahaha.
Maybe I’m asking for a lot, but I really want to implement this part for my project. Just to present the CUDA to the students and teachers of the institution where I study. Thank you!!