Matrix inverse accuracy errors

Hi y’all,

I’ve been trying to implement the inverse of a matrix, I’m not trying to solve any particular system of equations this is part of a project I have. There are 2 things I want help with:

[list=1]

1 If I use small matrices (from 2x2 to 50x50) my program executes well but the bigger the matrix the bigger the lost of accuracy and also in this case is much more faster the host than the device, so I conclude I should not use my parallel program for small matrices BUT why am I loosing to much accuracy?

2 If my matrices get bigger (from 100x100 to 2000x2000) the result becomes NaN for all the inverse matrix, in this case I don’t know what’s happening, any ideas?

I’ll post my code so you can see what I’m doing any suggestions/ideas/recommendations are very welcomed

``````__global__ void kernelPivotReduction(int idxI, int col, float *matrixf, float *identity, float pivot)

{

int idxJ = blockIdx.x * blockDim.x + threadIdx.x;

while (idxJ < col)

{

matrixf[idxI*col+idxJ] = matrixf[idxI*col+idxJ]/pivot;

identity[idxI*col+idxJ] = identity[idxI*col+idxJ]/pivot;

idxJ += blockDim.x * gridDim.x;

}

}

__global__ void kernelRowReduction(int idxI, int col, float *matrixf, float *identity, float pivot, int idxK)

{

int idxJ = blockIdx.x * blockDim.x + threadIdx.x;

while (idxJ < col)

{

matrixf[idxK*col+idxJ] = -1*pivot*matrixf[idxI*col+idxJ] + matrixf[idxK*col+idxJ];

identity[idxK*col+idxJ] = -1*pivot*identity[idxI*col+idxJ] + identity[idxK*col+idxJ];

idxJ += blockDim.x * gridDim.x;

}

}

void pprInverse(pprMatrix *matrix, pprMatrixf *identity)

{

float pivot;

float elapsedTime;

float *d_matrixf;

float *d_identity;

pprMatrixf matrixf;

u_int idxI, idxK;

cudaEvent_t start;

cudaEvent_t stop;

dim3 blocks(5,5);   //the number of threads and blocks isn't still tuned !

//Change from u_char to float to avoid lost of precision.

pprMatrix2Matrixf(matrix, &matrixf);

//Create start, stop events.

CHECK_ERROR( cudaEventCreate( &start ) );

CHECK_ERROR( cudaEventCreate( &stop ) );

//Launch the start event.

CHECK_ERROR( cudaEventRecord( start, 0 ) );

//Allocate memory on the GPU.

CHECK_ERROR( cudaMalloc((void**)&d_matrixf, sizeof(float)*matrixf.row*matrixf.col) );

CHECK_ERROR( cudaMalloc((void**)&d_identity, sizeof(float)*matrixf.row*matrixf.col) );

//Copy information from CPU(host) to GPU(device).

CHECK_ERROR( cudaMemcpy( d_matrixf, matrixf.data, sizeof(float)*matrixf.row*matrixf.col, cudaMemcpyHostToDevice ) );

CHECK_ERROR( cudaMemcpy( d_identity, identity->data, sizeof(float)*matrixf.row*matrixf.col, cudaMemcpyHostToDevice ) );

for (idxI = 0; idxI < matrixf.row; idxI++)

{

pivot = matrixf.data[idxI*matrixf.col+idxI];

kernelPivotReduction<<<blocks,threads>>>(idxI, matrixf.col, d_matrixf, d_identity, pivot);

for (idxK = 0; idxK < matrixf.row; idxK++)

{

if (idxK != idxI)

{

pivot = matrixf.data[idxK*matrixf.col+idxI];

kernelRowReduction<<<blocks,threads>>>(idxI, matrixf.col, d_matrixf, d_identity, pivot, idxK);

}

}

//Copy information from GPU(device) to CPU(host).

CHECK_ERROR( cudaMemcpy( matrixf.data, d_matrixf, sizeof(float)*matrixf.row*matrixf.col, cudaMemcpyDeviceToHost ) );

CHECK_ERROR( cudaMemcpy( identity->data, d_identity, sizeof(float)*matrixf.row*matrixf.col, cudaMemcpyDeviceToHost ) );

}

//Launch the stop event.

CHECK_ERROR( cudaEventRecord( stop, 0 ) );

CHECK_ERROR( cudaEventSynchronize( stop ) );

//Print the elapsed time.

CHECK_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );

printf( "Time to generate: %3.1f ms\n", elapsedTime );

//Destroy the events.

CHECK_ERROR( cudaEventDestroy( start ) );

CHECK_ERROR( cudaEventDestroy( stop ) );

//Free the allocated memory on GPU.

cudaFree(d_matrixf);

cudaFree(d_identity);

pprFreeMatrixfMem(&matrixf);

}
``````
1. I cannot understand why you sweep idxK from 0, to col-1
``````for (idxK = 0; idxK < matrixf.row; idxK++)

{

if (idxK != idxI)

...

}
``````

should it be

``````for (idxK = idxI+1; idxK < matrixf.row; idxK++)

{

....

}
``````
1. you use 2-D grid and 2-D thread block, but in your kernel, you only use partial indices, this incurs race condition.

If you persist in “int idxJ = blockIdx.x * blockDim.x + threadIdx.x;”, then try

``````dim3 blocks(5,0);   //the number of threads and blocks isn't still tuned !

``````

I’m doing inverse matrix using Gauss-Jordan method, so after making the pivot 1 I check (with IdxK) every row to make the numbers below/above the pivot zeros. That’s why I need to sweep all the matrix and then If the indexes are different (idxI != idxK, this means that the row I made the pivot 1 isn’t the one I’m trying to make zeros) I proceed with the row reduction.

Ooops, yeah I think you’re right I didn’t think about that…