Matrix Inversion source


I want to share my matrix inversion code. It is based on a blocked Gauss elimination algorithm, which I found in a presentation ("Gausselimination") by Christian Heinrich. I built the rest to invert a matrix.

This is one of my first CUDA projects; maybe there are already better implementations, but I didn't find one.

I used the VS Project Wizard 1.2 with a custom cuda.rules file (included in the zip). A test environment for MATLAB is also included, so you can easily load it as a library.

The results were as expected: I got a performance boost once the matrix size is larger than 128x128.

I need this algorithm to solve a matrix equation, so if you have a faster one, please tell me =)
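For readers who want a starting point: below is a minimal unblocked CPU sketch of matrix inversion via Gauss-Jordan elimination with partial pivoting. This is not the author's blocked CUDA implementation, just an illustration of the underlying idea (all names here are my own, not from the zip).

```c
#include <math.h>
#include <stdlib.h>

/* Sketch: invert an n x n row-major matrix in place using Gauss-Jordan
 * elimination with partial pivoting. Returns 0 on success, -1 if the
 * matrix is numerically singular. Hypothetical reference code, not the
 * blocked GPU version discussed in this thread. */
static int gauss_jordan_invert(float *a, int n)
{
    /* inv starts as the identity matrix and ends up holding A^(-1). */
    float *inv = calloc((size_t)n * n, sizeof *inv);
    if (!inv) return -1;
    for (int i = 0; i < n; i++) inv[i * n + i] = 1.0f;

    for (int col = 0; col < n; col++) {
        /* Partial pivoting: pick the row with the largest |entry| in this column. */
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabsf(a[r * n + col]) > fabsf(a[piv * n + col])) piv = r;
        if (fabsf(a[piv * n + col]) < 1e-6f) { free(inv); return -1; }
        if (piv != col) {
            for (int c = 0; c < n; c++) {
                float t = a[col * n + c]; a[col * n + c] = a[piv * n + c]; a[piv * n + c] = t;
                t = inv[col * n + c]; inv[col * n + c] = inv[piv * n + c]; inv[piv * n + c] = t;
            }
        }
        /* Scale the pivot row so the pivot becomes 1. */
        float d = a[col * n + col];
        for (int c = 0; c < n; c++) { a[col * n + c] /= d; inv[col * n + c] /= d; }
        /* Eliminate this column from all other rows. */
        for (int r = 0; r < n; r++) {
            if (r == col) continue;
            float f = a[r * n + col];
            for (int c = 0; c < n; c++) {
                a[r * n + c]   -= f * a[col * n + c];
                inv[r * n + c] -= f * inv[col * n + c];
            }
        }
    }
    for (int i = 0; i < n * n; i++) a[i] = inv[i];
    free(inv);
    return 0;
}
```

A blocked variant (as in the posted project) applies the same elimination steps to sub-blocks of the matrix so the working set fits fast memory, which is what makes the GPU version pay off above 128x128.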

I ran my tests with an
Asus GeForce GTX 260 (192 shaders @ 1.24 GHz)
AMD Athlon 64 X2 3800+ (Socket AM2, 2.0 GHz)

[Performance charts: small matrix sizes / larger matrix sizes]

Ch. Wagner

update: documentation (545 KB)


Nice work! Just wanted to know if you have ever written anything for blocking of 3D matrices? Any reference code you have available?



I’m guessing the performance charts are using timings with a straight-forward (not highly optimized) CPU implementation? Otherwise I don’t follow why Matlab would be that much faster. Do you happen to have timing results with some optimized math library (along the lines of MKL, etc.)?


Sorry, I don't have anything about solving 3D matrices. I am happy that I finished the work on the 2D stuff :))

Yes, the CPU implementation isn't optimized; I used the standard MS C compiler and there are no SIMD optimizations in the code. You can take a look at the reference C code in the zip archive. I was also surprised that the MATLAB version is so fast. The MATLAB documentation mentions that they use the LAPACK library to compute the inverse matrix, and I thought LAPACK was a fully optimized library for linear algebra problems. I will complete my documentation and add some measurements with other optimized libraries and other GPUs.

I tried compiling your code but got the error "cutil32D.dll missing", even though the file is present.
Do you have a version in which the CPU and GPU code are in the same file? I'm not comfortable with the project you have posted; I'm looking for code that computes the inversion on both CPU and GPU in a single file.

I got your code to compile fine, but your test example is flawed: the memory addressing wasn't correct. When fixed, the code no longer validates on the inverse of the identity matrix. I changed your matrix generation code to:

tmp = dataInput;
for (int i = 0; i < size; i++) {
    for (int j = 0; j < size; j++) {
        if (i == j) {
            tmp[i * size + j] = 1.0f;   // ones on the diagonal
        } else {
            tmp[i * size + j] = 0.0f;   // zeros everywhere else
        }
    }
}
The error gets smaller as the matrix size increases, but at this time it is not working correctly. I will look into your code to compare and contrast.
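A quick host-side check along these lines can quantify how far a computed inverse is from correct: multiply A by its claimed inverse and measure the largest deviation from the identity. (A hypothetical helper, not part of the posted project.)

```c
#include <math.h>

/* Sketch: given A and its claimed inverse (both row-major, n x n),
 * return the largest absolute deviation of A * Ainv from the identity
 * matrix. A correct inverse should give a value near machine epsilon. */
static float max_identity_error(const float *a, const float *ainv, int n)
{
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * ainv[k * n + j];
            float expect = (i == j) ? 1.0f : 0.0f;   /* identity entry */
            float err = fabsf(sum - expect);
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```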


I found a way to "trick" CULA into creating an inverse matrix in the free edition (normally that function is only available in the standard edition) for dense matrices. Everything is pretty simple. There is a function called GESV that solves linear systems of the form A * X = B (1), where B is not a vector but a matrix. Therefore the function solves multiple systems with different right-hand sides, given as the columns of matrix B.

So let's get down to business. It is known that A * A^(-1) = E (2), where E is the identity matrix (ones on the diagonal). Assume that X in formula (1) is A^(-1) and that B in formula (1) is E. Voila!

Of course this is an obvious thing, but maybe certain folks don't know about it :)
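As a toy illustration of the same idea in plain C (no CULA): solving A * x = e_j for each column e_j of the identity yields the columns of A^(-1). Here via Cramer's rule for a 2x2 system; the function name is my own, purely for illustration.

```c
/* Toy illustration of the GESV trick on a 2x2 matrix: solve A * x = e_j
 * for each identity column e_j; the solutions are the columns of A^(-1).
 * Row-major storage; assumes A is non-singular. */
static void invert2x2_by_columns(const float a[4], float inv[4])
{
    float det = a[0] * a[3] - a[1] * a[2];
    for (int j = 0; j < 2; j++) {
        float b0 = (j == 0) ? 1.0f : 0.0f;   /* right-hand side e_j */
        float b1 = (j == 1) ? 1.0f : 0.0f;
        /* Cramer's rule for the 2x2 system A * x = b */
        inv[0 * 2 + j] = (b0 * a[3] - a[1] * b1) / det;
        inv[1 * 2 + j] = (a[0] * b1 - b0 * a[2]) / det;
    }
}
```

GESV does the same thing at scale: one LU factorization of A, then one triangular solve per column of E.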

Source code:

#include <cula.h>

#define THREADS 512

__global__ void generateEye(float *E, int size);

// Assumes culaInitialize() has been called beforehand.
int matrixInvert(float *devInput, float *devOutput, int size){
    int *ipiv;
    float *cache;
    int size2 = size * size * sizeof(float);
    int blocks = (size * size) / THREADS + 1;

    cudaMalloc((void**)&ipiv, size * sizeof(int));   // one pivot index per row
    cudaMalloc((void**)&cache, size2);

    generateEye<<<blocks, THREADS>>>(devOutput, size);   // devOutput = E
    cudaMemcpy(cache, devInput, size2, cudaMemcpyDeviceToDevice);

    // Solves A * X = E; devOutput is overwritten with A^(-1).
    culaDeviceSgesv(size, size, cache, size, ipiv, devOutput, size);

    cudaFree(ipiv);
    cudaFree(cache);
    return 0;
}

__global__ void generateEye(float *E, int size){
    int offset = blockIdx.x * blockDim.x + threadIdx.x;
    int y = offset / size;
    int x = offset - y * size;

    if(offset >= size * size) return;   // guard threads past the last element

    if(x == y) E[offset] = 1;
    else       E[offset] = 0;
}


CULA is a fine tool, but I prefer not to pay for software. Greetings from Ukraine :)

Hi, I downloaded your method, but I'd like to implement one myself. The doc files are written in German; is there a way I could get a copy of the documentation in English? I would also like to contact you by e-mail.