Gaussian elimination, LU-factorization and shared memory

Hello, Everyone!
I am new in CUDA development. I am trying implement Gaussian elimination and LU-factorization.
global void Kernel(Matrix A, Matrix C) {

int idx = blockIdx.x * blockDim.x+ threadIdx.x ; 
int idy = blockIdx.y * blockDim.y+threadIdx.y ; 

//Allocating memory in the share memory of the device 
__shared__ float temp[5][7]; 

//Copying the data to the shared memory 
temp[idy][idx] = A.elements[(idy * (6+1)) + idx] ;
for(int i =1 ; i<6 ;i++) 
	if((idy + i) < 6) // NO Thread divergence here 
		float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]); 
		temp[i+idy][idx] = temp[i-1][idx] +((var1) * (temp[i+idy ][idx]));

	__syncthreads(); //Synchronizing all threads before Next iterat ion 
	C.elements[idy*(6+1) + idx] = temp[idy][idx]; 


dim3 dimBlock(6+1,6,1);
dim3 dimGrid(1,1,1);

Kernel<<<dimGrid, dimBlock>>>(d_A, d_C);

This code works fine, but when I increase number of matrix elements to 4000 program crashes because of limit in 16Kb shared memory. There are my questions:

  1. How can I rewrite my program to achieve loading big matrices?
  2. Are there any implementations of Gaussian elimination or LU-factorization?
  3. Are there some people which can I talk or discuss some questions regarding this issue via Skype or e-mail with?

Thanks for any reply!

Terrible user name.

Google is your friend, you can find what you need via a 10-minute search.

There is a LU sample in the CUDA SDK, which works well for all sizes matrices.–cuda-dynamic-parallelism-

You will need a compute capability of 3.5 for that one.

Also there is also the batched version in cuBLAS:

If you want to represent the Matrix as sparse you can use cuSPARSE LU decomp: