Hello, Everyone!
I am new in CUDA development. I am trying implement Gaussian elimination and LU-factorization.
global void Kernel(Matrix A, Matrix C) {
int idx = blockIdx.x * blockDim.x+ threadIdx.x ;
int idy = blockIdx.y * blockDim.y+threadIdx.y ;
//Allocating memory in the share memory of the device
__shared__ float temp[5][7];
//Copying the data to the shared memory
temp[idy][idx] = A.elements[(idy * (6+1)) + idx] ;
for(int i =1 ; i<6 ;i++)
{
if((idy + i) < 6) // NO Thread divergence here
{
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]);
temp[i+idy][idx] = temp[i-1][idx] +((var1) * (temp[i+idy ][idx]));
}
__syncthreads(); //Synchronizing all threads before Next iterat ion
}
C.elements[idy*(6+1) + idx] = temp[idy][idx];
}
and:
dim3 dimBlock(6+1,6,1);
dim3 dimGrid(1,1,1);
Kernel<<<dimGrid, dimBlock>>>(d_A, d_C);
This code works fine, but when I increase number of matrix elements to 4000 program crashes because of limit in 16Kb shared memory. There are my questions:
- How can I rewrite my program to achieve loading big matrices?
- Are there any implementations of Gaussian elimination or LU-factorization?
- Are there some people which can I talk or discuss some questions regarding this issue via Skype or e-mail with?
Thanks for any reply!