Solve Gaussian elimination with for loop

Hello!

I am trying to implement Gaussian elimination with CUDA.
I have an N*N matrix. To compute the new elements of this matrix I use the following code on the CPU, where C.width = N:

for (int z = 0; z < C.width - 1; z++)        // z: current pivot row
{
	for (int c = z + 1; c < C.width; c++)    // rows below the pivot
	{
		for (int d = z; d < C.width; d++)    // columns from the pivot onward
		{
			// B.elements[c*C.width+z] holds the multiplier for row c
			C.elements[c*C.width+d] = C.elements[c*C.width+d] - (B.elements[c*C.width+z] * C.elements[z*C.width+d]);
		}
	}
}

I am trying to implement this with CUDA. For example, with N = 512 I use:

dim3 dimBlock(16,16,1);
dim3 dimGrid(32,32,1);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

I think that at iteration i, (N-i)*N threads should compute the new elements:
if (idx > 511 || idy > 510)
	return;

for (int i = 1; i < 512; i++)
{
	// row idy+1 is updated using pivot row i-1 and the multiplier
	// C[idy+1][i-1] / C[i-1][i-1]
	if (idx >= i-1 && idy >= i-1)
		C.elements[(idy+1)*C.width+idx] = C.elements[(idy+1)*C.width+idx]
			- ((C.elements[(idy+1)*C.width+(i-1)] / C.elements[(i-1)*C.width+(i-1)]) * C.elements[(i-1)*C.width+idx]);

	__syncthreads();
}

The results on the GPU and the CPU are the same, but the speed-up is always only about 2x, i.e. Time(CPU) = 2 * Time(GPU):
For N=512: Time(CPU) = 1900 ms, Time(GPU) = 980 ms
For N=1024: Time(CPU) = 14000 ms, Time(GPU) = 7766 ms
...
I think the speed-up should be larger than this. Are there any mistakes in my parallel code? Can you help me rewrite it?

Thanks for any help!

You would be able to get a greater speed-up by using shared memory. You should check out the matrix multiplication example in the CUDA samples:

docs.nvidia.com/cuda/cuda-samples/#matrix-multiplication--cuda-runtime-api-version-
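
For example, here is a minimal sketch of that idea applied to your problem (a hypothetical kernel, not your actual code): it assumes one launch per pivot step z, a row-major N*N matrix, and blockDim.x == TILE. Each block loads the slice of the pivot row it needs into shared memory once, instead of every thread re-reading it from global memory:

#define TILE 16

// Hypothetical sketch: one kernel launch per pivot row z.
__global__ void eliminateStepShared(float *C, int N, int z)
{
	__shared__ float pivotRow[TILE];
	int d = blockIdx.x * TILE + threadIdx.x;        // column handled by this thread
	int c = blockIdx.y * blockDim.y + threadIdx.y;  // row handled by this thread

	// one row of threads per block loads the pivot-row slice
	if (threadIdx.y == 0 && d < N)
		pivotRow[threadIdx.x] = C[z * N + d];
	__syncthreads();  // synchronizes the threads of this block only

	// update the active submatrix; column z is read but never written here,
	// so every thread of row c sees the unmodified multiplier
	if (c > z && c < N && d > z && d < N)
	{
		float m = C[c * N + z] / C[z * N + z];  // multiplier for row c
		C[c * N + d] -= m * pivotRow[threadIdx.x];
	}
}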

Is your GPU code giving the right results? Note that __syncthreads() only synchronizes the threads within a single block; it provides no synchronization across the grid. Gaussian elimination is quite a tough case since it is essentially a reduction.
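
A common fix for the synchronization problem is to make each pivot step its own kernel launch: the kernel boundary gives you the grid-wide synchronization that __syncthreads() cannot. A sketch of the host loop, reusing the hypothetical eliminateStepShared kernel from the previous reply:

dim3 dimBlock(TILE, TILE, 1);
dim3 dimGrid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE, 1);

// kernels in the same stream execute in order, so each elimination step
// sees the completed results of the previous one
for (int z = 0; z < N - 1; ++z)
	eliminateStepShared<<<dimGrid, dimBlock>>>(d_C, N, z);
cudaDeviceSynchronize();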

Gaussian elimination can be seen as a two-step procedure. The first step transforms the linear system into an upper triangular one, and the second solves the resulting upper triangular system. The second step is trivial in CUDA and can be performed efficiently by cublasStrsm. The first step, which you are addressing in your post, is the tricky part.
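
For reference, here is a minimal sketch of the second step with cublasStrsm (hypothetical buffer names; note that cuBLAS assumes column-major storage, so a row-major upper triangular matrix appears transposed to cuBLAS):

#include <cublas_v2.h>

// Solves U * x = b for an upper triangular N x N matrix d_U, overwriting d_b.
void solveUpperTriangular(cublasHandle_t handle, const float *d_U, float *d_b, int N)
{
	const float one = 1.0f;
	cublasStrsm(handle,
	            CUBLAS_SIDE_LEFT,        // triangular matrix is on the left
	            CUBLAS_FILL_MODE_UPPER,  // U is upper triangular
	            CUBLAS_OP_N,             // no transpose
	            CUBLAS_DIAG_NON_UNIT,    // diagonal entries are not assumed to be 1
	            N, 1,                    // N x 1 right-hand side
	            &one,
	            d_U, N,
	            d_b, N);
}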

There are several optimized approaches to the first step. I think your approach is somewhat naive, and I recommend studying the literature to achieve decent speed-ups.

Basically, the transformation of the original system into an upper triangular one can be performed with a tiling approach which, in some respects, resembles the tiled matrix-matrix multiplication of the classical example in the CUDA C Programming Guide.

The tiling approach can be implemented either with purpose-written kernels or by making massive use of cuBLAS routines.
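
To give the flavor of the cuBLAS route, here is a hypothetical unblocked sketch (no pivoting, column-major N x N matrix d_A with leading dimension N): each elimination step becomes a cublasSscal to form the multipliers plus a cublasSger rank-1 update of the trailing submatrix. The tiled/blocked versions in the literature replace these with panel factorizations and cublasSgemm updates:

#include <cublas_v2.h>
#include <cuda_runtime.h>

void gaussEliminationCublas(cublasHandle_t handle, float *d_A, int N)
{
	const float minusOne = -1.0f;
	for (int z = 0; z < N - 1; ++z)
	{
		// fetch the pivot to the host to form 1/pivot (no pivoting here)
		float pivot;
		cudaMemcpy(&pivot, d_A + z * N + z, sizeof(float), cudaMemcpyDeviceToHost);
		float inv = 1.0f / pivot;

		int m = N - z - 1;  // size of the trailing block

		// multipliers: A(z+1:N-1, z) /= pivot
		cublasSscal(handle, m, &inv, d_A + z * N + z + 1, 1);

		// rank-1 update: A(z+1:, z+1:) -= A(z+1:, z) * A(z, z+1:)
		cublasSger(handle, m, m, &minusOne,
		           d_A + z * N + z + 1, 1,         // column of multipliers
		           d_A + (z + 1) * N + z, N,       // pivot row segment (stride N)
		           d_A + (z + 1) * N + z + 1, N);  // trailing submatrix
	}
}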

Last month (November 2013), the following paper

Manuel Carcenac, “From tile algorithm to stripe algorithm: a CUBLAS-based parallel implementation on GPUs of Gauss method for the resolution of extremely large dense linear systems stored on an array of solid state devices”, Journal of Supercomputing, DOI 10.1007/s11227-013-1043-3

proposed a tiling/striping approach based on the use of cuBLAS.

All the above-mentioned approaches are summarized in a presentation available on M. Carcenac's webpage, entitled Application: linear system resolution with Gauss method; see http://eng.eul.edu.tr/manuel/Course_on_Advanced_GPU_computing/6–Application_linear_system_resolution_with_Gauss_method.pdf.

Furthermore, a downloadable Visual Studio 2010 project implementing all of them, with some performance testing, is available from the Gaussian elimination with CUDA post; see http://www.orangeowlsolutions.com/archives/721. With the available code, you can run your own tests on your architecture of interest and see the improvement that M. Carcenac's approach brings over the others.