Hello!

I am trying to implement Gaussian elimination with CUDA.

I have an N*N matrix. To compute the new elements of this matrix I use the following code on the CPU, where C.width = N and B holds the elimination multipliers:

```
for (int z = 0; z < C.width - 1; z++)           // pivot row
{
    for (int c = z + 1; c < C.width; c++)       // rows below the pivot
    {
        for (int d = z; d < C.width; d++)       // columns from the pivot onward
        {
            C.elements[c*C.width+d] = C.elements[c*C.width+d]
                - (B.elements[c*C.width+z] * C.elements[z*C.width+d]);
        }
    }
}
```

I am trying to implement it with CUDA. For example, for N = 512:

```
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(32, 32, 1);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
```

I think each iteration i should use (N - i) * N threads to calculate the elements:

```
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    if (idx > 511 || idy > 510)
        return;

    for (int i = 1; i < 512; i++)
    {
        if (idx >= i - 1 && idy >= i - 1)
            C.elements[(idy+1)*C.width+idx] = C.elements[(idy+1)*C.width+idx]
                - ((C.elements[(idy+1)*C.width+(i-1)] / C.elements[(i-1)*C.width+(i-1)])
                   * C.elements[(i-1)*C.width+idx]);

        __syncthreads();
    }
}
```

The results obtained on the GPU and CPU are the same, but the solve time is always roughly Time(CPU) = 2 * Time(GPU):

For N = 512: Time(CPU) = 1900 ms, Time(GPU) = 980 ms

For N = 1024: Time(CPU) = 14000 ms, Time(GPU) = 7766 ms


I think the speed-up should be larger than what I am getting. Are there any mistakes in my parallel code? Can you help me rewrite it?

Thanks for any help!