Any method faster than the Gauss elimination method?


I implemented my own CUDA kernel for solving the linear system AX = B using the Gauss elimination method. Solving a 1500 x 1500 system takes roughly 1500 milliseconds. Is there a faster method that I can exploit? I am using an RTX 3070 Laptop GPU.



Did you try using the cuSOLVER library? A Google search turns up many questions about solving linear systems with cuSOLVER, including sample codes provided by NVIDIA.

Detailed questions about using cuSOLVER or any of the CUDA math libraries should be posted on the libraries forum.

If I were working on such a problem, I definitely would not start out by writing my own code, except maybe as a learning exercise. I would investigate high-quality libraries first.

If you are asking “is there any faster method I can exploit while writing my own kernels”, I won’t be able to help there. Perhaps others will have suggestions.

I started with cuSOLVER before, but I wasn’t sure about the results (probably I did not implement it correctly; I will try again anyway). Before going forward, though, do you think I can get better performance by using this library?
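One way to settle doubts about the results, whichever solver produced them, is to substitute the computed X back into the system and inspect the residual AX - B. The following is a minimal host-side sketch in C++ and is not taken from this thread: the function name max_abs_residual is hypothetical, and it assumes double precision, a single right-hand side, and column-major storage. cuSOLVER expects column-major matrices, so passing a row-major C array unchanged effectively solves with the transpose of A, which is one possible cause of results that look wrong.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute entry of the residual A*x - b.
// Assumption: A is n x n in column-major order, i.e. element (i, j)
// is stored at A[i + j * n], matching the layout cuSOLVER expects.
double max_abs_residual(const std::vector<double>& A,
                        const std::vector<double>& x,
                        const std::vector<double>& b) {
    const std::size_t n = b.size();
    assert(A.size() == n * n && x.size() == n);

    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        // Accumulate row i of A*x, then subtract b[i].
        double r = -b[i];
        for (std::size_t j = 0; j < n; ++j) {
            r += A[i + j * n] * x[j];
        }
        worst = std::max(worst, std::fabs(r));
    }
    return worst;
}
```

A residual that is tiny relative to the entries of B suggests the solve is correct. A large one points to a setup problem, such as a layout or leading-dimension mismatch, rather than to the library itself.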

I think it is likely. But really I am just guessing.

Do you know if it is normal for the function cusolverDnCreate to take too long for loading?

I don’t know what “too long for loading” means.

My expectation is that for most overhead functions like this, it should take at most a few milliseconds.

I suggest asking questions specific to libraries on the libraries forum.

Using the following code, I observe a latency of around 2 seconds. I am using Debug mode in Visual Studio 2022 (C/C++).

cusolverDnHandle_t* cusolverH_Array = (cusolverDnHandle_t*)malloc(SimData.SelectedGPUNumber * sizeof(cusolverDnHandle_t));
cudaStream_t* stream_Array = (cudaStream_t*)malloc(SimData.SelectedGPUNumber * sizeof(cudaStream_t));
// For each selected GPU
for (int i_SelGPU = 0; i_SelGPU < SimData.SelectedGPUNumber; i_SelGPU++)
{
	/* step 1: create cusolver handle, bind a stream */
	cusolverH_Array[i_SelGPU] = NULL;
	stream_Array[i_SelGPU] = NULL;
}

You’re not actually calling cusolverDnCreate in that code.

The first CUDA runtime API call per device could probably consume 300+ms, so that may be the main factor here.
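To see where the time actually goes, each call can be timed on its own, so that one-time initialization is not attributed to whichever line happens to run next. Below is a minimal sketch in C++. The helper elapsed_ms is hypothetical and not part of CUDA, and the CUDA calls appear only in comments because they need the toolkit and a GPU. Calling cudaFree(0) first is a common idiom for forcing context creation, so its cost can be measured separately from cusolverDnCreate.

```cpp
#include <chrono>

// Host wall-clock time of a single call to f, in milliseconds.
// Note: this measures host time only. For asynchronous GPU work,
// f would need to synchronize (e.g. cudaDeviceSynchronize()) before
// returning for the number to be meaningful.
template <typename F>
double elapsed_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical usage with CUDA (not compiled here):
//   cudaSetDevice(dev);
//   double t_context = elapsed_ms([] { cudaFree(0); });  // forces context creation
//   cusolverDnHandle_t handle = NULL;
//   double t_create  = elapsed_ms([&] { cusolverDnCreate(&handle); });
```

If t_context dominates, the delay is the per-device startup cost described above rather than anything specific to cuSOLVER.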

The 2-second latency occurs when stepping over the line cusolverH_Array[i_SelGPU] = NULL;

I just forgot to include cusolverDnCreate in the code above, but it still gives the same latency.

Any idea on this strange behaviour?

That’s just ordinary host C code. It has nothing to do with CUDA. No, I have no idea about the behavior.
