Use of Different Solvers on Both Host and Device


I am recently working on a numerical acoustics project with traditional boundary element method. The solution to the problem can be divided in the following two parts:

  1. Compute a matrix A. This will be a large dense complex matrix with size around 10000*10000. It is not guaranteed that this is a square matrix, because for some special frequencies, additional points will be added. Every element of this matrix has to be independently computed from geometry of an object, but since the data for each element is independent from that of the others, GPU can be used to facilitate this process.

  2. Solve a least square problem of min||Ax-b||. Since the matrix A has a size that is way much bigger than what can be allocated on GPU, CUDA provided libraries(cuSolver especially) are not applicable, or at least not efficient. What I plan to do is to solve this problem independently on host because the size of this problem can be handled by virtual memory. However, I don’t know if there is any BLAS library that can be used in compatibility with CUDA. I tried EIGEN, but the build failed.

In my mind, using CUDA must not necessarily mean you will have to abandon all host based libraries. But does anyone have experience with using a host library with CUDA together? Thank you for reading this. I appreciate your interest in this problem.


What exactly do you mean by compatibility? The use of CUDA does not preclude the use of host-side libraries that are being invoked by host code. In case these are template libraries with platform-specific code (e.g. SIMD intrinsics) it is recommended to process such host code directly with the host tool chain, and use the CUDA tool chain only for the host code that directly interfaces to the GPU.

I am not familiar with your area of expertise (acoustics processing) but a dense least-squares system of dimension ~10K seems unusual. Is this the best known approach to solving the underlying computational problem? In your field, are least-squares problems of this size routinely solved on CPUs? From a numerical standpoint, is the overall process generally robust (is the problem well conditioned)?