Non-negative least squares (NNLS) solver for many overdetermined small dense systems

MoreCoresForMePlz · October 13, 2021, 6:43pm

I would like to find an existing GPU library / algorithm that can solve relatively small overdetermined dense least squares systems with a non-negativity constraint. For example Ay = b where A is m x n, y is n x 1 and b is m x 1. In my case I might have m ~= 100 and n < 10. The number of systems like this could be > 10^6.

If I didn’t have the non-negativity constraint on the values in x, it appears I could solve this using cuSolver or I could use cuBLAS to do a QR decomposition. But, with the non-negativity constraint I need an iterative method to solve these systems. I would appreciate any help pointing me to an existing library or algorithm for solving this type of system on the GPU using CUDA with C/C++.

njuffa · October 13, 2021, 9:06pm

I am not aware of a publicly available CUDA-accelerated implementation of batched NNLS.

I looked into a very similar scenario some years back (smaller m, as I recall) and managed to program this straight from the algorithm given by Lawson and Hanson in their book “Solving Least Squares Problems”. It took me about two weeks to get my CUDA implementation to run well.

The not so good news is that even with some algorithmic enhancements for a GPU environment I was not able to beat the performance achieved with the Fortran code from Netlib (http://www.netlib.org/lawson-hanson/) when compiled with the Intel Fortran compiler at maximum optimization with multi-threading.

In your case the m is bigger and GPU caches have grown in size since I tried this, so there is a reasonable chance you might fare better. If your processing chain has already been ported to the GPU, and this is only one stage of it, it would likely be worthwhile to run NNLS on the GPU even if the performance isn’t so great, to avoid data movement between host system and GPU.

njuffa · October 13, 2021, 11:24pm

Relevant literature (I haven’t read the article, only the abstract):

Yuancheng Luo and Ramani Duraiswami, “Efficient parallel nonnegative least squares on multicore architectures”, SIAM J. Sci. Comput., Vol. 33, No. 5, 2011, pp. 2848-2863

To achieve improved performance, we use parallelizable procedures to efficiently update and downdate the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multicore CPUs.

MoreCoresForMePlz · October 14, 2021, 2:48pm

Thanks for your messages njuffa. I did take a peek at the Lawson-Hanson algorithm before posting here but I guess I thought since this is likely a common type of system that is routinely used that there would be some newer algorithms since then (almost 50 years?). Thank you also for the link to the paper from 2011, that looks like it really covers the case I’m working with. I was all excited since the paper includes a link to some GPU source code but alas the link gave me a 404. I guess I will pick through this paper and try to understand exactly what they are doing. Thanks again.

vacaloca · October 18, 2021, 3:17pm

The webpage was archived, but not the code attachments:
https://web.archive.org/web/20151023230458/http://www.cs.umd.edu/~yluo1/Projects/NNLS.html

Feel free to contact the author if you haven’t yet, maybe he still has what was on his university webpage years ago:
https://www.linkedin.com/in/yuancheng-mike-luo-a607b07b/

Topic		Replies	Views
cuSOLVER QR decomposition solving with non-square matrices CUDA Programming and Performance	10	5430	September 6, 2017
CUBlas and very large matrices CUDA Programming and Performance	3	809	September 30, 2019
cuSOLVER solving a overdetermined linear equation system GPU-Accelerated Libraries cusolver	8	1252	September 19, 2021
How to Solve A Very Large Dense Least Square Problem with Cuda GPU-Accelerated Libraries	3	2251	December 2, 2016
Is there any non-linear least squares optimization library which could support CUDA C++? GPU-Accelerated Libraries	0	468	July 13, 2020
CUDA Alternative of MATLAB lsqnonlin CUDA Programming and Performance cuda , matlab	3	1518	October 12, 2021
Parallel Direct Solvers with cuSOLVER: Batched QR Technical Blog	2	389	December 26, 2015
cuSPARSE to solve multiple independent sparse linear systems in parallel GPU-Accelerated Libraries	4	2159	March 3, 2014
Linear Algebra Solvers CUDA Programming and Performance	20	19117	February 7, 2009
Solving a Linear System or a Least Square Problem on Multiple GPUs CUDA Programming and Performance	13	829	September 12, 2019

Non-negative least squares (NNLS) solver for many overdetermined small dense systems

Related Topics