Non-negative least squares (NNLS) solver for many overdetermined small dense systems

I would like to find an existing GPU library or algorithm that can solve relatively small overdetermined dense least squares systems with a non-negativity constraint. For example, Ay = b, where A is m x n, y is n x 1, and b is m x 1. In my case I might have m ~= 100 and n < 10. The number of systems like this could be > 10^6.

If I didn’t have the non-negativity constraint on the values in y, it appears I could solve this using cuSolver, or I could use cuBLAS to do a QR decomposition. But with the non-negativity constraint I need an iterative method to solve these systems. I would appreciate any help pointing me to an existing library or algorithm for solving this type of system on the GPU using CUDA with C/C++.
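To make the distinction concrete, here is a small CPU reference in Python (using NumPy/SciPy, not the CUDA libraries mentioned above) at the problem sizes from the question. It shows why the constraint matters: simply clipping the negative entries of the unconstrained least-squares solution is feasible but generally not optimal, whereas a true NNLS solve is.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
m, n = 100, 8                      # sizes from the question: m ~ 100, n < 10
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Unconstrained least squares: may contain negative entries.
y_free, *_ = np.linalg.lstsq(A, b, rcond=None)

# Non-negative least squares (SciPy's nnls implements Lawson-Hanson).
y_nn, resid = nnls(A, b)
assert np.all(y_nn >= 0)

# Clipping the unconstrained solution to zero is feasible but suboptimal:
# the NNLS residual is never larger than the clipped-solution residual.
y_clip = np.clip(y_free, 0, None)
assert np.linalg.norm(A @ y_nn - b) <= np.linalg.norm(A @ y_clip - b) + 1e-12
```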

I am not aware of a publicly available CUDA-accelerated implementation of batched NNLS.

I looked into a very similar scenario some years back (smaller m, as I recall) and managed to program this straight from the algorithm given by Lawson and Hanson in their book “Solving Least Squares Problems”. It took me about two weeks to get my CUDA implementation to run well.

The not-so-good news is that even with some algorithmic enhancements for a GPU environment, I was not able to beat the performance achieved with the Fortran code from Netlib when compiled with the Intel Fortran compiler at maximum optimization with multi-threading.

In your case m is larger, and GPU caches have grown in size since I tried this, so there is a reasonable chance you might fare better. If your processing chain has already been ported to the GPU and this is only one stage of it, it would likely be worthwhile to run NNLS on the GPU even if its performance isn’t great, to avoid data movement between the host system and the GPU.

Relevant literature (I haven’t read the article, only the abstract):

Yuancheng Luo and Ramani Duraiswami, “Efficient parallel nonnegative least squares on multicore architectures”, SIAM J. Sci. Comput., Vol. 33, No. 5, 2011, pp. 2848-2863

To achieve improved performance, we use parallelizable procedures to efficiently update and downdate the QR factorization of the matrix at each iteration, to account for inserted and removed columns. We use a reordering strategy of the columns in the decomposition to reduce computation and memory access costs. We consider graphics processing units (GPUs) as a new mode for efficient parallel computations and compare our implementations to that of multicore CPUs.
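The column insertion/removal the abstract refers to maps directly onto QR updating and downdating. Before committing to a CUDA implementation, the idea can be prototyped on the CPU with SciPy’s `qr_insert`/`qr_delete` (this is my own illustration, not code from the paper): instead of refactorizing the passive-set matrix from scratch each time a variable enters or leaves, the existing factorization is modified.

```python
import numpy as np
from scipy.linalg import qr, qr_insert, qr_delete

rng = np.random.default_rng(0)
m, n = 100, 8
A = rng.standard_normal((m, n))

# Full QR of the initial set of passive columns.
Q, R = qr(A)                               # Q is (m, m), R is (m, n)

# A variable enters the passive set: append its column without refactorizing.
u = rng.standard_normal(m)
Q1, R1 = qr_insert(Q, R, u, n, which='col')
A1 = np.column_stack([A, u])
assert np.allclose(Q1 @ R1, A1)

# A variable leaves the passive set: downdate by deleting its column.
Q2, R2 = qr_delete(Q1, R1, 3, which='col')
A2 = np.delete(A1, 3, axis=1)
assert np.allclose(Q2 @ R2, A2)
```

An update or downdate costs O(mn) rather than the O(mn^2) of a fresh factorization, which is where the speedup in an active-set loop comes from.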


Thanks for your messages, njuffa. I did take a peek at the Lawson-Hanson algorithm before posting here, but I guess I assumed that, since this type of system is in routine use, there would be some newer algorithms by now (it’s been almost 50 years?). Thank you also for the link to the paper from 2011; that looks like it really covers the case I’m working with. I was excited that the paper includes a link to some GPU source code, but alas the link gave me a 404. I guess I will pick through the paper and try to understand exactly what they are doing. Thanks again.

The webpage was archived, but not the code attachments:

Feel free to contact the author if you haven’t yet, maybe he still has what was on his university webpage years ago:
