# Nonlinear least-squares?

Hi,

Have any CUDA samples been published/posted for nonlinear least-squares regression (Nelder-Mead, Levenberg–Marquardt, Gauss-Newton, simulated annealing, etc.)?

I have a problem with ~10^7 independent nonlinear regression tasks, each of which is small (between 3-16 floats being compared to a 2-parameter model). Seems perfect for GPU computing, no? Any help would be appreciated!

David

Sounds pretty good. Try having one thread handle each task, or 3-16 threads per task with each thread handling one subpart. Then align the tasks in memory so that you can read/write quickly. Basically you want a stride of 16 floats, so you may want some extra “space” (padding) between the smaller tasks.

Haven’t heard of any publications like the ones you’re asking for, but I just wanted to chime in and say that it looks promising, at least from 30 thousand feet.

Btw – Approximately what does the task calculation look like? How big is all the task data? Are all the tasks independent?

Thanks for the thoughts, kretleifur! Would be great if someone out there could port an existing parallel implementation to CUDA…

A typical problem would take a stack of 16 1-megapixel float images (64MB total), and solve the above inverse problem on each 16-pixel (64 byte) “column” independently. Additionally, I may have 2-8 sets of this data (128MB-512MB) to solve in parallel.
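For what it’s worth, the layout described above is already coalescing-friendly if the 16 images are kept as separate 1-megapixel planes stored back to back in one buffer: element k of pixel p’s “column” then sits at index k * num_pixels + p, so consecutive threads (consecutive p) touch consecutive addresses. A tiny CPU sketch of the gather, in plain C with names of my own choosing:

```c
#include <stddef.h>

/* Gather the per-pixel "column" from a stack of image planes stored
 * back to back in one buffer. Layout: element k of pixel p lives at
 * stack[k * num_pixels + p], so GPU threads that handle consecutive
 * pixels p read consecutive addresses (coalesced access). */
void gather_column(const float *stack, size_t num_pixels, size_t num_images,
                   size_t p, float *column)
{
    for (size_t k = 0; k < num_images; ++k)
        column[k] = stack[k * num_pixels + p];
}
```

With 16 images of 1 megapixel each, num_pixels is 2^20 and num_images is 16, and each thread would gather (or simply index) its own 64-byte column this way.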

If you have code of an existing parallel implementation, it is probably not too difficult to port.

Did you make any progress with this problem?

I am currently working on a similar problem in medical imaging - basically I have a set of roughly 10^5 vectors of length 200 and need to fit a model to each of these vectors.

In principle, this should be ideally suited for CUDA, but one of my problems is that the library I use for this task on the CPU makes heavy use of function pointers, so I can’t port it easily to the GPU.

So if anybody has a Levenberg-Marquardt implementation in CUDA, I’d be very glad to hear from him.

Best regards,

Michi

I’m just chiming in to say I would also like to see someone port some of these algorithms to CUDA.

With my current employment conditions though, if I were to write it even in my free time it would become company property. :P

Well, I can’t offer anything for this yet, but as part of a school (university) related project, I’m implementing a mathematics library in C# that focuses on linear algebra and optimization algorithms. Once the library is completed in C#, I’m planning to port as many algorithms to CUDA as possible, to seamlessly speed up the computations when a CUDA-compatible card is detected in the machine.

Like I said, I don’t really have a timeframe for this to be done, other than to say that I’ll need it for use in another project within a few months, so it shouldn’t be longer than that. Also, the complete source code (C# and CUDA) will be released under the BSD license, so you’ll be able to grab it and use it in whatever you like. I’ll post back here in the future when I have something to share…

Was wondering if anybody actually ported an implementation to CUDA. I am planning on writing an implementation of the Levenberg-Marquardt algorithm myself, but it’d be nice to know if someone else has done it. If I do complete it, I’ll post a link here.

Is Levenberg-Marquardt a search algorithm, or does it calculate the best fit directly?

http://en.wikipedia.org/wiki/Levenberg%E2%…uardt_algorithm
http://www.ics.forth.gr/~lourakis/levmar/
Use the lourakis source and replace the BLAS/LAPACK libraries with CUBLAS, MAGMA, or whatever works;
alternatively,
use lourakis without the libraries and replace the matrix code with material from the SDK, etc.
That way, you can keep working code at each step toward a final CUDA version where the data is no longer unnecessarily swapped between host and GPU.
As far as function pointers go, AFAIK templates will help at least to a point; see the discussion in http://forums.nvidia.com/lofiversion/index.php?t87781.html.
A platform solution seems to exist (mainly) for Linux: http://www.plm.eecs.uni-kassel.de/CuPP/.
Kernels that have to be called through a function pointer can be wrapped in an ordinary C function that calls the kernel directly; the C function can obviously be called by pointer.
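To illustrate that last point with a sketch in plain C (the names launch_model_a/launch_model_b are hypothetical; in a real CUDA build each wrapper body would be a <<<grid, block>>> kernel launch, which on old compute capabilities cannot itself be taken by address):

```c
/* Host-side wrappers around kernel launches. The wrappers are ordinary
 * C functions, so they CAN be stored in and called through function
 * pointers, even when the kernels themselves cannot. The loop bodies
 * below are CPU stand-ins for the actual kernel launches. */
typedef void (*launch_fn)(float *data, int n);

void launch_model_a(float *data, int n) {
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;   /* stand-in for kernel A */
}

void launch_model_b(float *data, int n) {
    for (int i = 0; i < n; ++i) data[i] += 1.0f;   /* stand-in for kernel B */
}

/* Driver code that only ever sees the function pointer,
 * just like the CPU library that selects a model at run time. */
void run_fit(launch_fn launch, float *data, int n) {
    launch(data, n);
}
```

The pointer-based dispatch stays entirely on the host, so the library’s existing model-selection machinery can be kept while only the inner numeric work moves to the GPU.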

Hi!

It’s been a while since the last post here…

I was wondering if anyone knows whether there is already a nonlinear least-squares implementation for CUDA available somewhere?

Thanks in advance for any hint :)

I’m porting cminpack, which includes lmder and lmdif (Levenberg-Marquardt optimization for a known Jacobian and an approximated Jacobian), to CUDA.

In particular, those functions are ported and tested OK for compute capability 1.1 (with float, i.e. single precision, and without pointers to functions).

I would like it to support double precision and pointers to functions, but I don’t have a GPU card to test on and can’t find any proper way to simulate, say, capability 2.0.

Any ideas?

Hi,

I’m also looking into a non-linear least squares implementation on GPU.

I have just ordered my GTX680, that comes with the latest CUDA capability 3.0. I would be happy to help test your code on my card when it arrives and is installed.

There does not seem to be any GPU package out there for parallel continuous minimization. The only thing I could find is this article, where section 6.3 covers the continuous Weierstrass function. The authors are part of the so-called ParadisEO project, which just put their GPU version of the code online, but I was not able to find the continuous case in their svn.

Sure hope this thread kicks back to life :)

Do you plan to share this work as open source software? If so, I’d be very interested in checking out what you have so far. I have compute capability 2.0 and above devices that I could test on.

Taylor

Hi Taylor,

The forums are back up again and my card has arrived (I’ve been playing around a bit with it meanwhile).
If you - and others, of course - are interested to develop this into an open source project, count me in.

I have experience working with optimization problems, and I have started on the CUDA documentation. If I were to do it for myself (which I still plan to do), I would start with levmar for a known Jacobian, box-constrained, using PyCUDA. Single-precision float is sufficient for my purposes.

I haven’t worked with cminpack, but I just saw it has a less restrictive license, which should add some momentum if some of you are working on commercial software.

Also, if something similar has been started somewhere else, please mention it here.

Cheers,
Ken

An efficient parallel Levenberg-Marquardt fitting library was published in PLOS ONE in Oct 2013 at http://www.plosone.org/article/info:doi/10.1371/journal.pone.0076665. It is titled “Efficient Parallel Levenberg-Marquardt Model Fitting towards Real-Time Automated Parametric Imaging Microscopy.” I think the people reading this page will find the work in this paper helpful.

The source code is publicly available at https://github.com/zhangdianwen/GPU-LMFit.