Weighted Least Squares Implementation?

I am trying to save myself a month of reinventing the wheel.

Has anyone coded the weighted least squares (WLS) algorithm under CUDA? (WLS is a close relative of OLS, ordinary least squares.) There is a famous algorithm written by Gentleman in 1974 and published in Applied Statistics as Algorithm AS75 (http://lib.stat.cmu.edu/apstat/75), which does this quite nicely. I transcoded it into C years ago, and it has served me well. Still, I would not mind having a version that runs 10 times faster when an NVIDIA card is present.

If not, has someone coded OLS? I would particularly be interested in implementations that can add observations one at a time.
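To be concrete about what "one at a time" means here, below is a minimal serial C sketch of a square-root-free Givens update in the spirit of Gentleman's method. It is written from the general technique, not copied from AS75, and it is plain C rather than a CUDA kernel. The function names (wls_include, wls_solve) and the packed row-major layout of rbar are my own assumptions for illustration. Weights are assumed non-negative and the design is assumed to have full rank.

```c
#include <assert.h>

/* Hypothetical names and storage layout, chosen for illustration only.
 * State for np regressors (caller zero-initialises all of it):
 *   d[np]             diagonal scaling factors
 *   rbar[np*(np-1)/2] strict upper triangle of the unit upper-triangular
 *                     factor, packed row by row
 *   thetab[np]        transformed right-hand side
 *   *sserr            running weighted residual sum of squares
 */

/* Fold one observation (xrow, y) with weight w >= 0 into the state.
 * Note: xrow is overwritten as scratch space. */
void wls_include(int np, double w, double *xrow, double y,
                 double *d, double *rbar, double *thetab, double *sserr)
{
    int nextr = 0;                      /* index of first element of row i in rbar */
    assert(np > 0);
    for (int i = 0; i < np; i++) {
        if (w == 0.0) return;           /* nothing left to contribute */
        double xi = xrow[i];
        if (xi == 0.0) {                /* row i is unaffected; skip it */
            nextr += np - i - 1;
            continue;
        }
        double di   = d[i];
        double dpi  = di + w * xi * xi; /* updated diagonal element */
        double cbar = di / dpi;
        double sbar = w * xi / dpi;
        w    = cbar * w;                /* weight carried to later rows */
        d[i] = dpi;
        for (int k = i + 1; k < np; k++) {
            double xk = xrow[k];
            xrow[k]     = xk - xi * rbar[nextr];
            rbar[nextr] = cbar * rbar[nextr] + sbar * xk;
            nextr++;
        }
        double yk = y;
        y         = yk - xi * thetab[i];
        thetab[i] = cbar * thetab[i] + sbar * yk;
    }
    *sserr += w * y * y;                /* what remains is residual */
}

/* Back-substitute to obtain the coefficients beta[0..np-1]. */
void wls_solve(int np, const double *rbar, const double *thetab, double *beta)
{
    assert(np > 0);
    for (int i = np - 1; i >= 0; i--) {
        int row = i * (2 * np - i - 1) / 2;   /* start of row i in rbar */
        beta[i] = thetab[i];
        for (int k = i + 1; k < np; k++)
            beta[i] -= rbar[row + (k - i - 1)] * beta[k];
    }
}
```

Usage would be: zero d, rbar, thetab and sserr, call wls_include once per observation (with a leading 1.0 in xrow for an intercept), then call wls_solve whenever coefficients are needed. Feeding it exact points on y = 1 + 2x should give back an intercept of 1 and a slope of 2. The sequential dependence between observations in this form is exactly what makes me unsure how well it maps onto a GPU.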

Any links would be highly appreciated.

Sincerely,

/ivo welch