Before I went and registered to download this, I thought I would ask if anyone knows anything about it. Where do we stand with “culatools.com”? Has anyone tried it? What LAPACK routines are available? I am leery of commercial enterprises… I am already addicted to Matlab and would prefer not to become dependent on other applications that require continued fees…
I do trust that a set of CUDA LAPACK routines will eventually be available.
Hello, I am one of the developers of CULA and I wanted to take a moment to comment on this thread. CULA has gone from Beta to full release as of the Nvidia GTC last week. Our CULA Basic package contains LU/system-solve, QR, SVD, and two versions of least squares in single-precision real and complex. This package is, and will remain, absolutely free. Additionally, we have Premium (for internal/personal use) and Commercial versions that include many more routines (24 total and counting) in all four precisions along with upgraded support.
This project is the result of a very tight collaboration with Nvidia, who provided significant input and assistance. In many ways, this is their vision of a LAPACK library for CUDA GPUs, and I don’t think they would have worked with us if they were developing their own version (to address gshi’s comment).
For Boxed Cylon, I understand the desire to avoid paying ongoing commercial fees, and as such we have tried to keep the cost as low as possible, in addition to providing a free version with what we feel are very high-value and polished routines. It’s worth noting that any ongoing fees are only for support and updates; any purchased copies will continue working indefinitely. If you would like to see user opinions, our forums are completely unfiltered and contain a number of posts from satisfied users - I invite you to come check them out.
Now that we have completed our Beta, had our general launch, and attended GTC we are excited to engage the community in a dialog.
I beta tested CULA and have now downloaded the Basic version.
I was very impressed with the results I was getting. I was mainly interested in doing QR decompositions and singular value decompositions. Including memory transfer times, I got a 6-8x speedup on QR decompositions. This was using a GTX260 card against an Intel Core Duo at 3 GHz running Intel’s optimized Math Kernel Library.
CULA is very easy to use; it has both a “C-interface” mode and a device mode.
From looking at CULA’s forums there have been some bugs etc., but it seems these are getting sorted out.
Anyway, this was all positive stuff, and I might seem like a CULA fanboy… Well, if I were to complain about anything, it would be that I couldn’t find a good API reference or many device-mode examples at this early stage… But since the API follows standard LAPACK calling conventions, it’s not terribly difficult to figure things out.
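For anyone wondering what “LAPACK standard calling conventions” means in practice, here is a minimal sketch calling the ordinary CPU `sgeqrf`/`sorgqr` routines through SciPy’s LAPACK wrappers. This is illustrative of the LAPACK convention (routine names like `sgeqrf`, `INFO` status codes, R in the upper triangle, Householder reflectors plus `tau` encoding Q), not CULA’s exact C prototypes - CULA’s host-interface names and arguments follow the same pattern but you should check its own docs for the precise signatures.

```python
# Sketch: single-precision QR factorization via the standard LAPACK
# sgeqrf routine (called here through SciPy on the CPU).
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(0)
m, n = 6, 4
a = np.asfortranarray(rng.random((m, n), dtype=np.float32))

# sgeqrf overwrites a copy of A: R ends up in the upper triangle,
# the Householder reflectors (with tau) implicitly encode Q.
qr, tau, work, info = lapack.sgeqrf(a)
assert info == 0  # LAPACK convention: INFO == 0 means success

r = np.triu(qr[:n, :])            # extract the n-by-n R factor
q, work, info = lapack.sorgqr(qr, tau)  # form Q explicitly
assert info == 0

# Sanity check: Q @ R reconstructs A (float32 tolerance)
assert np.allclose(q @ r, a, atol=1e-4)
```

Once you recognize this shape (routine name encodes precision and operation, status returned via `info`), mapping it onto CULA’s interface is mostly a matter of swapping the prefix and handling memory transfers yourself in device mode.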
Found an old test run; I think this one was done using Intel’s MKL on both cores. I guess this could give you some basic idea - remember this includes the memory transfer times. Also, I believe quite a few of these were improved for the full release…