"Beta of LAPACK optimized for CUDA GPUs available"

I received the following announcement in my mailbox from websupport@nvidia.com:


Subject: NVIDIA Online Update: Beta of LAPACK optimized for CUDA GPUs available

A new Driver Release Note has just been posted to your web site.

This Release Note is called: Beta of LAPACK optimized for CUDA GPUs


This Release Note is version 1.

Release Note Description: EM Photonics has released a beta of LAPACK optimized for CUDA GPUs. You can download the release from




Before I register to download this, I thought I would ask whether anyone knows anything about it. Where do we stand with “culatools.com”? Has anyone tried it, and which LAPACK routines are available? I am leery of commercial enterprises… I am already addicted to Matlab and would prefer not to become dependent on other applications that require continued fees…

I do trust that eventually a set of CUDA LAPACK routines will be available.

I thought NVIDIA were working on LAPACK with CUDA themselves.

Hopefully MAGMA (http://icl.cs.utk.edu/magma/) can give us something useful soon.

Hello, I am one of the developers of CULA and I wanted to take a moment to comment on this thread. CULA has gone from Beta to full release as of the Nvidia GTC last week. Our CULA Basic package contains LU/system-solve, QR, SVD, and two versions of least squares in single-precision real and complex. This package is, and will remain, absolutely free. Additionally, we have Premium (for internal/personal use) and Commercial versions that include many more routines (24 total and counting) in all four precisions along with upgraded support.

This project is the result of a very tight collaboration with Nvidia, who provided significant input and assistance. In many ways, this is their vision of a LAPACK library for CUDA GPUs, and I don’t think they would have worked with us if they were developing their own version (to address gshi’s comment).

For Boxed Cylon, I understand the desire to avoid paying ongoing commercial fees, and as such we have tried to keep the cost as low as possible in addition to providing a free version with what we feel are very high value and polished routines. It’s worth noting that any ongoing fees are only for support and updates; any purchased copies will continue working indefinitely. If you would like to see user opinions, our forums (http://www.culatools.com/forums) are completely unfiltered and contain a number of posts from satisfied users - I invite you to come check them out.

Now that we have completed our Beta, had our general launch, and attended GTC we are excited to engage the community in a dialog.

Thank you for reading,

CULAtools Team


I beta tested CULA and have now downloaded the Basic version.

I was very impressed with the results I was getting. I was mainly interested in doing QR decompositions and Singular Value Decompositions. With memory transfer times included, I got a 6-8x speedup when doing QR decompositions. This was using a GTX260 card against an Intel Core Duo at 3 GHz running Intel’s optimized Math Kernel Library.

CULA is very easy to use; it has both a “c-interface” mode and a device mode.

From looking at CULA’s forums, there have been some bugs etc., but it seems these are getting sorted.

Anyway, this was all positive stuff and I might seem like a CULA fanboy… Well, if I were to complain about anything, it would be that I couldn’t find a good API reference or many device mode examples at this early stage… But since the API follows LAPACK standard calling conventions, it’s not terribly difficult to figure things out.
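
To make the "LAPACK calling conventions" point concrete: LAPACK routines such as SGEQRF take the matrix dimensions M and N, a matrix A stored column by column (Fortran order), and a leading dimension LDA giving the stride between consecutive columns. The Python sketch below only illustrates that storage layout. It does not call CULA or LAPACK, and the helper names `to_column_major` and `element` are hypothetical, introduced here for illustration.

```python
def to_column_major(rows, lda=None):
    """Flatten a list-of-rows matrix into a column-major buffer.

    lda is the leading dimension (stride between columns); it must be
    at least the number of rows, as in the LAPACK convention.
    """
    m = len(rows)
    n = len(rows[0])
    if lda is None:
        lda = m
    if lda < m:
        raise ValueError("lda must be at least the number of rows")
    buf = [0.0] * (lda * n)
    for j in range(n):
        for i in range(m):
            # Element (i, j) lives at offset i + j * lda.
            buf[i + j * lda] = float(rows[i][j])
    return buf, m, n, lda


def element(buf, lda, i, j):
    """Read element (i, j) back from a column-major buffer."""
    return buf[i + j * lda]


a = [[1, 2, 3],
     [4, 5, 6]]
buf, m, n, lda = to_column_major(a)
print(buf)  # [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
```

Since C and Python default to row-major order, a conversion like this (or an explicit transpose) is typically needed before a buffer matches what a LAPACK-style routine expects.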

Great job JohnH!

Hi again,

I found an old test run; I think this one was done with Intel’s MKL using both cores. It should give you a basic idea. Remember that this includes the memory transfer times. Also, I believe quite a few of these were improved for the full release…

-- SGEQRF Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096       0.60      2.66    4.3939
5120       0.97      5.02    5.1807
6144       1.57      8.73    5.5748
7168       2.37     13.36    5.6350
8192       3.45     19.95    5.7745

-- SGETRF Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096       0.47      1.57    3.3462
5120       1.05      2.93    2.7777
6144       1.22      4.76    3.8952
7168       1.78      7.37    4.1305
8192       3.26     10.92    3.3504

-- SGELS Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096       0.77      2.84    3.6826
5120       1.32      5.30    4.0259
6144       2.04      8.89    4.3607
7168       3.02     13.83    4.5799
8192       4.29     20.35    4.7462

-- SGGLSE Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096       1.07      6.89    6.4155
5120       1.83     11.68    6.3866
6144       2.88     17.98    6.2448
7168       4.26     26.42    6.1963
8192       6.12     37.31    6.1006

-- SGESVD Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096      41.04    162.00    3.9475
5120      72.77    293.89    4.0383
6144     106.25    501.40    4.7192
7168     154.97    775.61    5.0049
8192   Host side allocation error.

-- SGESV Benchmark --

Size   CULA (s)   MKL (s)   Speedup
4096       0.63      1.58    2.5021
5120       1.30      2.94    2.2579
6144       1.57      4.77    3.0342
7168       2.26      7.38    3.2636
8192       3.87     10.96    2.8319
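
The Speedup column is the MKL time divided by the CULA time. A small Python sketch, using the SGEQRF rows from the benchmark above, recomputes it from the two-decimal timings. The results differ slightly from the listed values, presumably because the original figures were computed from unrounded timings (that is an assumption; the post does not say). The function names are made up for illustration.

```python
# (size, CULA seconds, MKL seconds, listed speedup), copied from the SGEQRF table
SGEQRF = [
    (4096, 0.60, 2.66, 4.3939),
    (5120, 0.97, 5.02, 5.1807),
    (6144, 1.57, 8.73, 5.5748),
    (7168, 2.37, 13.36, 5.6350),
    (8192, 3.45, 19.95, 5.7745),
]


def speedup(cula_s, mkl_s):
    """Speedup of CULA over MKL: how many times faster the GPU run was."""
    return mkl_s / cula_s


def compare(rows):
    """Return (size, recomputed, listed, relative difference) per row."""
    out = []
    for size, cula_s, mkl_s, listed in rows:
        recomputed = speedup(cula_s, mkl_s)
        out.append((size, recomputed, listed, abs(recomputed - listed) / listed))
    return out


for size, recomputed, listed, rel in compare(SGEQRF):
    print(f"{size:5d}  recomputed {recomputed:.2f}  listed {listed:.4f}  diff {rel:.1%}")
```

For these rows the recomputed and listed speedups agree to within about one percent, which is consistent with rounding of the printed timings.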