Will cuBLAS ever be completed?

I’ve found cuBLAS to be a helpful tool in my linear algebra studies, but it only implements a small subset of the full BLAS library. I think that CUDA/CUBLAS could be quite powerful (and very useful) if developed to the full functionality of the BLAS library, since higher-level tools like LAPACK could then be built on top of the cuBLAS interface.

If nVidia doesn’t ever plan to complete the cuBLAS library to allow for something like this, would they consider launching it as an open source project (on SourceForge, or similar) so that others could build on it and complete the library?

Again (I can’t stress this enough), the number of applications that could benefit from a full cuBLAS/cuLAPACK library is staggering. It could also give nVidia a real edge over the competition with the research audience.

You can already download the sources check

Yes…I’ve looked at those…however, they are not complete. As a programmer/numerical analyst, I have experience working with these algorithms, but I’m not (by any means) an expert on CUDA. I was hoping that nVidia would complete the library, as they probably have the most experience optimizing the routines for their hardware.

LAPACK (for the most part) simply makes calls to BLAS routines, so if nVidia could complete the library, it would be quite simple to build a LAPACK “wrapper” for cuBLAS. As for open-sourcing the library, I don’t know what license the source is released under, and I won’t simply take the provided code and extend it without nVidia giving it the green light.
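To make the "LAPACK is mostly BLAS calls" point concrete, here is a minimal sketch of an unblocked LU factorization with partial pivoting, in the spirit of LAPACK's xGETF2, written only in terms of BLAS-style kernels. The kernels (`iamax`, `swap_rows`, `scal`, `ger`) are hypothetical pure-Python stand-ins for BLAS or cuBLAS calls, not real library bindings, and the row-major list-of-lists layout is an assumption made for readability.

```python
# Sketch: unblocked LU factorization composed from BLAS-style kernels.
# The kernels below are pure-Python stand-ins, not real BLAS/cuBLAS bindings.

def iamax(x):
    """BLAS-1 style: index of the entry with the largest absolute value."""
    return max(range(len(x)), key=lambda i: abs(x[i]))

def swap_rows(a, i, j):
    """BLAS-1 style swap of two rows of a row-major matrix."""
    a[i], a[j] = a[j], a[i]

def scal(alpha, x):
    """BLAS-1 style scaling: return alpha * x."""
    return [alpha * v for v in x]

def ger(alpha, x, y, a, row0, col0):
    """BLAS-2 style rank-1 update: A[row0:, col0:] += alpha * x * y^T."""
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            a[row0 + i][col0 + j] += alpha * xi * yj

def lu_factor(a):
    """Factor a square matrix in place so that P*A = L*U.

    L (unit diagonal) is stored below the diagonal, U on and above it.
    Returns the pivot row chosen at each step.
    """
    n = len(a)
    pivots = []
    for k in range(n):
        # Find the pivot in column k, at or below the diagonal.
        p = k + iamax([a[i][k] for i in range(k, n)])
        pivots.append(p)
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        if p != k:
            swap_rows(a, k, p)
        # Scale the sub-column to form the multipliers (column k of L).
        mult = scal(1.0 / a[k][k], [a[i][k] for i in range(k + 1, n)])
        for i, m in enumerate(mult):
            a[k + 1 + i][k] = m
        # Rank-1 update of the trailing submatrix.
        ger(-1.0, mult, a[k][k + 1:], a, k + 1, k + 1)
    return pivots
```

Every floating-point operation in `lu_factor` happens inside one of the four kernels, which is why a complete BLAS underneath would cover most of what a routine like this needs.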

If they did open-source the library, I’d like to see it under a permissive or weak-copyleft license (BSD, MIT, LGPL) so that it could actually be used in commercial applications…

EDIT: Here is the license included in the source files:

/*
 * Copyright 1993-2008 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.
 *
 * This software and the information contained herein is being provided
 * under the terms and conditions of a Source Code License Agreement.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.   This source code is a "commercial item" as
 * that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of
 * "commercial computer  software"  and "commercial computer software
 * documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 */

The problem with just using LAPACK with a cuBLAS backend is that the memcopies from host->device->host will severely limit performance. The matrix algorithms in LAPACK really need to be rebuilt with CUDA optimizations in mind if you want good performance.
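A back-of-the-envelope model shows why the copies hurt. The sketch below compares the time to move operands over the bus with the time to do the arithmetic for one BLAS-1 call (saxpy) and one BLAS-3 call (sgemm), assuming every call ships its operands to the device and back. The bandwidth and throughput constants are illustrative assumptions, not measurements of any particular card.

```python
# Rough cost model: bus transfer time vs. compute time for a single BLAS call.
# Both hardware figures are assumptions chosen only for illustration.

PCIE_BYTES_PER_S = 4e9   # assumed host<->device bandwidth
GPU_FLOPS = 300e9        # assumed sustained single-precision throughput
BYTES_PER_FLOAT = 4

def transfer_seconds(n_values):
    """Time to move n_values single-precision numbers across the bus."""
    return n_values * BYTES_PER_FLOAT / PCIE_BYTES_PER_S

def compute_seconds(flops):
    """Time to execute the given number of floating-point operations."""
    return flops / GPU_FLOPS

def saxpy_ratio(n):
    """Transfer/compute ratio for y = a*x + y: move 3n values, do 2n flops."""
    return transfer_seconds(3 * n) / compute_seconds(2 * n)

def sgemm_ratio(n):
    """Same ratio for C = A*B with n x n matrices: 3n^2 values, 2n^3 flops."""
    return transfer_seconds(3 * n * n) / compute_seconds(2 * n ** 3)

for n in (256, 1024, 4096):
    print(n, saxpy_ratio(n), sgemm_ratio(n))
```

Under these assumed figures the saxpy ratio is about 450 for any n, so the copy dwarfs the work, while the sgemm ratio falls as 450/n and the copy only stops dominating for matrices larger than roughly 450 x 450. A LAPACK routine that issues many small BLAS-1 and BLAS-2 calls, each with its own round trip, sits on the wrong side of that balance.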

You might be interested in FLAME: http://www.cs.utexas.edu/users/flame/. It uses an automatic generation system to produce optimized matrix algorithms for multi-core CPUs and CUDA devices. The performance numbers they presented at NVISION were quite impressive.

Edit: I should add that some of the FLAME benchmarks presented were using all 4 GPUs in a Tesla S870, so it must work with multi-GPU systems too…

E.D. Riedijk, this is CUDA 1.1, unfortunately: no double precision support, etc. Does anyone know of a more recent download?

http://developer.download.nvidia.com/compu…_CUBLAS_2.0.zip
doesn’t work :)

The FLAME project is indeed cool, especially since these folks have a reputation for writing lightning-fast code. (I’ve been using GotoBLAS on single-core CPUs and it outperforms ATLAS, MKL, etc. by far. I’m a BLAS1-only guy, though.)

dom

Does nVidia have any plans for double precision support in CUBLAS, or is that left to the open source community? I am no expert on CUDA, and the idea may be dumb, but is it possible to simply search and replace float -> double in the CUBLAS source? I would do it myself, but I don’t know how to build a lib file from the modified sources. Can anyone at nVidia help me/us with this?

Almost. Doubles will use up double the shared mem, and so you’d have to adjust for that. Should be pretty easy though, just changing the blocking factor. A couple other tweaks like that would probably be needed elsewhere.
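The blocking-factor adjustment can be sketched with a little arithmetic. The snippet below assumes a 16 KB shared-memory budget per block, a single square tile held in shared memory, and tile edges rounded down to a multiple of 16; all three are assumptions for illustration, and `max_tile_dim` is a hypothetical helper, not something from the CUBLAS source.

```python
import math

SHARED_MEM_BYTES = 16 * 1024  # assumed shared-memory budget per block
TILE_MULTIPLE = 16            # assumed alignment for the tile edge

def max_tile_dim(elem_bytes, tiles=1, budget=SHARED_MEM_BYTES):
    """Largest tile edge b (a multiple of TILE_MULTIPLE) such that `tiles`
    square b x b tiles of elem_bytes-sized elements fit within `budget`."""
    b = math.isqrt(budget // (tiles * elem_bytes))
    return b - b % TILE_MULTIPLE

print("float tile edge: ", max_tile_dim(4))
print("double tile edge:", max_tile_dim(8))
```

With these assumptions a float tile can be 64 x 64 while a double tile drops to 32 x 32, which is the kind of change a plain float -> double search and replace would miss.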

CUBLAS really needs to be overhauled anyway. Am I correct in thinking that vvolkov’s big speedups were never incorporated?

According to Volkov’s upcoming paper, some of his optimizations have already been incorporated into CUBLAS:
http://scyourway.nacse.org/conference/view/pap341

No, you are not correct: vvolkov’s speedups are part of CUBLAS at this time. Also, reading the CUBLAS documentation that comes with the 2.0 install, I see double precision functions, so as far as I know doubles are already supported (at least for some functions).

I would still like to see nVidia implement the entire BLAS interface for the next release of CUBLAS. Doing this and adding some common LAPACK routines like _gesv (solving a general system of linear equations) and _gesvd (singular value decomposition) would make the whole library a ton more useful, as it could then be ‘dropped in’ to existing code that makes use of BLAS. The two LAPACK routines I mentioned are probably the most widely used, and would be very useful to anyone working with optimization, signal analysis, etc.

After reading some articles about this (specifically, Volkov’s May 2008 article on matrix factorizations on GPUs), I have to ask: does nVidia plan to release a ‘cuLAPACK’ library that incorporates his LU/QR/Cholesky factorizations? Apparently, Volkov was able to reach 98% efficiency (of the theoretical peak speed) with his code. I would love to get my hands on that…throw in an SVD algorithm, and I’ll be in heaven!

Has any progress been made over the past year or so? Are CUDA routines available to calculate, e.g., a determinant or an inverse?
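For reference, both quantities fall out of an LU factorization, so a GPU LU routine is the main building block needed. The sketch below is a plain CPU reference in pure Python, not a CUDA routine; the function names are hypothetical and it assumes small dense matrices stored as lists of rows.

```python
def lu_decompose(a):
    """LU with partial pivoting. Returns (lu, perm, sign), where row k of the
    factored matrix corresponds to row perm[k] of the input."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    sign = 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
            sign = -sign  # each row swap flips the determinant's sign
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm, sign

def determinant(a):
    """det(A) = sign * product of U's diagonal; 0.0 for a singular matrix."""
    try:
        lu, _, sign = lu_decompose(a)
    except ValueError:
        return 0.0
    det = sign
    for k in range(len(a)):
        det *= lu[k][k]
    return det

def lu_solve(lu, perm, b):
    """Solve A x = b given the factorization of A."""
    n = len(lu)
    x = [b[perm[i]] for i in range(n)]
    for i in range(n):            # forward substitution, unit-diagonal L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x

def inverse(a):
    """Invert A by solving A x = e_j for each column e_j of the identity."""
    n = len(a)
    lu, perm, _ = lu_decompose(a)
    cols = [lu_solve(lu, perm, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

In practice an explicit inverse is rarely needed; solving against the right-hand sides directly, as `lu_solve` does, is cheaper and usually more accurate.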

http://gpgpu.org/2009/08/23/culatools-lapack

It is now almost a year since profquail made the first posting in this thread. The nVidia BLAS offering is still far from complete. It is particularly lacking in complex functionality. The same can be said of all the alternative sources, including CULA and PGI, which only offer single-precision real routines right now.

MMB

That looks promising

http://icl.cs.utk.edu/magma/index.html
The MAGMA project, which has Volkov himself as a collaborator, also looks very promising. They already have an alpha release of the library with single and double precision LU, QR, and Cholesky (Linux binaries only, no sources). Volkov’s original code that was posted in this forum was single precision only. The algorithms they are using are more advanced, and they claim a significant speedup over Volkov’s code, especially for small matrix sizes (see “Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems”, page 10: http://icl.cs.utk.edu/news_pub/submissions/tdb.pdf ).