Will cuBLAS ever be completed?

jack · September 4, 2008, 2:06pm

I’ve found cuBLAS to be a helpful tool in my linear algebra studies, but it only implements a small subset of the full BLAS library. I think that CUDA/CUBLAS could be quite powerful (and very useful) if developed to the full functionality of the BLAS library, since higher-level tools like LAPACK could then be built on top of the cuBLAS interface.

If nVidia doesn’t ever plan to complete the cuBLAS library to allow for something like this, would they consider launching it as an open source project (on SourceForge, or similar) so that others could build on it and complete the library?

Again (I can’t stress this enough), the number of applications that could benefit from a full cuBLAS/cuLAPACK library is staggering… plus it could really give nVidia an edge over the competition with the research audience.

E.D_Riedijk · September 4, 2008, 3:04pm

You can already download the sources check

jack · September 4, 2008, 4:18pm

Yes…I’ve looked at those…however, they are not complete. As a programmer/numerical analyst, I have experience working with these algorithms, but I’m not (by any means) an expert on CUDA. I was hoping that nVidia would complete the library, as they probably have the most experience optimizing the routines for their hardware.

LAPACK (for the most part) simply makes calls to BLAS routines; so, if nVidia could complete the library, it would be quite simple to build a LAPACK “wrapper” for cuBLAS. As far as open-sourcing the library, I don’t know what license the source is released under, and I won’t simply take the provided code and extend it without nVidia giving it the green light.

If they did open source the library, I’d like to see it under a permissive license (BSD, MIT, LGPL) so that it could actually be used in commercial applications…

EDIT: Here is the license included in the source files:

/*
 * Copyright 1993-2008 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.
 *
 * This software and the information contained herein is being provided
 * under the terms and conditions of a Source Code License Agreement.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.   This source code is a "commercial item" as
 * that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of
 * "commercial computer  software"  and "commercial computer software
 * documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 */

MisterAnderson42 · September 4, 2008, 5:17pm

The problem with just using LAPACK with a cuBLAS backend is that the memcopies from host->device->host will severely limit performance. The matrix algorithms in LAPACK really need to be rebuilt with CUDA optimizations in mind if you want good performance.
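As a rough illustration of the point above, here is a back-of-the-envelope cost model in Python. The bandwidth and arithmetic rates are assumed placeholders, not measurements of any real card, and the function names are made up for this sketch. It estimates what share of one offloaded call is spent copying data when every BLAS call copies its operands to the device and the result back.

```python
# Assumed, illustrative hardware rates (placeholders, not measurements).
PCIE_BYTES_PER_S = 4e9    # host<->device copy bandwidth
GPU_FLOPS_PER_S = 300e9   # sustained arithmetic rate on the device


def gemm_cost(n, elem_bytes=4):
    """Flops and bytes copied for C = A*B + C with n-by-n matrices."""
    flops = 2 * n ** 3
    bytes_moved = 4 * n * n * elem_bytes  # A, B, C to the device; C back
    return flops, bytes_moved


def gemv_cost(n, elem_bytes=4):
    """Flops and bytes copied for y = A*x + y with an n-by-n matrix."""
    flops = 2 * n * n
    bytes_moved = (n * n + 3 * n) * elem_bytes  # A, x, y to the device; y back
    return flops, bytes_moved


def transfer_fraction(flops, bytes_moved):
    """Share of the total time spent on copies rather than arithmetic."""
    transfer = bytes_moved / PCIE_BYTES_PER_S
    compute = flops / GPU_FLOPS_PER_S
    return transfer / (transfer + compute)


if __name__ == "__main__":
    for n in (256, 1024, 4096):
        print(n,
              "gemm:", round(transfer_fraction(*gemm_cost(n)), 3),
              "gemv:", round(transfer_fraction(*gemv_cost(n)), 3))
```

Under these assumed rates the matrix-vector call is dominated by copies at every size, while the matrix-matrix call only amortizes its copies once the matrices are large. Keeping the data resident on the device across calls avoids the repeated copies.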

You might be interested in FLAME: http://www.cs.utexas.edu/users/flame/ It uses an automatic matrix algorithm generation system to optimally generate matrix algorithms on multi-core CPUs and CUDA devices. The performance numbers they presented at NVISION were quite impressive.

Edit: I should add that some of the FLAME benchmarks presented were using all 4 GPUs in a Tesla S870, so it must work with multi-gpu systems too…

e.ping · September 4, 2008, 11:38pm

E.D. Riedijk, this is CUDA 1.1, unfortunately. No double precision support etc. I don’t know of any more recent download, anyone?

http://developer.download.nvidia.com/compu…_CUBLAS_2.0.zip
doesn’t work :)

The FLAME project is indeed cool, especially since these folks have a reputation for writing lightning-fast code (I’ve been using GotoBLAS on single-core CPUs and it’s outperforming ATLAS, MKL, etc. by far. I’m a BLAS1-only guy though)

dom

shiggy · September 19, 2008, 4:58am

Does nVidia have any plans for double precision support in CUBLAS, or is it left to the open source community? I am no expert on CUDA, and the idea is maybe dumb, but… is it possible to search and replace float → double in the CUBLAS source? I would do it myself, but I don’t know how to build a lib file from the new sources. Can anyone at nVidia help me/us with this?

alex_dubinsky · September 19, 2008, 5:37am

Almost. Doubles will use up double the shared mem, and so you’d have to adjust for that. Should be pretty easy though, just changing the blocking factor. A couple other tweaks like that would probably be needed elsewhere.
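To make the shared memory point concrete, here is a small sizing sketch in Python. It assumes a kernel that keeps two square tiles in shared memory, a hypothetical budget of 8 KB per thread block, and tile sides rounded down to a multiple of 16. None of these values are taken from the CUBLAS source; they are only there to show how the blocking factor shrinks when the element size doubles.

```python
import math


def tile_bytes(side, elem_bytes, tiles=2):
    """Shared memory used by `tiles` square tiles of the given side length."""
    return tiles * side * side * elem_bytes


def largest_side(budget_bytes, elem_bytes, tiles=2, multiple=16):
    """Largest tile side (a multiple of `multiple`) that fits in the budget."""
    side = math.isqrt(budget_bytes // (tiles * elem_bytes))
    return side - side % multiple


if __name__ == "__main__":
    budget = 8 * 1024  # hypothetical per-block shared memory budget, in bytes
    for name, size in (("float", 4), ("double", 8)):
        side = largest_side(budget, size)
        print(name, "tile side:", side, "bytes used:", tile_bytes(side, size))
```

A tile size chosen for float needs twice the bytes once the elements become double, so the same budget only holds a smaller tile. That is the kind of blocking factor adjustment a plain search and replace would miss.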

CUBLAS really needs to be overhauled anyway. Am I correct in thinking that vvolkov’s big speedups were never incorporated?

Simon_Green · September 19, 2008, 8:14am

According to Volkov’s upcoming paper, some of his optimizations have already been incorporated into CUBLAS:
http://scyourway.nacse.org/conference/view/pap341

E.D_Riedijk · September 19, 2008, 8:25am

No, you are not correct: vvolkov’s speedups are part of CUBLAS at this time. Also, reading the CUBLAS documentation that is installed with 2.0, I see double precision functions, so as far as I know, doubles are already supported (at least for some functions).

jack · September 19, 2008, 8:00pm

I would still like to see nVidia implement the entire BLAS interface for the next release of CUBLAS. Doing this and adding some common LAPACK routines like _gesv (solving a general system of linear equations) and _gesvd (singular value decomposition) would really make the whole library a ton more useful, as it could then be ‘dropped in’ to existing code that makes use of BLAS. The two LAPACK routines I mentioned are probably the most widely used, and would be very useful to anyone working with optimization, signal analysis, etc.
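For readers unfamiliar with how a _gesv-style solver sits on top of BLAS-style building blocks, here is a toy sketch in plain Python. The helpers `iamax`, `axpy`, and `dot` only mimic the roles of the BLAS level 1 routines of the same names, and `gesv_sketch` is a hypothetical name. The real LAPACK _gesv uses a blocked LU factorization (_getrf followed by _getrs) built on level 3 calls, which this sketch does not attempt.

```python
def iamax(x):
    """Index of the entry with the largest absolute value."""
    return max(range(len(x)), key=lambda i: abs(x[i]))


def axpy(alpha, x, y):
    """In-place update y <- alpha * x + y."""
    for i in range(len(x)):
        y[i] += alpha * x[i]


def dot(x, y):
    """Inner product of two equally long sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))


def gesv_sketch(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    `a` is a list of rows and `b` a list; both are left unmodified.
    """
    n = len(a)
    rows = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for k in range(n):
        # Pick the pivot row: largest entry in column k at or below row k.
        p = k + iamax([rows[i][k] for i in range(k, n)])
        if rows[p][k] == 0:
            raise ValueError("matrix is singular")
        rows[k], rows[p] = rows[p], rows[k]
        # Eliminate column k from every row below the pivot.
        for i in range(k + 1, n):
            axpy(-rows[i][k] / rows[k][k], rows[k], rows[i])
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (rows[i][n] - dot(rows[i][i + 1:n], x[i + 1:n])) / rows[i][i]
    return x


if __name__ == "__main__":
    print(gesv_sketch([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

Every arithmetic step in the solver goes through one of the three helpers, so replacing them with accelerated versions speeds up the solver without touching its logic. That is the appeal of a complete BLAS underneath LAPACK.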

jack · September 24, 2008, 4:32pm

After reading some articles about this (specifically, Volkov’s May 2008 article on matrix factorizations on GPUs), I have to ask: does nVidia plan to release a ‘cuLAPACK’ library that incorporates his LU/QR/Cholesky factorizations? Apparently, Volkov was able to reach 98% efficiency (of the theoretical peak speed) with his code. I would love to get my hands on that… throw in an SVD algorithm, and I’ll be in heaven!

pcrs · August 24, 2009, 9:10pm

Has any progress been made since about a year ago? Are CUDA routines available to, e.g., calculate a determinant or an inverse?

e.ping · August 24, 2009, 9:14pm

http://gpgpu.org/2009/08/23/culatools-lapack

MMB · August 24, 2009, 9:24pm

It is now almost a year since profquail made the first posting in this thread. The nVidia BLAS offering is still far from complete. It is particularly lacking in complex functionality. The same can be said of all the alternative sources, including CULA and PGI, who only offer single-precision real products right now.

MMB

pcrs · August 25, 2009, 10:58am

That looks promising

cern_freak · August 31, 2009, 7:15am

http://icl.cs.utk.edu/magma/index.html
The Magma project, which has Volkov himself as a collaborator, also looks very promising. They already have an alpha release of the library with single and double precision LU, QR, and Cholesky (Linux binaries only, no sources). Volkov’s original code that was posted in this forum was for single precision only. The algorithms they are using are more advanced, and they claim a significant speedup over Volkov’s code, especially for small matrix sizes (see “Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems”, page 10: http://icl.cs.utk.edu/news_pub/submissions/tdb.pdf).
