Anyone ever used CUBLAS? Changes in implementation...

I’m trying to implement a simple program using CUBLAS, but I noticed something has changed in the recent CUBLAS library. I thought that before version 4, CUBLAS was basically the BLAS library converted to device functions so you could run them on the cores, but reading through the new documentation (4.x) it looks like they’ve changed it so that you call CUBLAS functions from the host and they just execute on the device.

What I’m trying to do is a large sequence of matrix operations. The way I originally wanted/needed to implement it was to give each thread its input matrices and have them all churn through the algorithm in parallel. But since it looks like CUBLAS has to be called from the host (and it doesn’t auto-parallelize across the device, so I don’t know why they did it this way in the first place), I can’t call it inside a kernel. So I don’t see how to do what I want, short of writing my own device functions to handle the matrix multiplications, transposes, etc.

All CUBLAS functions are invoked from the host. They cannot be called from device code. This has not changed from the very first release of CUBLAS through CUDA 4.1.

To work efficiently with CUBLAS, one copies data from the host to the device, applies a series of CUBLAS operations to the data now resident on the device, then copies the results back from device to host at the end. Of course CUBLAS operations can be interspersed with custom kernels as desired. Note that many BLAS (and thus CUBLAS) operations allow for the optional (and implicit) transposition of input matrices, so explicit transposition of matrices is rarely required. However, if you do need to transpose a matrix, there is a whitepaper available on how to do that efficiently:

http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/transpose/doc/MatrixTranspose.pdf
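The workflow described above might look like the following minimal sketch, using the CUBLAS v2 API (`cublas_v2.h`). Error checking is omitted for brevity; the function name and the assumption of a single SGEMM on square matrices are mine, for illustration:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Compute C = A * B for n x n single-precision matrices (column-major),
// keeping the data on the device between operations.
void gemm_on_device(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Host -> device once, up front.
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

    // Any number of CUBLAS calls (and custom kernels) can operate on
    // dA/dB/dC here without further host<->device transfers.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // Device -> host once, at the end.
    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Note that transposition is folded into the GEMM call itself: passing `CUBLAS_OP_T` for the first or second operand multiplies with the transposed matrix, with no separate transpose step.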

I understand this, but because the calls are made from the host, doesn’t that limit the way in which you parallelize the program?

E.g. say I have something like 10000 matrices, and each matrix has another matrix I want to multiply it with. Even after I’ve copied all of the data onto the GPU, wouldn’t the way CUBLAS is set up mean I’d have to call cublasSgemm 10000 times from the host?

The way I had originally envisioned this problem is having each of the matrices in a giant array, copying the whole thing into device memory, and then having the kernel function map a pair of matrices to a particular thread. The way I’m understanding the CUBLAS implementation, this kind of parallelization (or any, for that matter) won’t work.

I haven’t used this myself, but my understanding is that CUBLAS in CUDA 4.1 has limited batch support, in particular for *GEMM. This would be useful in cases where many small matrices need to be multiplied. I would expect the performance gains for large matrices to be minimal; for those, one might as well do a kernel launch per matrix. Please take a look at the CUBLAS documentation.
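For the 10000-matrix case above, the batched interface replaces the 10000 host-side calls with a single one. A sketch, assuming `cublasSgemmBatched` as documented for CUDA 4.1 (note that CUBLAS expects the per-matrix pointer arrays themselves to reside in device memory; the wrapper function and its parameter layout are my own, for illustration):

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiply 'count' independent pairs of n x n matrices with one host call.
// hAptrs[i], hBptrs[i], hCptrs[i] are device pointers to the i-th matrices,
// gathered in host arrays.
void batched_gemm(cublasHandle_t handle,
                  const float **hAptrs, const float **hBptrs, float **hCptrs,
                  int n, int count)
{
    // Copy the arrays of device pointers into device memory, as CUBLAS
    // requires for the batched interface.
    const float **dAptrs, **dBptrs;
    float **dCptrs;
    cudaMalloc((void **)&dAptrs, count * sizeof(float *));
    cudaMalloc((void **)&dBptrs, count * sizeof(float *));
    cudaMalloc((void **)&dCptrs, count * sizeof(float *));
    cudaMemcpy(dAptrs, hAptrs, count * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBptrs, hBptrs, count * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dCptrs, hCptrs, count * sizeof(float *), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // One call performs all 'count' multiplications on the device.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dAptrs, n, dBptrs, n, &beta, dCptrs, n, count);

    cudaFree(dAptrs); cudaFree(dBptrs); cudaFree(dCptrs);
}
```

This keeps all 10000 multiplications as a single kernel launch on the device, which is exactly the situation (many small independent GEMMs) the batched interface targets.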

On the registered developer website, we offer an example of batched solvers (with partial pivoting) and batched matrix inversion of small matrices. The approach demonstrated there may be suitable for other operations on small to very small matrices. As a rough guideline, for matrices up to about 10x10 I would suggest looking into one-matrix-per-thread approaches, with all data in registers. For slightly larger matrices, say up to 100x100, look into a matrix-per-threadblock approach, keeping matrix data in shared memory. Beyond that size, I would recommend using CUBLAS.
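The one-matrix-per-thread approach for tiny matrices might be sketched like this (the dimension, flat row-major layout, and kernel name are assumptions for illustration; the batched-solver code on the registered developer website is the authoritative example):

```cpp
#include <cuda_runtime.h>

#define DIM 3  // matrix dimension; small enough for the data to sit in registers

// Each thread multiplies one independent pair of DIM x DIM matrices.
// A, B, C each hold 'count' matrices, stored contiguously row-major.
__global__ void small_gemm_per_thread(const float *A, const float *B,
                                      float *C, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    // Load this thread's two matrices into registers/local memory.
    float a[DIM * DIM], b[DIM * DIM];
    for (int k = 0; k < DIM * DIM; ++k) {
        a[k] = A[i * DIM * DIM + k];
        b[k] = B[i * DIM * DIM + k];
    }

    // Plain triple loop; at this size the compiler can fully unroll it.
    for (int r = 0; r < DIM; ++r)
        for (int c = 0; c < DIM; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < DIM; ++k)
                acc += a[r * DIM + k] * b[k * DIM + c];
            C[i * DIM * DIM + r * DIM + c] = acc;
        }
}

// Launched as, e.g.:
//   small_gemm_per_thread<<<(count + 255) / 256, 256>>>(dA, dB, dC, count);
```

The matrix-per-threadblock variant for mid-sized matrices follows the same shape, but stages the two input matrices in `__shared__` memory and splits the output elements across the threads of the block.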