Anyone ever used CUBLAS? Changes in implementation...

I’m trying to implement a simple program using CUBLAS, but I noticed something has changed in the recent CUBLAS library. I thought that before version 4, CUBLAS was basically the BLAS library converted to device functions so you could run them on the cores, but reading through the new documentation (4.x) it looks like they’ve changed it so that you call CUBLAS functions from the host and they just execute on the device.

What I’m trying to do is a large sequence of matrix operations. The way I originally wanted/needed to implement it was to give each thread its input matrices and have them all churn through the algorithm in parallel. But since it looks like CUBLAS has to be called from the host (and it doesn’t auto-parallelize across the device, so I don’t know why they did it this way in the first place), I can’t call it inside a kernel. So I don’t see how to do what I want, short of writing my own device functions to handle the matrix multiplications, transposes, etc.

All CUBLAS functions are invoked from the host. They cannot be called from device code. This has not changed from the very first release of CUBLAS through CUDA 4.1.

To work efficiently with CUBLAS, one copies data from the host to the device, applies a series of CUBLAS operations to the data now resident on the device, then copies the results back from device to host at the end. Of course CUBLAS operations can be interspersed with custom kernels as desired. Note that many BLAS (and thus CUBLAS) operations allow for the optional (and implicit) transposition of input matrices, so explicit transposition of matrices is rarely required. However, if you do need to transpose a matrix, there is a whitepaper available on how to do that efficiently:

http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/transpose/doc/MatrixTranspose.pdf
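The workflow described above might look like the following minimal sketch, using the CUBLAS v2 API (`cublas_v2.h`). Error checking is omitted for brevity; the function name and the assumption of a single SGEMM on square matrices are mine, for illustration:

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Compute C = A * B for n x n single-precision matrices (column-major),
// keeping the data on the device between operations.
void gemm_on_device(const float *hA, const float *hB, float *hC, int n)
{
    float *dA, *dB, *dC;
    size_t bytes = (size_t)n * n * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Host -> device once, up front.
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

    // Any number of CUBLAS calls (and custom kernels) can operate on
    // dA/dB/dC here without further host<->device transfers.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // Device -> host once, at the end.
    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```

Note that transposition is folded into the GEMM call itself: passing `CUBLAS_OP_T` for the first or second operand multiplies with the transposed matrix, with no separate transpose step.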

I understand this, but because the calls are made from the host, doesn’t that limit the way in which you parallelize the program?

E.g. say I have something like 10000 matrices, and each matrix has another matrix I want to multiply it with. Even after I’ve copied all of the data onto the GPU, wouldn’t the way CUBLAS is set up mean I’d have to call cublasSgemm 10000 times from the host?

The way I had originally envisioned this problem is having each of the matrices in a giant array, copying the whole thing into device memory, and then having the kernel function map a pair of matrices to a particular thread. The way I’m understanding the CUBLAS implementation, this kind of parallelization (or any, for that matter) won’t work.

I haven’t used this myself, but my understanding is that CUBLAS in CUDA 4.1 has limited batch support, in particular for *GEMM. This would be useful in cases where many small matrices need to be multiplied. I would expect the performance gains for large matrices to be minimal; for those, one might as well do a kernel launch per matrix. Please take a look at the CUBLAS documentation.
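For the 10000-matrix case above, the batched interface replaces the 10000 host-side calls with a single one. A sketch, assuming `cublasSgemmBatched` as documented for CUDA 4.1 (note that CUBLAS expects the per-matrix pointer arrays themselves to reside in device memory; the wrapper function and its parameter layout are my own, for illustration):

```cpp
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Multiply 'count' independent pairs of n x n matrices with one host call.
// hAptrs[i], hBptrs[i], hCptrs[i] are device pointers to the i-th matrices,
// gathered in host arrays.
void batched_gemm(cublasHandle_t handle,
                  const float **hAptrs, const float **hBptrs, float **hCptrs,
                  int n, int count)
{
    // Copy the arrays of device pointers into device memory, as CUBLAS
    // requires for the batched interface.
    const float **dAptrs, **dBptrs;
    float **dCptrs;
    cudaMalloc((void **)&dAptrs, count * sizeof(float *));
    cudaMalloc((void **)&dBptrs, count * sizeof(float *));
    cudaMalloc((void **)&dCptrs, count * sizeof(float *));
    cudaMemcpy(dAptrs, hAptrs, count * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dBptrs, hBptrs, count * sizeof(float *), cudaMemcpyHostToDevice);
    cudaMemcpy(dCptrs, hCptrs, count * sizeof(float *), cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    // One call performs all 'count' multiplications on the device.
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dAptrs, n, dBptrs, n, &beta, dCptrs, n, count);

    cudaFree(dAptrs); cudaFree(dBptrs); cudaFree(dCptrs);
}
```

This keeps all 10000 multiplications as a single kernel launch on the device, which is exactly the situation (many small independent GEMMs) the batched interface targets.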

On the registered developer website, we offer an example of batched solvers (with partial pivoting) and batched matrix inversion of small matrices. The approach demonstrated there may be suitable for other operations on small to very small matrices. As a rough guideline, for matrices up to about 10x10 I would suggest looking into one-matrix-per-thread approaches, with all data in registers. For slightly larger matrices, say up to 100x100, look into a matrix-per-threadblock approach, keeping matrix data in shared memory. Beyond that size, I would recommend using CUBLAS.
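The one-matrix-per-thread approach for tiny matrices might be sketched like this (the dimension, flat row-major layout, and kernel name are assumptions for illustration; the batched-solver code on the registered developer website is the authoritative example):

```cpp
#include <cuda_runtime.h>

#define DIM 3  // matrix dimension; small enough for the data to sit in registers

// Each thread multiplies one independent pair of DIM x DIM matrices.
// A, B, C each hold 'count' matrices, stored contiguously row-major.
__global__ void small_gemm_per_thread(const float *A, const float *B,
                                      float *C, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;

    // Load this thread's two matrices into registers/local memory.
    float a[DIM * DIM], b[DIM * DIM];
    for (int k = 0; k < DIM * DIM; ++k) {
        a[k] = A[i * DIM * DIM + k];
        b[k] = B[i * DIM * DIM + k];
    }

    // Plain triple loop; at this size the compiler can fully unroll it.
    for (int r = 0; r < DIM; ++r)
        for (int c = 0; c < DIM; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < DIM; ++k)
                acc += a[r * DIM + k] * b[k * DIM + c];
            C[i * DIM * DIM + r * DIM + c] = acc;
        }
}

// Launched as, e.g.:
//   small_gemm_per_thread<<<(count + 255) / 256, 256>>>(dA, dB, dC, count);
```

The matrix-per-threadblock variant for mid-sized matrices follows the same shape, but stages the two input matrices in `__shared__` memory and splits the output elements across the threads of the block.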