Optimising multiple calls to cuBLAS

Hi All,

I expect this question has been asked before but I’ve not been able to find a similar answer whilst digging through the forums and would like some advice before I go diving in to code.

I have an application in C++ using a library called Eigen to mimic the functionality provided by MatLab. I have a matrix expression similar to the form:

X = exp{ [ (a + A) * (b + B) ] ^ 2 }

Where uppercase are matrices and lowercase maybe scalars, vectors or matrices of compatible dimensions and the dimensions themselves may be large i.e. > 1000.

In a single threaded application I might write a function delegate to calculate a particular element of X and then iterate through each element of X to produce a result. But of course this is quite a chore of a large code-base with multiple formulas like the above.

In Eigen I can compose multiple matrix-matrix operations in a single expression and the library will lazily evaluate the result for me (in a similar way to LINQ), only when the result is needed. Reducing the number of temporary objects and the number of iterations to calculate X:
https://eigen.tuxfamily.org/dox/TopicLazyEvaluation.html

With which the code for my expression would be:

MatrixXd X = ( (a + A) * (b + B) * (a + A) * (b + B) ).exp();

If I now want to use the GPU, I assume I must use cuBLAS and multiple calls to gemm() to calculate the product and then the square whilst the addition and exponent operations I must handle myself separately. But this sounds highly inefficient as I am asking the GPU to run maybe 5 separate kernels whilst creating a large number of temporary intermediate matrices.

So what is best practice in this scenario or am I worrying about nothing?
Is there a way to use cuBLAS such that I only need to make one call to the GPU for a complex expression such as mine?
Should I take a similar approach to the single threaded application, i.e. ignore cuBLAS and calculate each element of X on a separate thread using my own function?

As I understand it, batching with gemmBatched() is not the answer as I’m not multiplying one array of matrices against another.

Thanks in advance.

no, generally there is not. For some specific expressions, there might be a “single call” cublas equivalent but you can pretty much figure that out yourself by reading what is available in the cublas manual.

CUBLAS is (mostly) a fairly straightforward implementation of a BLAS library. There are many BLAS implementations, so you could simply search for an answer for your question that would use a standard BLAS implementation. Once you have that, conversion to CUBLAS should be trivial.

But I don’t think you’ll find anyone saying you can do what you want with a single BLAS call either.

Rather than worry about whether such a thing can be done in a single call or not, I would just code up a comparative benchmark and see how BLAS/CUBLAS performs against Eigen or Matlab. I would expect both Eigen and Matlab may make use of underlying BLAS libraries.