 # Optimising multiple calls to cuBLAS

Hi All,

I expect this question has been asked before, but I've not been able to find a similar answer whilst digging through the forums, and I would like some advice before diving into code.

I have a C++ application that uses the Eigen library to mimic functionality provided by MATLAB. I have a matrix expression similar to the form:

X = exp{ [ (a + A) * (b + B) ] ^ 2 }

where the uppercase letters are matrices and the lowercase letters may be scalars, vectors or matrices of compatible dimensions, and the dimensions themselves may be large, i.e. > 1000.

In a single-threaded application I might write a function delegate to calculate a particular element of X and then iterate through each element of X to produce the result. But of course this is quite a chore for a large code-base with multiple formulas like the above.

In Eigen I can compose multiple matrix-matrix operations in a single expression and the library will lazily evaluate the result for me (in a similar way to LINQ), only when the result is needed, reducing both the number of temporary objects and the number of passes needed to calculate X:
https://eigen.tuxfamily.org/dox/TopicLazyEvaluation.html

With which the code for my expression would be:

```cpp
MatrixXd X = ( (a + A) * (b + B) * (a + A) * (b + B) ).exp();
```

If I now want to use the GPU, I assume I must use cuBLAS and multiple calls to gemm() to calculate the product and then the square, whilst handling the addition and exponent operations myself separately. But this sounds highly inefficient, as I am asking the GPU to run maybe five separate kernels whilst creating a large number of temporary intermediate matrices.

So what is best practice in this scenario or am I worrying about nothing?
Is there a way to use cuBLAS such that I only need to make one call to the GPU for a complex expression such as mine?
Should I take a similar approach to the single threaded application, i.e. ignore cuBLAS and calculate each element of X on a separate thread using my own function?

As I understand it, batching with gemmBatched() is not the answer as I’m not multiplying one array of matrices against another.