Matrix calculation on device

Hi there, my first question so thanks for your patience!

I am trying to perform a matrix calculation on the device. I am using CUDA 10, and if I could I would call cuBLAS on the device, but alas that has been deprecated…

The code segment that performs the matrix-vector product is:

// each thread (index thid) accumulates one row of K times CsMP
for (int km = 0; km < CELLS; km++)
{
    sum += K[thid][km] * CsMP[km];
}

Here K is an n*n matrix and CsMP is a vector of length n.

Any suggestions?

For operations on batches of small matrices it may be useful to roll your own code.

For our internal use we wrote warp-synchronous code (for matrices up to 32 wide) that stores the matrix columns in variables local to the threads of a warp. This reduces register pressure and allows several matrices to be held in registers at the same time. For common matrix operations we managed to get all loops fully unrolled, as our code uses template arguments for the matrix dimensions. Unfortunately, warp-synchronous programming has been deprecated with Volta and Turing, so our library will have to be revised to be future-proof.
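To illustrate only the template-argument part of this, here is a minimal sketch applied to the loop from the question. It is not the warp-synchronous library described above: it uses the simpler one-thread-per-row scheme. The names row_dot, matvec_kernel and MATVEC_HD are made up for this example, and K is assumed to be stored row-major in a flat array. The guards let the same file compile as plain C++ on the host, so the helper can be checked without a GPU.

```cpp
// Sketch only: compiles as plain C++ (host) or as CUDA with nvcc.
#ifdef __CUDACC__
#define MATVEC_HD __host__ __device__
#else
#define MATVEC_HD
#endif

// Dot product of one matrix row with a vector.
// N is a compile-time dimension, so the loop has a constant trip count
// and the compiler can unroll it completely.
template <int N, typename T>
MATVEC_HD inline T row_dot(const T* row, const T* vec)
{
    T sum = T(0);
#ifdef __CUDACC__
#pragma unroll
#endif
    for (int km = 0; km < N; ++km) {
        sum += row[km] * vec[km];
    }
    return sum;
}

#ifdef __CUDACC__
// One thread per output element: thread thid computes row thid of K * CsMP.
// Assumption: K is row-major, N * N elements in a flat array.
template <int N, typename T>
__global__ void matvec_kernel(const T* K, const T* CsMP, T* out)
{
    const int thid = blockIdx.x * blockDim.x + threadIdx.x;
    if (thid < N) {
        out[thid] = row_dot<N>(K + thid * N, CsMP);
    }
}
#endif
```

Because N is a template argument, each matrix size gets its own instantiation (for example matvec_kernel<16, float>), which trades some compile time and code size for loops the compiler can flatten.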

A generic device side API for matrix operations might be hard to come by. It is just too hard to optimize CUDA matrix operations for a wide range of use cases and matrix sizes while providing a single API.