Hi there,
after spending a lot of time reading CUDA tutorials, trying simple examples, and so on, I am finally ready to speed up an existing piece of source code on my own.
The basic structure is as follows:
for (a = 1..10)
{
    // do sthg
    for (b = 1..100)
    {
        // do sthg
        for (c = 1..100)
        {
            // do sthg
            multMatMat(A, B);   // BLAS3
            invMat(A);          // most computationally intensive task
            multMatVec(A, b);   // BLAS2
            multVecVec(a, b);   // BLAS1
        }
    }
}
The matrices are 80 x 80 and the vectors involved are 80 x 1; the elements themselves are complex floats. My initial plan was to use CUBLAS.
Now, from reading many of your posts I conclude that 80x80 is considered a rather small matrix in the GPU world. However, I think that despite these small sizes there must be a lot of speedup potential, since the for loops are executed very often (say about 100,000 times for a start, but it could be millions and more).
Of course, my concern is that there are some other computations in between these loops (not only matrix operations), which might impede a fully parallel solution. However, the results of one loop iteration do NOT depend on any other iteration. So there should be a lot of potential, no?
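To make that independence concrete: since no iteration depends on another, many of the small 80x80 products could in principle be grouped into one batched call instead of 100,000 tiny ones. A minimal sketch of what I have in mind, assuming a cuBLAS version that provides the batched routines (the helper function name and pointer arrays here are my own, and dA/dB/dC are assumed to already hold device pointers to 80x80 matrices on the GPU):

```c
#include <cublas_v2.h>
#include <cuComplex.h>

#define DIM   80      /* matrix dimension from the problem */
#define BATCH 10000   /* number of independent loop iterations per batch */

/* Sketch: instead of BATCH separate 80x80 multMatMat(A, B) calls,
 * hand cuBLAS an array of device pointers and let it process the
 * whole batch in a single launch. */
void batched_multMatMat(cublasHandle_t handle,
                        const cuComplex *const dA[],
                        const cuComplex *const dB[],
                        cuComplex *const dC[])
{
    const cuComplex one  = make_cuComplex(1.0f, 0.0f);
    const cuComplex zero = make_cuComplex(0.0f, 0.0f);

    /* C_i = A_i * B_i for i = 0..BATCH-1, all in one call */
    cublasCgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       DIM, DIM, DIM,
                       &one,  dA, DIM,
                              dB, DIM,
                       &zero, dC, DIM,
                       BATCH);
}
```

Whether the batched path actually beats hand-written kernels at this size is exactly what I am unsure about.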
So my real questions are:

Can somebody suggest a way to speed up the basic problem stated above?

I tried my luck with CUBLAS so far; what I am afraid of now is that CUBLAS is probably not suited for such small matrices. Do I therefore need to do it in "native CUDA" (specifying the grids/blocks and implementing matrix multiplication/inversion myself), and would this bring a benefit?

Is somebody aware of a way to do matrix inversion (or at least LU decomposition) in CUDA? From what I can see, CUBLAS cannot solve this issue by default. I read some threads about it in this forum, but none of them seem to have solved matrix inversion/LU decomposition just by using CUBLAS (or CUDA).
Any hints are appreciated!
Rgds,
Michael