Hi,
I have some experience writing numerical software, but none whatsoever on GPUs. I am thinking about porting the most computationally expensive part of my code from the CPU to the GPU, and naturally would like to get a feel for whether the GPU is the right platform for my task.
My code currently solves the following problem on the CPU about a million times, iterating with only a few concurrent threads. The parameter N in this description is about 50 to 100 (hence a midsize matrix). Single-precision accuracy is sufficient for my purposes.

Form an NxN real symmetric positive definite matrix. This step is very quick and can easily be programmed on the GPU, with no interaction with the CPU. The matrix is different for every iteration.

Invert the NxN matrix formed in the previous step.

Perform BLAS Level 2 operations on the inverted matrix, mostly multiplying it by a large number of vectors (about 10xN) of length N. The vectors are different for every iteration, but can be formed quickly on the GPU, with no interaction with the CPU.
All the iterations described above can run in parallel with each other.
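To make the three steps concrete, here is a minimal CPU reference sketch of one iteration in plain Python. It is only an illustration under my own assumptions: the function names are hypothetical, and instead of forming the explicit inverse it factors the matrix with Cholesky and solves against each vector, which gives the same products for a symmetric positive definite matrix.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a = L * L^T.

    Assumes `a` is a symmetric positive definite matrix given as a list of rows.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            l[i][j] = (a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l

def solve_spd(l, b):
    """Solve (L * L^T) x = b, i.e. compute inverse(A) * b without forming the inverse."""
    n = len(l)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

def one_iteration(a, vectors):
    """Hypothetical single iteration: factor once, then apply to every vector."""
    l = cholesky(a)
    return [solve_spd(l, v) for v in vectors]
```

For example, `one_iteration([[4.0, 2.0], [2.0, 3.0]], [[2.0, 1.0]])` gives approximately `[[0.5, 0.0]]`. In the real problem the matrix would be NxN and there would be about 10xN vectors per iteration.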
Based on my [very limited] understanding of the NVIDIA architecture, I conclude that handling a matrix of size 50 to 100 in a single GPU thread is impractical: the limits on shared memory size won't let a multiprocessor run several such threads concurrently.
At the same time, a square matrix of size 50 to 100 is too small to be handled efficiently by the GPU if the standard CUDA BLAS is used.
Are these concerns valid? If so, I would love to hear hints on how this problem can be attacked and what kind of performance gain (over modern CPUs) I should expect.
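The shared-memory concern can be put in rough numbers with the small sketch below. The 16 KB shared-memory budget per multiprocessor is my assumption about the hardware, not a figure I have verified for any particular device, and the helper names are hypothetical.

```python
def matrix_bytes(n, bytes_per_element=4):
    """Storage for a dense n x n matrix; 4 bytes per element for single precision."""
    return n * n * bytes_per_element

def matrices_per_multiprocessor(n, shared_bytes=16 * 1024):
    """How many full n x n matrices fit in the ASSUMED shared-memory budget."""
    return shared_bytes // matrix_bytes(n)

for n in (50, 100):
    print(n, matrix_bytes(n), matrices_per_multiprocessor(n))
# prints:
# 50 10000 1
# 100 40000 0
```

Under that assumption, one N = 50 matrix fits per multiprocessor and an N = 100 matrix does not fit at all, which is why keeping a whole matrix per thread in shared memory looks impractical to me.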
Thank you in advance for your help,
Cudesnick.