Mid-size matrix arithmetic in CUDA: is CUDA appropriate?


I have some experience writing numerical software, but none whatsoever for GPUs. I am thinking about porting the most computationally expensive part of my code from the CPU to the GPU, and naturally would like to get a feel for whether the GPU is the right platform for my task.

My code solves the following problem on the CPU, iteratively (concurrently within only a few threads), about a million times. The parameter N in this description is about 50–100 (hence "mid-size matrix"). Single-precision accuracy is sufficient for my purposes.

  1. Form an N×N real symmetric positive definite matrix. This step is very quick and can easily be programmed on the GPU, with no interaction with the CPU. The matrix is different for every iteration.

  2. Invert the N×N matrix formed at the previous step.

  3. Perform BLAS level-2 operations on the inverted matrix, mostly multiplication by a large number (about 10×N) of vectors of length N. The vectors are different for every iteration, but can be quickly formed on the GPU, with no interaction with the CPU.

All the iterations described above can run in parallel with each other.

Based on my [very limited] understanding of the NVIDIA architecture, I conclude that handling a matrix of size 50–100 in a single GPU thread is impractical: the limited shared memory won't let a multiprocessor run several such threads concurrently.

At the same time, a square matrix of size 50–100 is too small to be handled effectively by the GPU if the standard CUDA BLAS is used.

Are these concerns valid? If so, I would love to hear hints on how this problem can be attacked and what kind of performance gain (over modern CPUs) I should expect.

Thank you in advance for your help,


Perhaps letting each block handle one matrix would be a good fit. Threads in the block can communicate intermediate results through shared memory, although a single-precision matrix of size 100x100 will not fit into shared memory all at once, so you will have to stage pieces of the matrix in from global memory. Doing one matrix per block also makes it easier to issue coalesced reads and writes.

That’s definitely a valuable idea.

Could you provide an off-the-top-of-your-head estimate of the performance I should expect from this approach?


Unfortunately, I don’t have a good sense of whether this is going to be compute-bound or memory-bound.

If each matrix fit entirely within shared memory, I would suspect the problem would be compute-bound, and you could estimate the best-case performance from the number of GFLOPS your card can do. However, if you have to use global memory for intermediate results (since they won't all fit in shared memory), then the kernel might be memory-bound, and you could work out the time from the total GB/s bandwidth of your card.