Is there a way to get an idea of how big the shared memory of a certain GPU is by expressing it in terms of "real problems"? For example, let's say "you can solve a linear system of N equations in N variables", where N is unknown and limited by the shared memory. I want to know N for, e.g., a GTX 460, so that I know that for a greater N the solution will be computed (at least partially) in global memory, which is slower…
There isn't much mystery about the shared memory size: it is either 16 KB or 48 KB per multiprocessor, depending on which hardware you are using and how it has been configured.
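To make the question about N concrete anyway: if the whole N×N coefficient matrix had to fit into one multiprocessor's 48 KB shared memory as single-precision floats, a back-of-the-envelope calculation (a sketch only; real solvers don't work this way, as explained below) gives:

```python
import math

# Shared memory per multiprocessor on Fermi (e.g. GTX 460), configured to 48 KB
shared_mem_bytes = 48 * 1024
bytes_per_float = 4  # single precision

# An N x N coefficient matrix needs N*N floats; solve N*N*4 <= 48 KB for N
n_max = int(math.sqrt(shared_mem_bytes / bytes_per_float))
print(n_max)  # -> 110
```

So a naive "everything in shared memory" solver would top out around N ≈ 110, which is exactly why nobody writes solvers that way.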
Beyond that I don't really understand the rest of the question. Shared memory is per-multiprocessor scratch memory which can be used for sharing and reusing data between threads within a block. It is almost universal that an entire input data set won't fit into shared memory, and for this reason it is also almost universal that algorithms are implemented "tile wise", "sub-domain wise", or "block wise". The shared memory size dictates the tile/sub-domain/block size, not the maximum admissible problem size.
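The tiling idea can be illustrated on the CPU with NumPy (a sketch only; `TILE` and `tiled_matmul` are illustrative names, not anything from the SDK). Each output tile is accumulated from pairs of input tiles, mimicking how a CUDA kernel stages tiles through shared memory. Note that only `TILE` is bounded by the shared memory size; the matrix dimension `n` can be as large as global memory allows:

```python
import numpy as np

TILE = 16  # tile edge, sized to fit in shared memory, independent of problem size

def tiled_matmul(A, B):
    """Block-wise matrix multiply of square matrices whose dimension is a
    multiple of TILE. On a GPU, each (TILE x TILE) slice of A and B would be
    loaded into shared memory before the inner accumulation."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            for k in range(0, n, TILE):
                # One tile-pair contribution to output tile (i, j)
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)
```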
If you are having difficulty understanding how shared memory can be used in this way, I highly recommend several of the examples in the SDK: transpose, matrixMul, reduction, and FDTD3d. The first three include very useful papers which describe the algorithms and the thinking behind the GPU code design. Between them you get to see four very different uses for shared memory, all of which allow data reuse and intra-block communication in this sort of "tile wise" algorithm.
What I haven't found yet is a matrix inversion algorithm… Is there an example of matrix inversion somewhere?
I wonder whether matrix inversion can be done in shared memory by splitting it into tiles, because there is much more data dependency than in something like vector reduction or matrix addition…?
You can find good CUDA versions of the three most common matrix factorization routines here (there are many others floating around too). The basic structure follows "look ahead" versions of the blocked factorization algorithms found in LAPACK. They can be built out of level 3 BLAS functions like gemm and syrk. So the idea isn't to do the whole factorization in a single kernel, but in multiple, overlapping operations that factorize the matrix a block at a time. CUBLAS includes triangular solvers for solving systems with the result of one of these factorizations.
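To show the structure being described, here is a NumPy sketch of a right-looking blocked LU factorization (a sketch only: no pivoting and no look-ahead, both of which real codes have; `NB` and `block_lu` are illustrative names). The trailing-submatrix update is a plain matrix multiply, which is where cuBLAS gemm would do the heavy lifting on the GPU:

```python
import numpy as np

NB = 4  # block size, analogous to the panel width in a LAPACK-style code

def block_lu(A):
    """Right-looking blocked LU without pivoting. Returns L and U packed
    into one matrix (unit diagonal of L implied)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, NB):
        e = min(k + NB, n)
        # 1. Factor the diagonal block in place (unblocked elimination)
        for j in range(k, e):
            A[j+1:e, j] /= A[j, j]
            A[j+1:e, j+1:e] -= np.outer(A[j+1:e, j], A[j, j+1:e])
        # 2. Triangular solves for the off-diagonal panels (trsm-like)
        L_kk = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
        U_kk = np.triu(A[k:e, k:e])
        A[k:e, e:] = np.linalg.solve(L_kk, A[k:e, e:])          # U panel
        A[e:, k:e] = np.linalg.solve(U_kk.T, A[e:, k:e].T).T    # L panel
        # 3. Trailing submatrix update (this is a gemm)
        A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

# Check: L @ U reconstructs M (diagonally dominant, so pivoting is not needed)
M = np.random.rand(16, 16) + 16 * np.eye(16)
LU = block_lu(M)
L = np.tril(LU, -1) + np.eye(16)
U = np.triu(LU)
assert np.allclose(L @ U, M)
```

The point is that each pass only ever "owns" one NB-wide panel, so the working set per step is small even though the whole matrix lives in (global) memory.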
Okay, thanks so far; is my assumption right that all these operations are referring to shared memory?
No. Shared memory only has a lifetime of a single block in a single kernel launch. Internally, some of the BLAS calls are undoubtedly using shared memory, but the matrix being factorized is in global memory. There is no other way to do it in CUDA.
Okay, thanks again for your answer; that's because the complete data set is dependent on the rest of the data, right?