I have a problem where I need to calculate a pseudo-inverse in each thread. The matrix is of size 200 x 20 and stored in constant memory. I first want to modify each column (differently for each thread, but using the same algorithm) and then calculate the product between the pseudo-inverse of the modified matrix and a vector of size 200 x 1, which results in a vector of size 20 x 1. Is it possible to calculate the pseudo-inverse in each thread? Actually, I don't really need the pseudo-inverse itself, only the final vector of size 20 x 1. My feeling is that the matrix is too large for one GPU thread.
Try to vectorize your column modification; hopefully you can turn the problem into a form where you can use blockwise conjugate gradient (which needs at most 20 iterations to converge, since the resulting system has only 20 unknowns).
What about using dynamic parallelism, provided that you have a card with compute capability >= 3.5?
The actual calculations are not a problem, but the pseudo-inverse is of size 20 x 20, thus requiring 400 registers per thread. To my knowledge, only 63 registers per thread are available unless you have a card with compute capability 3.5. The remaining values will spill into local memory, which is very slow.
This is why you should vectorize the operation over multiple threads: the 400 values can then be spread across several threads, which hold them collectively in their registers.
Yes, but then I need to store the values in shared memory, right? With 48 KB of shared memory and 400 floats (1,600 bytes) per problem, I can only run ~30 threads on each multiprocessor, which will give terrible performance, since the maximum is 1536-2048 threads per MP.
How often does the matrix change whose pseudo-inverse is to be computed? How many different matrices are there? Is it feasible to pre-compute the pseudo-inverse on the host, and then store the result in constant memory as well?
If you can give a more detailed explanation of the code, people here will almost certainly provide you with a better approach :)
There are 250,000 matrices for which I need the pseudo-inverse of the modified matrix. I need to store both the modified matrix and its pseudo-inverse. Using floats, the 250,000 modified matrices of size 200 x 20 alone require about 3.8 GB, and the pseudo-inverses the same again, so roughly 7.5 GB in total. It would also take a really long time to calculate all the pseudo-inverses on the host.
I am afraid I am not catching on as to what the exact use case is here. If there are 250,000 different matrices, how are they stored in constant memory? How many right-hand-side vectors are applied to each computed pseudo-inverse?
For batches of small matrices, the choice is typically between matrix-per-thread and matrix-per-thread-block. As noted, the matrix seems to be too big for the matrix-per-thread approach. So one approach worth trying is a thread block of 20x20 threads, where each thread block reads the original matrix from global memory, then stores the 20x200 pseudo-inverse in shared memory. Each thread block then processes the desired number of right-hand sides with that cached pseudo-inverse. With each element a float, each pseudo-inverse requires 16,000 bytes, which allows three concurrently running thread blocks per SM; that seems reasonable from a performance perspective.
I assume you are aware that using the pseudo-inverse instead of a solver can cause serious numerical issues, as forming A^T * A squares the condition number. If the final result does not require much accuracy, it could still be a good approach, and it will be much faster if you can amortize the cost of computing the pseudo-inverse across many right-hand-side vectors.
There is ONE original matrix of size 200 x 20, stored in constant memory. This matrix needs to be modified in 250,000 different ways, and I then need to solve a least-squares problem using each modified matrix and 250,000 unique right-hand sides of size 200 x 1. Thus, each pseudo-inverse solves only ONE of the 250,000 problems.
Yes, I know that the pseudo-inverse is not optimal numerically, but doesn't a QR approach require even more registers?
From what you describe, the run time will be dominated by computing the pseudo-inverse. I still think that an approach that uses a thread block of 20x20 threads and 16K of shared memory per thread block is the way to go. Each intermediate matrix will fit in shared memory.
If you look at the total amount of work required for the pseudo-inverse approach, QR may be an attractive alternative, and should simplify the computation. QR factorization can be done in place, using a 200x20 matrix stored in shared memory and operated on by one thread block. Since you only need to solve one right-hand side per system, you do not have to worry about how to store Q. Seems like an excellent fit.
You can simply concatenate the RHS vector to the matrix on the right, and perform QR factorization with the concatenated RHS participating in the “panel updates”. So starting out with [A|b] you wind up with [R|c], where R is upper triangular, and then just need to back-solve Rx=c. For good numerical results, you would want to use either Householder reflections or Givens rotations for the QR. Note that the availability of fast rsqrt() can make Givens rotations quite attractive; I have found them to be the method of choice when the aspect ratio is smaller than in your case. For a 200x20 matrix I would guess Householder reflections are more appropriate, but I haven’t worked with matrices of that size.
But as I wrote earlier, the number of threads/pseudo-inverses would be 30 per MP. For a GTX 680, isn't that an occupancy of about 1.5%?
I am not sure how you arrived at that number. When you multiply A^T * A, the resulting matrix has 20x20 elements. If you use a 2-D thread block of 20x20 threads, where each thread computes one element of the 20x20 matrix, that is 400 threads. Three of these thread blocks can run concurrently on each SM. You would then invert the 20x20 matrix, e.g. by using Gauss-Jordan. Certainly not all of the 400 threads will be busy during this stage, but on average half of them will be. The final step is to multiply the 20x20 inverse with the 20x200 A^T, which again can use all threads in the thread block.
Likewise, during a QR factorization using Householder reflections, each step involves a rank-1 update, where each element of the panel can be updated by one thread, providing ample parallelism. It is true that there will be limited parallelism in other phases of each step, such as the norm computation, where a number of threads will be idle.