Googling “small matrices” yields this forum thread fairly high up in the results list, so I’ll hazard a guess that there’s a substantial number of people interested in similar problems. I recently ran some tests on LU factorization with partial pivoting, so I thought I’d share my results.

**1. 1-thread per Matrix approach**

First and foremost, having each thread invert one matrix is a terrible, terrible idea. The CPU will dance circles around the GPU. It would appear that such an approach is doomed by the numerous reads and writes to global memory.

**2. Multiple threads per Matrix approach**

I took the advice of some members of this forum, including Ceearem above, to have one block of threads collaboratively invert one matrix. The matrix for one block is kept entirely in shared memory (this is essential). For an NxN matrix, I chose N threads per block, the threads sweeping through the matrix row by row. This works waaay better than the 1-thread per matrix approach.

Code for LU factorization of a real matrix is posted below, and the extension to complex matrices is straightforward. Unfortunately it’s highly divergent, and every time it advances one row there’s one more thread with nothing to do. I would appreciate it if anyone has suggestions on how to improve on this.

**3. GPU vs CPU results**

I used a Quadro 600 GPU, and one core of an Intel Xeon 5680 (3.3 GHz). Maybe I’m being unfair in comparing a mid-class GPU to a very high-end CPU, but that’s what I could get my hands on. The test is LU decomposition of 50,000 randomly generated 32x32 matrices. The best execution time out of three runs is reported below.
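For reference, the best-of-three timing methodology on the CPU side can be sketched in Python (a hypothetical SciPy analogue of the MATLAB loop in section 5 — I actually timed MATLAB, so the numbers here won’t match; `scipy.linalg.lu_factor` calls LAPACK’s getrf, i.e. LU with partial pivoting):

```python
# Hypothetical SciPy analogue of the MATLAB timing loop: best-of-three
# wall-clock time for LU-factorizing many small single-precision matrices.
# Matrix count reduced from 50,000 for brevity.
import time
import numpy as np
from scipy.linalg import lu_factor

rng = np.random.default_rng(0)
A = rng.random((2000, 32, 32)).astype(np.float32)

times = []
for _ in range(3):
    t0 = time.perf_counter()
    for n in range(A.shape[0]):
        lu_factor(A[n])          # LU with partial pivoting (LAPACK getrf)
    times.append(time.perf_counter() - t0)

print(f"best of three: {min(times):.3f}s")
```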

Quadro 600 (CUDA kernel running in PyCUDA)

single precision LU decomposition, real matrix: 1.06s

single precision LU decomposition, complex matrix: 2.27s

Xeon 5680, one core (MATLAB 2009a)

single precision LU decomposition, real matrix: 0.86s

single precision LU decomposition, complex matrix: 1.37s

Some caveats: the Quadro 600 was probably doing some stuff keeping the desktop graphics running, so the competition may have been stacked against it. For the CPU, I also tried LU decomposition in Fortran, but it was actually slower than MATLAB; the reason is probably that MATLAB uses ATLAS, while I think my LAPACK links the netlib reference BLAS.

**4. Conclusions**

Granted, my code is probably non-optimal, but there doesn’t seem to be any advantage to using a GPU for this sort of problem. I think I am limited primarily by the number of multiprocessors right now, so I’m sure much better performance could be attained on a more powerful GPU; the Quadro 600 has 2 multiprocessors, whereas a deluxe GPGPU like the Tesla C2070 has 14, so I’d expect somewhere in the vicinity of a 6-7 fold improvement at best. But if I harnessed all 6 cores of the Xeon via, say, OpenMP, the performance would be similar. In any event, there’s not going to be the orders-of-magnitude speedup seen with large matrix computations.
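The back-of-the-envelope projection above works out as follows (assuming, optimistically, that runtime scales inversely with multiprocessor count on the GPU and core count on the CPU):

```python
# Projected best-case times under perfect scaling with
# multiprocessor count (GPU) and core count (CPU).
gpu_time, gpu_sms, c2070_sms = 1.06, 2, 14   # Quadro 600 result vs. Tesla C2070
cpu_time, cpu_cores = 0.86, 6                # Xeon 5680, one core -> all six

gpu_projected = gpu_time * gpu_sms / c2070_sms
cpu_projected = cpu_time / cpu_cores
print(f"C2070 projection: {gpu_projected:.2f}s")       # ~0.15s
print(f"6-core Xeon projection: {cpu_projected:.2f}s")  # ~0.14s
```

So even on a top-end GPGPU of this generation, the projected time is about a dead heat with the fully-loaded CPU.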

Comments and second opinions would be welcome.

**5. Code**

CUDA Kernel:

```
// Launched from PyCUDA with 32 threads per block (blockDim.x = 32)
// and a 1000 x 50 grid, i.e. one block per matrix (1000*50 = 50,000).
// sz = 32 is the matrix dimension; defined here so the kernel is
// self-contained (in my PyCUDA run it was substituted into the source).
#define sz 32

__global__ void LUfactor(float *AG, int *PG) {
    // linear block index (note gridDim.x, not blockDim.x)
    int bx = blockIdx.x + blockIdx.y*gridDim.x;
    int tx = threadIdx.x;
    __shared__ float A[sz][sz];
    __shared__ int piv[sz];
    int m, n, tmp;
    __shared__ float v;

    // collaboratively load the matrix into shared memory
    for (n = 0; n < sz; n++) {
        A[n][tx] = AG[bx*sz*sz + n*sz + tx];
    }
    piv[tx] = tx;
    __syncthreads();

    // LU decomposition
    for (n = 0; n < sz; n++) {
        // pivot search down column n (single thread)
        if (tx == 0) {
            for (m = n+1; m < sz; m++) {
                if (fabsf(A[piv[m]][n]) > fabsf(A[piv[n]][n])) {
                    tmp = piv[n];
                    piv[n] = piv[m];
                    piv[m] = tmp;
                }
            }
            v = 1.0f/A[piv[n]][n];
        }
        __syncthreads();
        // L-column normalization
        if (tx > n) {
            A[piv[tx]][n] *= v;
        }
        __syncthreads();
        // trailing-submatrix update (thread tx owns column tx)
        for (m = n+1; m < sz; m++) {
            if (tx > n) {
                A[piv[m]][tx] -= A[piv[m]][n]*A[piv[n]][tx];
            }
        }
        // make the updated columns visible to the next pivot search
        __syncthreads();
    }

    // copy back to global memory
    for (n = 0; n < sz; n++) {
        AG[bx*sz*sz + n*sz + tx] = A[n][tx];
    }
    PG[bx*sz + tx] = piv[tx];
}
```
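To sanity-check the kernel’s output on the host, here is a sketch of the same algorithm in plain NumPy (my own re-implementation, not part of the original test code): it uses the same pivot-index vector instead of physically swapping rows, so the returned `A` and `piv` should match what the kernel writes to `AG` and `PG`.

```python
import numpy as np

def lu_pivot_vector(A):
    """In-place LU factorization with partial pivoting, mirroring the
    CUDA kernel above: rows are never swapped; the permutation is
    tracked in a pivot-index vector instead. Returns (A, piv)."""
    sz = A.shape[0]
    piv = np.arange(sz)
    for n in range(sz):
        # pivot search down column n (same scan as the tx==0 branch)
        for m in range(n + 1, sz):
            if abs(A[piv[m], n]) > abs(A[piv[n], n]):
                piv[n], piv[m] = piv[m], piv[n]
        # L-column normalization
        A[piv[n + 1:], n] /= A[piv[n], n]
        # trailing-submatrix update
        A[np.ix_(piv[n + 1:], np.arange(n + 1, sz))] -= \
            np.outer(A[piv[n + 1:], n], A[piv[n], n + 1:])
    return A, piv
```

Reading the factors back out: row `piv[n]` of `A` holds row `n` of the combined L/U storage, so with `B = A[piv]`, `np.tril(B, -1) + I` is L, `np.triu(B)` is U, and `L @ U` reconstructs the input with its rows in pivot order.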

MATLAB code:

```
% real test:
A = rand(32,32,50000,'single');
% complex test (run instead of the line above):
% A = rand(32,32,50000,'single') + single(i)*rand(32,32,50000,'single');
tic;
for n = 1:50000
    % one-output lu returns the combined L and U factors
    A(:,:,n) = lu(A(:,:,n));
end
toc
```