Lots of small matrices

Teslas and Quadros are more thoroughly tested and come with support from NVIDIA. For consumers that is not a big deal. For (some/most) companies it is a big deal: a few hours of extra work because of a non-working card already cost more than a Tesla itself.

Sure, I just felt too tired yesterday to post it.

On my 9600M GT the kernel performs best with blockDim.y=3, where it takes 0.536 seconds. For reasons entirely unclear to me, the original kernel today needs 1.50 seconds (all measurements repeated 3 times with identical results) - does the speed depend on temperature?

EDIT: After one more micro-optimization we’re down to 0.509s, though I have the feeling that a completely different approach might still be faster.

__global__ void LUfactor(float *AG, int *PG)   // signature reconstructed from the arguments used below
{
    int bx = blockIdx.x + blockIdx.y * gridDim.x;   // linear block index = index of this block's matrix
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    __shared__ float A[sz][sz+1];   // add 1 to avoid bank conflicts
    __shared__ int piv[sz];
    volatile __shared__ float v;    // not protected by __syncthreads()
    volatile __shared__ float max_abs[2 * sz];   // 2*sz so the tx+16 read below stays in bounds
    volatile __shared__ int newpiv;

    // collaboratively load matrix into shared memory
    for (int n = ty; n < sz; n += blockDim.y) {
        A[n][tx] = AG[bx * sz * sz + n * sz + tx];
    }
    piv[tx] = tx;   // will be executed by multiple warps, but does not matter
    __syncthreads();

    // LU decomposition
    for (int n = 0; n < sz; n++) {
        // pivoting
        if (ty == 0) {
            float absA = fabsf(A[piv[tx]][n]);
            if (tx < n)   // || (tx >= sz)  for sz < 32
                absA = -1.0f;   // can never become pivot element

            // use warp-synchronous programming:
            max_abs[tx] = absA;
            max_abs[tx] = max(max_abs[tx], max_abs[tx+16]);
            max_abs[tx] = max(max_abs[tx], max_abs[tx+8]);
            max_abs[tx] = max(max_abs[tx], max_abs[tx+4]);
            max_abs[tx] = max(max_abs[tx], max_abs[tx+2]);
            max_abs[tx] = max(max_abs[tx], max_abs[tx+1]);
            if (absA == max_abs[0])
                newpiv = tx;   // might be executed by multiple threads, but only one will succeed

            if (tx == 0) {
                int tmp = piv[n];
                piv[n] = piv[newpiv];
                piv[newpiv] = tmp;
                v = 1.0f / A[piv[n]][n];
            }
        }

        // L block normalization
        if (ty == 0) {
            if (tx > n) {
                A[piv[tx]][n] *= v;
            }
        }
        __syncthreads();

        // reduction
        if (tx > n) {
            for (int m = n + 1 + ty; m < sz; m += blockDim.y) {
                A[piv[m]][tx] -= A[piv[m]][n] * A[piv[n]][tx];
            }
        }
        __syncthreads();
    }

    // copy back to global memory
    for (int n = ty; n < sz; n += blockDim.y) {
        AG[bx * sz * sz + n * sz + tx] = A[n][tx];
    }
    if (ty == 0)
        PG[bx * sz + tx] = piv[tx];
}
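For reference, here is a minimal host-side sketch of how a kernel like the one above could be launched and timed while sweeping blockDim.y, as in the measurements quoted in this thread. The batch size, the sweep range and the kernel name are assumptions for illustration; only the <<<grid, block>>> shape and the cudaEvent timing pattern matter.

// Minimal benchmarking sketch (assumptions: sz = 32, a batch of 50000
// matrices as in the tests above, and the kernel name/signature used in
// this thread). Only the launch configuration and event timing matter.
#include <cstdio>
#include <cuda_runtime.h>

#define sz 32

__global__ void LUfactor(float *AG, int *PG);   // kernel as posted above

int main()
{
    const int batch = 50000;
    float *AG;
    int   *PG;
    cudaMalloc(&AG, (size_t)batch * sz * sz * sizeof(float));
    cudaMalloc(&PG, (size_t)batch * sz * sizeof(int));
    // ... fill AG with the test matrices ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int y = 1; y <= 8; y++) {      // sweep blockDim.y as discussed above
        dim3 block(sz, y);              // 32 threads wide, y warps deep
        dim3 grid(batch);               // one matrix per block (a 1D grid is enough here)
        cudaEventRecord(start);
        LUfactor<<<grid, block>>>(AG, PG);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("blockDim.y = %d: %.3f s\n", y, ms / 1000.0f);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(AG);
    cudaFree(PG);
    return 0;
}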

Okay, so with the revised code (posted below) I recorded execution time vs. N with CUDA and with MATLAB. Tera’s code runs in 0.44s on my system for the 50000x32x32 test; my code runs in 0.55s. Mine is slower, I’m guessing, because my pivot search isn’t unrolled for N=32 and I didn’t implement the final micro-optimization (which is unpredictable when there are two equal pivot candidates in the column?). I found that blockDim.y=5 worked best for the 32x32 case, so I stuck with that. Changing it to blockDim.y = 6 or 7 gave somewhat better results for N=64 (1.58s vs. 1.81s), but I didn’t feel inclined to tune for so many cases.

[attachment=20749:gpu_vs_cpu.png]

The general trend seems to be that the GPU has a big advantage at the smaller matrix sizes, but this vanishes at larger sizes (somewhere around N=50 the two are even). I’m guessing this reflects the shared memory resources per multiprocessor getting strained by the larger matrices. Would this be correct?

So, a summary so far: Avidday has shown that the one-thread-per-matrix approach also works well, with interleaved matrix storage in global memory (to facilitate coalesced reads/writes) and the calculations done in shared memory. Jimmy has shown me that performance doesn’t just scale with the number of cores, since he got an 8x improvement over my original post using a GPU with only twice the cores; I think the clock speed of a GTX 460M is about double that of a Quadro 600, so there’s another factor of 2. Njuffa and Tera pointed out various bottlenecks in my original post, and fixing these squeezes out another factor of two or so. So 2 (more cores) x 2 (higher clock) x 2 (fixing my bad code) = 8, accounting for the discrepancy with Jimmy… I think.
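To make the "interleaved storage" point concrete, here is a rough sketch of the indexing (this is not avidday's actual code, and the kernel name is hypothetical). With a batch-major layout, element (i,j) of matrix b lives at AG[(i*sz + j)*batch + b], so when each thread of a warp works on its own matrix, neighbouring threads touch neighbouring addresses and the global loads/stores coalesce. Where the per-thread working copy lives (registers, local memory, or a slice of shared memory) is a separate choice.

// Sketch of one-thread-per-matrix with interleaved (batch-major) storage.
// Hypothetical kernel name; sz is the compile-time matrix size as above.
__global__ void oneThreadPerMatrix(float *AG, int batch)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // one matrix per thread
    if (b >= batch) return;

    float A[sz][sz];   // per-thread working copy (registers / local memory)

    // Load: for fixed (i,j), consecutive b are consecutive addresses -> coalesced.
    for (int i = 0; i < sz; i++)
        for (int j = 0; j < sz; j++)
            A[i][j] = AG[(i * sz + j) * batch + b];

    // ... factorize A here, one matrix per thread ...

    // Store back in the same interleaved layout.
    for (int i = 0; i < sz; i++)
        for (int j = 0; j < sz; j++)
            AG[(i * sz + j) * batch + b] = A[i][j];
}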

My understanding right now is that the GPU excels at enormous matrices and, ironically, at really small matrices. For the sizes in between, the advantage of one over the other is less clear.

Well, in any event, I think this particular forum topic has been flogged to death, barring some algorithmic cleverness beyond parameter tuning. Thanks again for all the info!

__global__ void LUfactor(float *AG, int *PG) {
    int bx = blockIdx.x + blockIdx.y*blockDim.x;   // bug: should be gridDim.x (see the correction further down); harmless with a 1D grid
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    __shared__ float A[sz][sz + !(sz%16)];
    /* !(sz%16) is added to avoid bank conflicts during pivoting
       when the matrix size is a multiple of 16 */
    __shared__ int piv[sz];
    volatile __shared__ float max_abs[sz];
    volatile __shared__ int max_idx[sz];
    int m, n;
    volatile __shared__ float v;

    // collaborative load to shared memory
    for (n = ty; n < sz; n += blockDim.y) {
        A[n][tx] = AG[bx*sz*sz + n*sz + tx];
    }
    if (ty == 0) {
        piv[tx] = tx;
    }
    __syncthreads();

    // LU decomposition
    for (n = 0; n < sz; n++) {
        if (ty == 0) {
            // pivoting
            if (tx >= n) {
                max_idx[tx-n] = tx;
                max_abs[tx-n] = fabsf(A[piv[tx]][n]);
            }

            m = sz - n;   // no. of elements to search in current column
            while (m > 1) {
                if (tx < m>>1) {
                    // parentheses matter here: + binds tighter than & and >>
                    if (max_abs[tx + (m&1)] < max_abs[tx + ((m+1)>>1)]) {
                        max_abs[tx + (m&1)] = max_abs[tx + ((m+1)>>1)];
                        max_idx[tx + (m&1)] = max_idx[tx + ((m+1)>>1)];
                    }
                }
                (++m) >>= 1;   // m = ceil(m/2)
                //__syncthreads();
            }

            if (tx == 0) {
                m = piv[n];
                piv[n] = piv[max_idx[0]];
                piv[max_idx[0]] = m;
                v = 1.0f/A[piv[n]][n];
            }

            // L block normalization
            if (tx > n) {
                A[piv[tx]][n] *= v;
            }
            __syncthreads();
        }

        // reduction
        if (tx > n) {
            for (m = n+1+ty; m < sz; m += blockDim.y) {
                A[piv[m]][tx] -= A[piv[m]][n]*A[piv[n]][tx];
                //A[m][tx] = max_abs[sz-1];
            }
        }
        __syncthreads();
    }

    // copy back to global memory
    for (n = ty; n < sz; n += blockDim.y) {
        AG[bx*sz*sz + n*sz + tx] = A[n][tx];
    }
    if (ty == 0) {
        PG[bx*sz + tx] = piv[tx];
    }
}

So, just to continue flogging this topic to death :)

We see a pattern where one matrix per block starts to lose performance as the matrices become larger. At that point it might be interesting to change the parallelization scheme a bit. How about going towards a few threads per matrix (1-4, whatever works best) once the matrices grow and reach a critical size?

Yes, behavior is undefined in that case. I am undecided myself, as for my own codes I certainly insist on reproducible results. For the sz=32 case the results seemed stable though, and as the code looks so much nicer I decided to present it here.

Yeah, as the shared memory constraint allows fewer and fewer blocks per SM, we can have more and more warps per block without (further) limiting the number of blocks.
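To put rough numbers on that (counting only the padded shared A tile from the kernels above, and assuming the 48 KB of shared memory per SM on sm_2x):

    sz = 32:  32 * 33 * 4 B ≈ 4.1 KB  -> shared memory alone would allow 11 blocks per SM (the 8-blocks-per-SM limit kicks in first)
    sz = 64:  64 * 65 * 4 B ≈ 16.3 KB -> only 2 blocks fit per SM

so at the larger sizes, adding warps to a block no longer reduces the number of resident blocks.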

Yes, I’d think so too. The CPU has an advantage there because of its larger caches.

I think this reflects the memory hierarchy: at huge sizes, the GPU excels through its roughly 10x higher memory bandwidth. At small sizes, the multiple SMs (and execution units) give it a large bandwidth advantage over the CPU as well, but there is an intermediate regime where the matrices still fit into the CPU caches but not into on-chip memory on the GPU.

Interesting generalization. I had anticipated needing a table of optimal padding sizes, but using sz | 1 seems to be enough to avoid bank conflicts for arbitrary sizes.
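A small toy sketch of that generalization (the kernel name and the example size are mine, just for illustration): padding the row length to an odd value means the column accesses made during pivoting stride through shared memory with an odd stride, which is coprime to the 16 banks on sm_1x and the 32 banks on sm_2x, so the threads of a (half-)warp hit distinct banks for any sz.

// Toy illustration of the (sz | 1) padding; 24 is an arbitrary example size.
#define sz  24
#define PAD (sz | 1)   // sz if sz is odd, sz + 1 if sz is even -> always odd

__global__ void columnAccessDemo(float *out, int n)
{
    __shared__ float A[sz][PAD];   // replaces A[sz][sz + !(sz%16)] above
    int tx = threadIdx.x;

    // Fill the tile so the demo is self-contained.
    for (int i = tx; i < sz * PAD; i += blockDim.x)
        (&A[0][0])[i] = (float)i;
    __syncthreads();

    // Column read with stride PAD (odd) -> bank-conflict-free for any sz.
    if (tx < sz)
        out[tx] = A[tx][n];
}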

I’ve found that exchanging rows explicitly (while still keeping the piv vector so we can return it) is slightly beneficial. It would also be advantageous for my proposal below.
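For reference, roughly what that could look like in the pivoting section of the kernel posted earlier (a sketch under my reading, not tera's actual code): once newpiv is known, the ty == 0 warp physically swaps row n with the pivot row, one column per thread, and piv[] is still updated so the permutation can be returned. Later accesses can then drop the piv[] indirection and use A[m][...] directly.

// Sketch: explicit row exchange inside the ty == 0 pivoting block.
// Variables (A, piv, newpiv, v, tx, n) are those of the kernel above.
if (newpiv != n) {
    float tmp     = A[n][tx];          // each of the 32 threads swaps one column
    A[n][tx]      = A[newpiv][tx];
    A[newpiv][tx] = tmp;
}
if (tx == 0) {
    if (newpiv != n) {                 // still record the permutation
        int t       = piv[n];
        piv[n]      = piv[newpiv];
        piv[newpiv] = t;
    }
    v = 1.0f / A[n][n];                // pivot row is now physically row n
}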

Indeed, as matrix size increases occupancy sucks more and more.

I don’t think that an intermediate scheme between one matrix per block and one matrix per thread would buy us much, unless one introduces really clever memory management (one might reuse the top rows that have finished processing as soon as there are enough of them to hold a full new matrix).

I’ve been thinking about using a blocked algorithm, or mixed memory schemes (using registers or local memory to hold part of the matrix). But it all seems to get ugly pretty quickly.

Unless one uses Fermi’s unified address space to transparently store the lower, more frequently accessed part of the matrix in shared memory and the upper part in local memory.

Note that the first line of your kernel contains a bug, it should read

int bx = blockIdx.x + blockIdx.y*gridDim.x;

Thanks, I should have spotted that earlier. Thankfully, the bug doesn’t affect the execution time, so my results above still hold… strangely, however, the exact same code runs slightly slower today (0.59s vs. 0.55s previously).

Propagating announcement of batched solver and inversion code for small matrices from this thread:

http://forums.nvidia.com/index.php?showtopic=210707&st=0&p=1296427&fromsearch=1&#entry1296427

The source code for an efficient solver and matrix inversion for small matrices, using partial pivoting, is now available to all registered developers.

On sm_2x GPUs (48 KB of shared memory), the following matrix dimensions are supported:

    DP          solver, non-batched matrix inverse:  2...76
    DP complex  solver, non-batched matrix inverse:  2...53
    DP          batched matrix inverse:              2...77
    DP complex  batched matrix inverse:              2...55

The code has been released under a BSD license.

It is available from the CUDA registered developer web site:

https://nvdeveloper.nvidia.com