Lots of small matrices

Yes, behavior is undefined in that case. I’m undecided about it myself, since for my own codes I certainly insist on reproducible results. For the sz=32 case the results seemed to be stable though, and as the code looks so much nicer I decided to present it here.

Yeah, as the shared memory constraint allows fewer and fewer blocks per SM, we can have more and more warps per block without (further) limiting the number of blocks.
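To put some purely illustrative numbers on it (assuming the one-matrix-per-block scheme, single precision, the padded leading dimension sz | 1, and Fermi’s limits of 48 KB shared memory, 8 blocks and 48 resident warps per SM):

[code]
sz = 32:  32 * 33 * 4 B ≈  4.1 KB/matrix  ->  shared memory alone would allow 11 blocks/SM,
          so the 8-blocks-per-SM limit is what actually binds.
sz = 64:  64 * 65 * 4 B ≈ 16.3 KB/matrix  ->  only 2 blocks fit per SM anyway, so each block
          can grow to many warps (up to the 48-warps-per-SM limit) without costing a block.
[/code]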

Yes, I’d think so too. The CPU has an advantage there because of its larger caches.

I think this reflects the memory hierarchy: at huge sizes the GPU excels through its roughly 10x larger memory bandwidth, and at small sizes its many SMs (and execution units) provide a large bandwidth advantage over the CPU as well. In between, however, there is a regime where the matrices still fit into the CPU caches but not into the GPU’s on-chip memory.

Interesting generalization. I had anticipated having a table of optimal padding sizes, but using [font="Courier New"]sz | 1[/font] seems to be enough to avoid bank conflicts for arbitrary sizes.
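For illustration, a minimal sketch of the indexing I mean (smem, sz, lda and the loading loop are made-up names for this post, not the actual kernel):

[code]
// One matrix per block, padded leading dimension in shared memory.
__global__ void pad_example(const float *A, int sz)
{
    const int lda = sz | 1;              // next odd number >= sz
    extern __shared__ float smem[];      // sz * lda floats, passed at launch

    // Element (i, j) lives at smem[i * lda + j]. Because an odd lda is
    // coprime to the bank count (16 or 32), threads walking down a column
    // (fixed j, consecutive i) hit distinct banks -- conflict-free for any
    // sz, no lookup table of padding sizes needed.
    for (int i = threadIdx.x; i < sz; i += blockDim.x)
        for (int j = 0; j < sz; ++j)
            smem[i * lda + j] = A[blockIdx.x * sz * sz + i * sz + j];
    __syncthreads();
}
[/code]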

I’ve found that exchanging rows explicitly (while still keeping the piv vector so we can return it) is slightly beneficial. It would also be advantageous for my proposal below.
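Roughly what I have in mind, as a sketch only (one matrix per block, one thread per column, the matrix in shared memory with leading dimension lda; A, piv, p and k are illustrative names rather than the ones in the actual kernel):

[code]
// Explicit row exchange for elimination step k, after the pivot row p has
// been found. The rows are physically swapped, but piv is still filled in
// so the permutation can be returned.
__device__ void swap_rows(float *A, int *piv, int lda, int k, int p)
{
    const int j = threadIdx.x;            // this thread's column
    if (p != k) {
        float tmp      = A[k * lda + j];
        A[k * lda + j] = A[p * lda + j];  // swap rows k and p in shared memory
        A[p * lda + j] = tmp;
    }
    if (j == 0)
        piv[k] = p;                       // record the permutation anyway
    __syncthreads();
}
[/code]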

Indeed, as matrix size increases occupancy sucks more and more.

I don’t think that an intermediate scheme between one matrix per block and one matrix per thread would buy us much, unless one introduces really clever memory management (e.g., reusing the top rows that have finished processing as soon as there are enough of them to hold a full new matrix).

I’ve been thinking about using a blocked algorithm, or mixed memory schemes (using registers or local memory to hold part of the matrix), but it all seems to get ugly pretty quickly.

Unless one uses Fermi’s unified pointers to transparently store the lower, more often accessed part of the matrix in shared memory and the upper part in local memory.
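A very rough sketch of what that could look like (sm_20 only, one matrix per thread; SZ, SPLIT, BLOCKDIM and all names are made up for illustration, and the split point would of course need tuning):

[code]
// Per-thread matrix split between shared and local memory, accessed
// uniformly through generic (unified) pointers.
#define SZ       16                 // matrix size (made up)
#define SPLIT     8                 // rows [0, SPLIT) -> local, rows [SPLIT, SZ) -> shared
#define LDA      (SZ | 1)           // padded leading dimension
#define BLOCKDIM 64                 // threads (= matrices) per block

__global__ void lu_mixed(const float *in, int nmat)
{
    __shared__ float sh[BLOCKDIM][(SZ - SPLIT) * LDA];  // hot lower rows, one slice per thread
    float loc[SPLIT * LDA];                             // colder upper rows, local memory

    // Generic row pointers: with the unified address space the same float*
    // can point into either shared or local memory.
    float *row[SZ];
    for (int i = 0; i < SPLIT; ++i)  row[i] = &loc[i * LDA];
    for (int i = SPLIT; i < SZ; ++i) row[i] = &sh[threadIdx.x][(i - SPLIT) * LDA];

    // ... load the matrix and run the factorization via row[i][j],
    //     oblivious to where each row actually lives ...
}
[/code]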