Lots of small matrices

Yes, behavior is undefined in that case. I’m undecided about it myself, since for my own codes I certainly insist on reproducible results. For the sz=32 case the results seemed to be stable though, and as the code looks so much nicer I decided to present it here.

Yeah, as the shared memory constraint allows fewer and fewer blocks per SM, we can have more and more warps per block without (further) limiting the number of blocks.
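To put some purely illustrative numbers on it (assuming the one-matrix-per-block scheme, single precision, the padded leading dimension sz | 1, and Fermi’s limits of 48 KB shared memory, 8 blocks and 48 resident warps per SM):

[code]
sz = 32:  32 * 33 * 4 B ≈  4.1 KB/matrix  ->  shared memory alone would allow 11 blocks/SM,
          so the 8-blocks-per-SM limit is what actually binds.
sz = 64:  64 * 65 * 4 B ≈ 16.3 KB/matrix  ->  only 2 blocks fit per SM anyway, so each block
          can grow to many warps (up to the 48-warps-per-SM limit) without costing a block.
[/code]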

Yes, I’d think so too. The CPU has an advantage there because of its larger caches.

I think this reflects the memory hierarchy: at huge sizes the GPU excels through its roughly 10x larger memory bandwidth, and at small sizes its many SMs (and execution units) provide a large bandwidth advantage over the CPU as well. In between, however, there is a regime where the matrices still fit into the CPU caches but not into the GPU’s on-chip memory.

Interesting generalization. I had anticipated having a table of optimal padding sizes, but using [font="Courier New"]sz | 1[/font] seems to be enough to avoid bank conflicts for arbitrary sizes.
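For illustration, a minimal sketch of the indexing I mean (smem, sz, lda and the loading loop are made-up names for this post, not the actual kernel):

[code]
// One matrix per block, padded leading dimension in shared memory.
__global__ void pad_example(const float *A, int sz)
{
    const int lda = sz | 1;              // next odd number >= sz
    extern __shared__ float smem[];      // sz * lda floats, passed at launch

    // Element (i, j) lives at smem[i * lda + j]. Because an odd lda is
    // coprime to the bank count (16 or 32), threads walking down a column
    // (fixed j, consecutive i) hit distinct banks -- conflict-free for any
    // sz, no lookup table of padding sizes needed.
    for (int i = threadIdx.x; i < sz; i += blockDim.x)
        for (int j = 0; j < sz; ++j)
            smem[i * lda + j] = A[blockIdx.x * sz * sz + i * sz + j];
    __syncthreads();
}
[/code]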

I’ve found that exchanging rows explicitly (while still keeping the piv vector so we can return it) is slightly beneficial. It would also be advantageous for my proposal below.
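Roughly what I have in mind, as a sketch only (one matrix per block, one thread per column, the matrix in shared memory with leading dimension lda; A, piv, p and k are illustrative names rather than the ones in the actual kernel):

[code]
// Explicit row exchange for elimination step k, after the pivot row p has
// been found. The rows are physically swapped, but piv is still filled in
// so the permutation can be returned.
__device__ void swap_rows(float *A, int *piv, int lda, int k, int p)
{
    const int j = threadIdx.x;            // this thread's column
    if (p != k) {
        float tmp      = A[k * lda + j];
        A[k * lda + j] = A[p * lda + j];  // swap rows k and p in shared memory
        A[p * lda + j] = tmp;
    }
    if (j == 0)
        piv[k] = p;                       // record the permutation anyway
    __syncthreads();
}
[/code]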

Indeed, as matrix size increases occupancy sucks more and more.

I don’t think that an intermediate scheme between one matrix per block and one matrix per thread would buy us much, unless one introduces really clever memory management (e.g., reusing the top rows that have finished processing as soon as there are enough of them to hold a full new matrix).

I’ve been thinking about using a blocked algorithm, or mixed memory schemes (using registers or local memory to hold part of the matrix), but it all seems to get ugly pretty quickly.

Unless one uses Fermi’s unified pointers to transparently store the lower, more often accessed part of the matrix in shared memory and the upper part in local memory.
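A very rough sketch of what that could look like (sm_20 only, one matrix per thread; SZ, SPLIT, BLOCKDIM and all names are made up for illustration, and the split point would of course need tuning):

[code]
// Per-thread matrix split between shared and local memory, accessed
// uniformly through generic (unified) pointers.
#define SZ       16                 // matrix size (made up)
#define SPLIT     8                 // rows [0, SPLIT) -> local, rows [SPLIT, SZ) -> shared
#define LDA      (SZ | 1)           // padded leading dimension
#define BLOCKDIM 64                 // threads (= matrices) per block

__global__ void lu_mixed(const float *in, int nmat)
{
    __shared__ float sh[BLOCKDIM][(SZ - SPLIT) * LDA];  // hot lower rows, one slice per thread
    float loc[SPLIT * LDA];                             // colder upper rows, local memory

    // Generic row pointers: with the unified address space the same float*
    // can point into either shared or local memory.
    float *row[SZ];
    for (int i = 0; i < SPLIT; ++i)  row[i] = &loc[i * LDA];
    for (int i = SPLIT; i < SZ; ++i) row[i] = &sh[threadIdx.x][(i - SPLIT) * LDA];

    // ... load the matrix and run the factorization via row[i][j],
    //     oblivious to where each row actually lives ...
}
[/code]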