Syevj performance

I noticed that the performance of cuSOLVER's syevj degrades significantly for matrices larger than ~2000, compared to syevd. This is what I got on my GeForce Titan XP, double precision, CUDA version 11.0. The syevj time is for 1 sweep; the reported number is GFLOP/s = 2*n^3/time:

n      syevd    syevj (1 sweep)
 256    1.74     7.13
 512    6.06    17.87
1024   19.71    25.25
2048   49.09    39.11
4096   88.32    20.84

The problem I’m interested in is one where I have a large matrix that is almost diagonal, so ~1 sweep of syevj is enough to get the eigenvalues.
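For reference, limiting syevj to a single sweep goes through the syevjInfo parameter structure of cuSOLVER's dense API. A minimal sketch (device allocation, workspace query, and error checking omitted for brevity; the buffers d_A, d_W, d_work, and d_info are assumed to be allocated elsewhere):

```cpp
#include <cusolverDn.h>

// Sketch only: configure syevj to stop after one sweep, then run the
// double-precision Jacobi eigensolver on an n x n column-major matrix.
void one_sweep_syevj(cusolverDnHandle_t handle, int n,
                     double* d_A, double* d_W,
                     double* d_work, int lwork, int* d_info) {
    syevjInfo_t params;
    cusolverDnCreateSyevjInfo(&params);
    cusolverDnXsyevjSetMaxSweeps(params, 1);     // at most one sweep
    cusolverDnXsyevjSetTolerance(params, 1e-7);  // convergence tolerance

    cusolverDnDsyevj(handle, CUSOLVER_EIG_MODE_VECTOR, CUBLAS_FILL_MODE_LOWER,
                     n, d_A, n, d_W, d_work, lwork, d_info, params);

    cusolverDnDestroySyevjInfo(params);
}
```

The same params object also drives the workspace query (cusolverDnDsyevj_bufferSize) and the single-precision and batched variants.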

The same phenomenon happens with single precision instead of double precision.

Concerning syevj with single precision: I noticed that the returned orthogonal matrix can be quite far from orthogonal. The squared 2-norms of its columns can deviate from 1 by about 1e-4 (or less) for matrices of size ~1000 or larger. A simple rescaling of the columns fixes the issue.

We believe this is to be expected. The asymptotic complexity of the Jacobi method is much higher than that of SYEVD.

This solver targets small matrices (orders of 32, 64, 128, etc.), where it can deliver faster performance than SYEVD or QR. We use it in the batched SYEVJ for this reason.

Regarding the orthogonality of the columns, can you try adjusting the tolerance and/or the number of sweeps?

Here are some slides for reference:

Thanks for the quick answer.
I guess I was expecting the GFLOP/s to increase roughly linearly with n, which is what I observe for n < 2000 (and with syevd), but for larger n it stagnates and even drops. This is perhaps understandable if the implementation targets small matrices (<< 1024).

Concerning the orthogonality of the columns, the issue I mentioned occurs with the default tolerance and number of sweeps.

Also, in a future version of cuSOLVER, (D/Z)SYEVD performance has been improved slightly for N < 2048.