Bug in cusolverDnDsyevj

I am working with an electronic structure code that needs a double-precision symmetric eigensolver. Previously we had been using magma_dsyevd_gpu for this purpose, but we wanted to make the use of MAGMA optional since it is an additional dependency for users. To that end I’ve been looking at cusolverDnDsyevj and cusolverDnDsyevd. The latter works fine. The former works in most cases, but for one of our test cases the eigenvalues it returns are not sorted correctly. According to the cusolver documentation this routine should always sort eigenvalues in ascending order, but that is not the case here. Note that the eigenvalues themselves were the same as those from cusolverDnDsyevd; they were just not in the right order. The matrix size was N=11232.

The ordering of the returned eigenvalues was not random. They came back in blocks, each sorted in ascending order, but the blocks themselves were out of order.
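For anyone who wants to check for the same symptom, here is a minimal host-side sketch. It assumes the eigenvalue array has already been copied back from the device into a std::vector<double>; the helper names find_descents and same_spectrum are made up for this post and are not part of cusolver. The first lists every position where ascending order breaks (the block boundaries described above), and the second checks whether two results contain the same values once the order is ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Return every index i where w[i] < w[i-1], i.e. where ascending order breaks.
// A correctly sorted array gives an empty result; an array made of sorted
// blocks in the wrong order gives one entry per misplaced block boundary.
std::vector<std::size_t> find_descents(const std::vector<double>& w) {
    std::vector<std::size_t> breaks;
    for (std::size_t i = 1; i < w.size(); ++i)
        if (w[i] < w[i - 1]) breaks.push_back(i);
    return breaks;
}

// Compare two eigenvalue sets while ignoring their order: sort local copies
// and require every pair to agree within the absolute tolerance tol.
bool same_spectrum(std::vector<double> a, std::vector<double> b, double tol) {
    assert(tol >= 0.0);
    if (a.size() != b.size()) return false;
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    for (std::size_t i = 0; i < a.size(); ++i)
        if (std::fabs(a[i] - b[i]) > tol) return false;
    return true;
}
```

Running find_descents on the syevj output and same_spectrum on the syevj and syevd outputs would distinguish "right values, wrong order" from genuinely different eigenvalues.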

My suggestion would be to file a bug at http://developer.nvidia.com

Provide a complete test case, and indicate the platform you are running on (OS, GPU, compile command line).

Unfortunately the test case is not readily repeatable as it runs on a cluster and requires several hundred Voltas as well as 50 Tbytes of host memory on the cluster nodes. I have smaller test cases but have not seen the problem with those.

I went ahead and filed a bug report there. Not a pressing issue since there are other routines that work correctly.

cusolver is not a multi-GPU aware library. Therefore, if you think cusolver has a defect, it should be demonstrable with a test case that does not require a cluster or several hundred Voltas.

I understand that it may require some effort, but if you believe a call to cusolverDnDsyevj is failing, it should be conceptually straightforward to capture the input data and output data and build that into a test case. Without that, it may be very unlikely that any progress would be made on the defect report.
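As a rough sketch of what capturing the input could look like on the host side: the code below assumes the matrix has been copied into a std::vector<double> of n*n column-major entries just before the solver call. The file layout (a 64-bit n followed by the raw doubles) and the names save_matrix and load_matrix are only illustrative choices, not an established format. For N=11232 the payload is 11232 * 11232 * 8 bytes, roughly 1 GB.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Write an n x n matrix to a binary file: a 64-bit n, then n*n raw doubles.
void save_matrix(const std::string& path, const std::vector<double>& a,
                 std::uint64_t n) {
    assert(a.size() == n * n);
    std::ofstream out(path, std::ios::binary);
    if (!out) throw std::runtime_error("cannot open " + path);
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    out.write(reinterpret_cast<const char*>(a.data()),
              static_cast<std::streamsize>(a.size() * sizeof(double)));
    out.flush();
    if (!out) throw std::runtime_error("write failed for " + path);
}

// Read a matrix written by save_matrix; n receives the stored dimension.
std::vector<double> load_matrix(const std::string& path, std::uint64_t& n) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    if (!in) throw std::runtime_error("missing header in " + path);
    std::vector<double> a(n * n);
    in.read(reinterpret_cast<char*>(a.data()),
            static_cast<std::streamsize>(a.size() * sizeof(double)));
    if (!in) throw std::runtime_error("truncated data in " + path);
    return a;
}
```

A standalone reproducer would then call load_matrix, copy the data to the device, and make the single suspect solver call. The file is not portable across machines with different endianness.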

So don’t be surprised if they respond to your defect report with a request for a self-contained test case/reproducer. It is a pretty typical expectation.

Maybe I wasn’t clear, but you seem to have missed the point. While the eigensolver only runs on one GPU, the matrix is generated by an application that requires several hundred Voltas and Tbytes of host memory. If the problem is specific to this particular matrix then you obviously need this matrix as a test case, and generating it requires a large system. Alternatively I could save it to disk, and that may be what I wind up doing, but since the matrix is around a Gigabyte, attaching it to a bug report is probably not going to work.

Exactly. My suggestion would be to create a self-contained reproducer. That would include a way to supply all the necessary input data for the function call that you suspect.

The team on the other side of the bug reporting system should be able to provide you with a temporary FTP system to upload any large files. It’s common practice in these cases. NVIDIA has an easy-to-use temporary FTP system that we use to communicate in these cases.

So it turns out the original matrix is not required to reproduce the problem, just one that is similar to it. The bug was manifesting with a matrix whose eigenvalue spectrum is common in electronic structure codes: a set of well separated but highly degenerate discrete bands. I went ahead and wrote a test program to generate matrices with similar spectra, and the bug is reproducible. I’ll add the test code to the bug report I filed.
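For readers who want to build a matrix of this kind themselves, here is one possible sketch. It is not the test program attached to the bug report, and it has not been checked against cusolverDnDsyevj. It assumes a simple construction: pick a few band energies, repeat each one many times to make it degenerate, and hide the resulting diagonal matrix D behind a single random Householder reflector H = I - 2uu^T, giving A = H D H. The names band_spectrum and matrix_with_spectrum are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Build a spectrum of well separated, highly degenerate bands:
// each value in band_values is repeated `degeneracy` times.
std::vector<double> band_spectrum(const std::vector<double>& band_values,
                                  std::size_t degeneracy) {
    std::vector<double> d;
    d.reserve(band_values.size() * degeneracy);
    for (double v : band_values) d.insert(d.end(), degeneracy, v);
    return d;
}

// Return A = H * diag(d) * H in column-major order, where H = I - 2 u u^T
// and u is a random unit vector. H is symmetric and orthogonal, so A is
// symmetric and its eigenvalues are exactly the entries of d.
std::vector<double> matrix_with_spectrum(const std::vector<double>& d,
                                         unsigned seed) {
    const std::size_t n = d.size();
    assert(n > 0);
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(0.0, 1.0);
    std::vector<double> u(n);
    double norm2 = 0.0;
    for (double& x : u) { x = dist(gen); norm2 += x * x; }
    const double inv = 1.0 / std::sqrt(norm2);
    for (double& x : u) x *= inv;          // normalize u to unit length

    double s = 0.0;                        // s = u^T diag(d) u
    for (std::size_t i = 0; i < n; ++i) s += d[i] * u[i] * u[i];

    // Expanding H D H gives
    // A_ij = d_i * delta_ij + (4 s - 2 (d_i + d_j)) * u_i * u_j.
    std::vector<double> A(n * n);
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i) {
            double aij = (4.0 * s - 2.0 * (d[i] + d[j])) * (u[i] * u[j]);
            if (i == j) aij += d[i];
            A[i + j * n] = aij;
        }
    return A;
}
```

A single reflector keeps the cost at O(n^2) but produces a matrix that is only a rank-2 update of D, which is a very special structure. Whether that is enough to trigger the sorting problem is an open question; applying several independent reflectors in sequence would mix the bands more thoroughly.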

Bug number 2098639