cusolverDnXsyevd status 6 + XID 31 MMU fault at n=50000, FP64 real, CUDA 13.2

Hello,

I’m hitting a hard kernel failure in cusolverDnXsyevd on a 50,000 × 50,000 real FP64 symmetric matrix. This looks like a continuation of the n≈27k Xsyevd thread, which was reported fixed in CUDA Toolkit 13.0. The failure mode here is different (kernel execution, not bufferSize) but appears to be in the same family of internal int-sizing issues at large n.

Environment

  • CUDA Toolkit: 13.2

  • libcublas: version=130400

  • Driver: 580.95.05

  • GPU: NVIDIA H200

  • Call site: direct C++ (no CuPy/PyTorch in the path)

Call configuration

  • n = 50000, lda = 50000, symmetric real FP64

  • dataTypeA = CUDA_R_64F, dataTypeW = CUDA_R_64F, computeType = CUDA_R_64F

  • jobz = CUSOLVER_EIG_MODE_VECTOR, uplo = CUBLAS_FILL_MODE_LOWER

  • Default cusolverDnParams_t (created via cusolverDnCreateParams, no advanced options set)

Sequence

  1. cusolverDnXsyevd_bufferSize → returns success. Reported workspace: device = [X] bytes, host = [Y] bytes.

  2. Device + host workspace allocated successfully (verified cudaMalloc return).

  3. cusolverDnXsyevd → returns status 6, which is CUSOLVER_STATUS_EXECUTION_FAILED in the cusolverStatus_t enum (CUSOLVER_STATUS_INTERNAL_ERROR would be 7).

System log (concurrent with the syevd call)

XID 31: NVRM: Xid (PCI:0000:59:00): 31, pid=996597, name=exe, channel 0x0000000c
MMU Fault: ENGINE GRAPHICS GPC1 GPCCLIENT_T1_4 faulted @ 0x2aa1_cc016000
Fault type: FAULT_PDE  ACCESS_TYPE_VIRT_READ

A FAULT_PDE virtual read at a high address from a cuSOLVER GPC kernel strongly suggests an out-of-bounds index inside the syevd pipeline — consistent with an internal element-count or stride being computed/stored in 32 bits somewhere along the call chain (n² = 2.5 × 10⁹ exceeds INT32_MAX). The 13.0 release notes describe the X-API dimension limit as removed; this case suggests the fix may not extend through every internal kernel at this scale.
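As a quick check of the arithmetic behind that hypothesis, here is a minimal host-only sketch. It says nothing about what cuSOLVER does internally; it only shows where n² stops fitting in a signed 32-bit integer and what value a truncated 32-bit element count would hold at n = 50000. The helper names are hypothetical, chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Element count of an n x n matrix, computed in 64 bits so it cannot
// overflow for the sizes discussed here.
std::int64_t element_count(std::int64_t n) {
    assert(n >= 0);
    return n * n;
}

// True when n * n no longer fits in a signed 32-bit integer.
bool overflows_int32(std::int64_t n) {
    return element_count(n) > std::numeric_limits<std::int32_t>::max();
}

// Value a signed 32-bit variable would hold if it kept only the low
// 32 bits of `count` (two's complement wrap-around). Written with
// explicit arithmetic so the result is well-defined on any C++ standard.
std::int32_t wrapped_to_int32(std::int64_t count) {
    const std::uint32_t low = static_cast<std::uint32_t>(count);
    if (low > static_cast<std::uint32_t>(std::numeric_limits<std::int32_t>::max())) {
        return static_cast<std::int32_t>(static_cast<std::int64_t>(low) - 4294967296LL);
    }
    return static_cast<std::int32_t>(low);
}
```

Under this arithmetic the largest n with n² ≤ INT32_MAX is 46340, and a wrapped element count at n = 50000 would be negative, which is the kind of value that could turn into a wild address if it were ever used as an offset.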

Reproduction

The smallest test case I have is a random symmetric FP64 matrix at n = 50,000 with the call sequence above. I haven’t bisected the n threshold yet; happy to do so and report back if useful. Will also try cusolverDnXsyevdx with a top-k range to see whether the same internal path is involved.
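A minimal sketch of how the bisection could be driven. It assumes the failure is monotonic in n (every size above the threshold fails), which has not been verified. `succeeds` is a hypothetical callback that runs one Xsyevd attempt at the given n and reports whether it completed; since an XID 31 fault most likely leaves the CUDA context unusable, each probe would probably need to run in a fresh process.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Returns the smallest n in (good, bad] for which succeeds(n) is false.
// Preconditions (assumed, not checked by running the probes):
//   succeeds(good) == true, succeeds(bad) == false,
//   and failures are monotonic in n.
std::int64_t first_failing_n(std::int64_t good, std::int64_t bad,
                             const std::function<bool(std::int64_t)>& succeeds) {
    assert(good < bad);
    while (bad - good > 1) {
        const std::int64_t mid = good + (bad - good) / 2;
        if (succeeds(mid)) {
            good = mid;  // threshold lies above mid
        } else {
            bad = mid;   // threshold is at or below mid
        }
    }
    return bad;
}
```

Each halving costs one full eigensolve, so the number of probes grows only with the logarithm of the gap between the known-good and known-bad sizes.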

Questions

  1. Is Xsyevd validated at n ≥ 50,000 FP64 in 13.2? The earlier thread topped out at ~27k.

  2. Is there a recommended workaround within cuSOLVER (e.g. Xsyevdx, Xsygvd, or routing through cuSolverMp) that avoids the affected code path while staying single-GPU?

  3. Should I file this as a bug, or is there an existing internal tracker?

Thanks.

I should add that the same call works reliably for n <= 46000.

Hi spectre,

Thanks for the detailed report. We tried to reproduce on our side and could not, so we’d like to ask for a few more pieces of information.

Our setup, intended to match yours as closely as possible:

  • GPU: NVIDIA H200
  • CUDA toolkit: 13.2 Update 1
  • libcublas: 130400 (same major.minor as your report)
  • libcusolver: 12200
  • Driver: 590.48.01

Two runs at exactly your call signature (jobz=CUSOLVER_EIG_MODE_VECTOR, uplo=lower, n=50000, lda=50000, computeType=CUDA_R_64F, default cusolverDnParams_t):

Both run cleanly, residuals at the expected level. So with the same cusolverDnXsyevd call, the same cuBLAS major version, on the same GPU class, we don’t see the failure — and matrix data does not appear to be the trigger.

To localize the difference, could you share a self-contained reproducer? A small .cu file that allocates a random symmetric FP64 matrix with a fixed seed (e.g. std::mt19937 rng(42)), calls cusolverDnXsyevd_bufferSize → cudaMalloc → cusolverDnXsyevd, and prints the status would be enough.
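For the host-side part of such a reproducer, a sketch along these lines should be enough. It only covers the deterministic matrix fill; the device copy and the cuSOLVER calls are left out. The helper name `make_symmetric` and the uniform [-1, 1) distribution are assumptions made for illustration, not requirements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fills a column-major n x n symmetric matrix with leading dimension lda,
// using a fixed seed so every run sees the same data.
// Element (i, j) is stored at a[i + j * lda].
std::vector<double> make_symmetric(std::int64_t n, std::int64_t lda,
                                   unsigned seed = 42) {
    assert(n > 0 && lda >= n);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);

    // 64-bit sizes throughout: lda * n exceeds INT32_MAX at n = 50000.
    const std::size_t ld = static_cast<std::size_t>(lda);
    const std::size_t cols = static_cast<std::size_t>(n);
    std::vector<double> a(ld * cols, 0.0);

    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = j; i < cols; ++i) {
            const double v = dist(rng);
            a[i + j * ld] = v;  // lower triangle
            a[j + i * ld] = v;  // mirrored upper triangle
        }
    }
    return a;
}
```

At n = lda = 50000 this host buffer is 50000 × 50000 × 8 bytes, roughly 20 GB, so the machine needs that much free host memory in addition to the device allocation.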

If you can spare the wall time (~1 h at n=50000), please run

compute-sanitizer ./your_binary

and attach the output. This typically pins down the failing kernel and the offending access pattern, and gives more diagnostic detail than the status code alone.