SCALAPACK with NVBLAS

With NVIDIA HPC SDK 21.5, SCALAPACK linked against NVBLAS produces incorrect results. How can I debug this? To reproduce:

# Copy the scalapack example
cp -a /opt/nvidia/hpc_sdk/Linux_aarch64/21.5/examples/MPI/scalapack .

# Edit the makefile to link NVBLAS
sed -i -e 's#-Mscalapack#-Mscalapack -L/opt/nvidia/hpc_sdk/Linux_aarch64/21.5/math_libs/11.3/targets/sbsa-linux/lib -lnvblas#' Makefile

# Increase total available memory
sed -i -e 's/TOTMEM = 4000000/TOTMEM=268435456/' pdludriver.f
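For context on why the default is too small, here is a rough size check. It is only a sketch and assumes that TOTMEM is a byte count (as the comments in the ScaLAPACK test drivers describe it) and that a double occupies 8 bytes; the variable names are mine.

```shell
# Assumption: TOTMEM counts bytes and a double is 8 bytes.
matrix_bytes=$((4096 * 2048 * 8))   # global M x N matrix used in LU.dat
totmem=268435456                    # new TOTMEM value set by the sed command

echo "matrix: $((matrix_bytes / 1024 / 1024)) MiB"
echo "TOTMEM: $((totmem / 1024 / 1024)) MiB"
```

Under those assumptions the matrix alone needs 64 MiB, far more than the default 4000000, while the new value allows 256 MiB per process.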

# Configure NVBLAS
cat >nvblas.conf <<EOF
NVBLAS_LOGFILE  nvblas.log
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_CPU_BLAS_LIB  /opt/nvidia/hpc_sdk/Linux_aarch64/21.5/compilers/lib/libblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
EOF
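NVBLAS looks for its configuration in the file named by the NVBLAS_CONFIG_FILE environment variable and otherwise falls back to nvblas.conf in the current directory. A minimal sketch of a sanity check follows; `check_nvblas_conf` is a hypothetical helper of mine, not part of the SDK, and it only verifies that the CPU BLAS fallback library named in the file exists.

```shell
# Hypothetical helper (not part of the SDK): print the CPU BLAS library
# named in an NVBLAS config file; fail if the entry or the file is missing.
check_nvblas_conf() {
    conf="${1:-nvblas.conf}"
    if [ ! -r "$conf" ]; then
        echo "cannot read config file: $conf" >&2
        return 1
    fi
    lib=$(awk '$1 == "NVBLAS_CPU_BLAS_LIB" { print $2 }' "$conf")
    if [ -z "$lib" ]; then
        echo "no NVBLAS_CPU_BLAS_LIB entry in $conf" >&2
        return 1
    fi
    if [ ! -e "$lib" ]; then
        echo "CPU BLAS library not found: $lib" >&2
        return 1
    fi
    echo "$lib"
}

# Point NVBLAS at the config file explicitly instead of relying on the
# working directory of each MPI rank.
export NVBLAS_CONFIG_FILE="$PWD/nvblas.conf"
check_nvblas_conf "$NVBLAS_CONFIG_FILE" || echo "fix nvblas.conf before running"
```

Exporting the variable before `make` rules out the case where a rank starts in a different directory and silently misses the configuration.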

# Configure SCALAPACK driver
cat >LU.dat <<EOF
'SCALAPACK, Version 2.0,  LU factorization input file'
'NVHPC Scalapack example, 2 processors.'
'LU.out'             output file name (if any)
6                    device out
1                    number of problems sizes
4096                 values of M
2048                 values of N
1                    number of NB's
64                   values of NB
1                    number of NRHS's
1                    values of NRHS
1                    number of NBRHS's
1                    values of NBRHS
1                    number of process grids (ordered pairs of P & Q)
1                    values of P
2                    values of Q
1.0                  threshold
T                    (T or F) Test Cond. Est. and Iter. Ref. Routines
EOF

# Compile and run
make

On my system I see:

Relative machine precision (eps) is taken to be       0.111022E-15
Routines pass computational tests if scaled residual is less than   1.0000

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------

272764452 A   281473292836912 281473292836912 10000... 281473292836912 12... 281470681743372... 281473292836904 281473292836904 -1683873748
272764452 A   281473368793136 281473368793136 10000... 281473368793136 2... 281470681743362... 281473368793128 281473368793128 -1607917524
281474532195164 R   281473292887600 281473292887600 2049... 281473292887600 274877906944... 0... 281473292887472 281473292887472 -444515508
281474229047692 R   281473368843824 281473368843824 2049... 281473368843824 274877906944... 0... 281473368843696 281473368843696 -747662980
||A - P*L*U|| / (||A|| * N * eps) =             0.7540528E+15
WALL  4096  2048  64     0    0    1    2     0.55     0.00 26212.39 FAILED

Finished      1 tests, with the following results:
    0 tests completed and passed residual checks.
    1 tests completed and failed residual checks.
    0 tests skipped because of illegal input values.

But if I stop NVBLAS from offloading DGEMM to the GPU, the test passes again:

echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf
make run
Relative machine precision (eps) is taken to be       0.111022E-15
Routines pass computational tests if scaled residual is less than   1.0000

TIME     M     N  NB NRHS NBRHS    P    Q  LU Time Sol Time  MFLOPS  CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------

272764452 A   281473343234096 281473343234096 10000... 281473343234096 12... 281470681743372... 281473343234088 281473343234088 -1633476564
272764452 A   281472832774192 281472832774192 10000... 281472832774192 2... 281470681743362... 281472832774184 281472832774184 -2143936468
281474106245708 R   281473343284784 281473343284784 2049... 281473343284784 274877906944... 0... 281473343284656 281473343284656 -870464964
281474049853340 R   281472832824880 281472832824880 2049... 281472832824880 274877906944... 0... 281472832824752 281472832824752 -926857332
WALL  4096  2048  64     0    0    1    2     0.75     0.00 19049.75 PASSED
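
To compare the two configurations mechanically, the verdict can be pulled out of the driver output. This is only a sketch: `lu_verdict` is a hypothetical helper name, and it assumes the result line starts with WALL and ends with the CHECK column, as in the output above.

```shell
# Hypothetical helper: print the last field (the CHECK column) of every
# result line, i.e. every line whose first field is WALL.
lu_verdict() {
    awk '$1 == "WALL" { print $NF }'
}

# Usage, assuming the driver output goes to stdout as above:
#   make run | lu_verdict
```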

What should I try next?