With NVIDIA HPC SDK 21.5, ScaLAPACK linked against NVBLAS generates incorrect results. How can I debug this? To reproduce:
# Copy the scalapack example
cp -a /opt/nvidia/hpc_sdk/Linux_aarch64/21.5/examples/MPI/scalapack .
# Edit the makefile to link NVBLAS
sed -i -e 's#-Mscalapack#-Mscalapack -L/opt/nvidia/hpc_sdk/Linux_aarch64/21.5/math_libs/11.3/targets/sbsa-linux/lib -lnvblas#' Makefile
# Increase total available memory
sed -i -e 's/TOTMEM = 4000000/TOTMEM=268435456/' pdludriver.f
# Configure NVBLAS
cat >nvblas.conf <<EOF
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_CPU_BLAS_LIB /opt/nvidia/hpc_sdk/Linux_aarch64/21.5/compilers/lib/libblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
EOF
# Configure SCALAPACK driver
cat >LU.dat <<EOF
'SCALAPACK, Version 2.0, LU factorization input file'
'NVHPC Scalapack example, 2 processors.'
'LU.out' output file name (if any)
6 device out
1 number of problems sizes
4096 values of M
2048 values of N
1 number of NB's
64 values of NB
1 number of NRHS's
1 values of NRHS
1 number of NBRHS's
1 values of NBRHS
1 number of process grids (ordered pairs of P & Q)
1 values of P
2 values of Q
1.0 threshold
T (T or F) Test Cond. Est. and Iter. Ref. Routines
EOF
# Compile and run
make
On my system I see:
Relative machine precision (eps) is taken to be 0.111022E-15
Routines pass computational tests if scaled residual is less than 1.0000
TIME M N NB NRHS NBRHS P Q LU Time Sol Time MFLOPS CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------
272764452 A 281473292836912 281473292836912 10000... 281473292836912 12... 281470681743372... 281473292836904 281473292836904 -1683873748
272764452 A 281473368793136 281473368793136 10000... 281473368793136 2... 281470681743362... 281473368793128 281473368793128 -1607917524
281474532195164 R 281473292887600 281473292887600 2049... 281473292887600 274877906944... 0... 281473292887472 281473292887472 -444515508
281474229047692 R 281473368843824 281473368843824 2049... 281473368843824 274877906944... 0... 281473368843696 281473368843696 -747662980
||A - P*L*U|| / (||A|| * N * eps) = 0.7540528E+15
WALL 4096 2048 64 0 0 1 2 0.55 0.00 26212.39 FAILED
Finished 1 tests, with the following results:
0 tests completed and passed residual checks.
1 tests completed and failed residual checks.
0 tests skipped because of illegal input values.
But if I disable GPU offload of DGEMM in NVBLAS, the test passes again:
echo "NVBLAS_GPU_DISABLED_DGEMM" >> nvblas.conf
make run
Relative machine precision (eps) is taken to be 0.111022E-15
Routines pass computational tests if scaled residual is less than 1.0000
TIME M N NB NRHS NBRHS P Q LU Time Sol Time MFLOPS CHECK
---- ----- ----- --- ---- ----- ---- ---- -------- -------- -------- ------
272764452 A 281473343234096 281473343234096 10000... 281473343234096 12... 281470681743372... 281473343234088 281473343234088 -1633476564
272764452 A 281472832774192 281472832774192 10000... 281472832774192 2... 281470681743362... 281472832774184 281472832774184 -2143936468
281474106245708 R 281473343284784 281473343284784 2049... 281473343284784 274877906944... 0... 281473343284656 281473343284656 -870464964
281474049853340 R 281472832824880 281472832824880 2049... 281472832824880 274877906944... 0... 281472832824752 281472832824752 -926857332
WALL 4096 2048 64 0 0 1 2 0.75 0.00 19049.75 PASSED
What should I try next?
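One direction I am considering is to keep DGEMM on the GPU and change one NVBLAS setting at a time, to see which setting the failure follows. The sketch below only generates the config variants and does not run anything on the GPU. The directory name `nvblas-variants`, the variant file names, and the tile size of 1024 are my own placeholders. The fallback base config exists only so the script also runs standalone. I am assuming that `NVBLAS_GPU_LIST 0` and `NVBLAS_TILE_DIM` behave as described in the NVBLAS documentation.

```shell
#!/bin/sh
# Sketch: derive NVBLAS config variants that each change one setting while
# leaving DGEMM offload enabled. File and directory names are placeholders.
base=nvblas.conf
outdir=nvblas-variants

# For a standalone run, fall back to a base config like the one in this post.
if [ ! -f "$base" ]; then
cat >"$base" <<EOF
NVBLAS_LOGFILE nvblas.log
NVBLAS_TRACE_LOG_ENABLED
NVBLAS_CPU_BLAS_LIB /opt/nvidia/hpc_sdk/Linux_aarch64/21.5/compilers/lib/libblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED
EOF
fi

mkdir -p "$outdir"

# Common starting point: the base config with any DGEMM disable line removed.
grep -v '^NVBLAS_GPU_DISABLED_DGEMM' "$base" >"$outdir/common.conf"

# Variant 1: a single GPU (device 0) instead of ALL.
sed -e 's/^NVBLAS_GPU_LIST .*/NVBLAS_GPU_LIST 0/' \
    "$outdir/common.conf" >"$outdir/single-gpu.conf"

# Variant 2: no automatic pinning of host memory.
grep -v '^NVBLAS_AUTOPIN_MEM_ENABLED' \
    "$outdir/common.conf" >"$outdir/no-autopin.conf"

# Variant 3: a smaller tile dimension (1024 is an arbitrary choice).
{ cat "$outdir/common.conf"; echo 'NVBLAS_TILE_DIM 1024'; } \
    >"$outdir/tile-1024.conf"

ls "$outdir"
```

Each variant could then be selected through the `NVBLAS_CONFIG_FILE` environment variable, for example `NVBLAS_CONFIG_FILE=$PWD/nvblas-variants/single-gpu.conf make run`, assuming the MPI launcher passes the environment through to both ranks.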