Where to find NVLAmath examples of zheevd/dsyevd?

Hello,

Where can I find examples of DSYEVD or ZHEEVD implemented through the NVLAmath wrapper library? I know this webpage exists ( NVIDIA Fortran CUDA Interfaces — NVIDIA Fortran CUDA Interfaces 26.1 documentation ), but it only has examples of dgetrf and dgetrs, which I could not relate to because they do not use a workspace array. Kindly help me out in this regard. A link to an existing NVIDIA document on this would be very helpful. Thanks in advance!

Hi sk99,

The intent of the NVLAmath interface is that you don’t need to make any modifications to your existing code; you simply compile with the flags “-stdpar -gpu=nvlamath -cudalib=nvlamath”. The “-gpu=nvlamath” flag brings in the “nvhpc_nvlamath” module, which has a generic interface that gets mapped to the corresponding cuSolver routine.

In other words, you should be able to simply add these flags to your existing program and these routines are implicitly offloaded.

One caveat is that we need to use CUDA Unified Memory to handle the data management. On systems like NVIDIA’s Grace that use HMM, all host memory is accessible on the device. A non-HMM system would use “managed” memory, where only allocated data is accessible.

The “-stdpar” flag implicitly sets the CUDA Unified Memory flag, “-gpu=mem:managed” or “-gpu=mem:unified”, depending on your system.
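For reference, a sketch of what the memory-mode selection looks like when spelled out explicitly (the file name is a placeholder, and these lines simply make visible the flag that “-stdpar” would otherwise choose for you):

```shell
# Non-HMM system: only allocated (managed) data is device-accessible
nvfortran -stdpar -gpu=nvlamath,mem:managed -cudalib=nvlamath example.F90

# HMM-capable system (e.g. Grace): all host memory is device-accessible
nvfortran -stdpar -gpu=nvlamath,mem:unified -cudalib=nvlamath example.F90
```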

To test, I did find this simple dsyevd example, which, by profiling the code with Nsight Systems, I confirmed does get implicitly offloaded.

program dsyevd_example
  implicit none
  integer, parameter :: n = 3
  double precision, allocatable :: A(:,:), W(:)
  double precision, allocatable :: work(:)
  integer, allocatable :: iwork(:)
  integer :: lwork, liwork, info

  allocate(A(n,n))
  allocate(W(n))
  ! Example symmetric matrix
  A = reshape([ 1.0d0, 2.0d0, 3.0d0, &
                2.0d0, 4.0d0, 5.0d0, &
                3.0d0, 5.0d0, 6.0d0 ], shape(A))

  print *, "Input matrix A:"
  call print_matrix(A, n)

  ! Workspace query
  lwork = -1
  liwork = -1
  allocate(work(1), iwork(1))
  call dsyevd('V', 'U', n, A, n, W, work, lwork, iwork, liwork, info)

  ! Allocate optimal workspace
  lwork = int(work(1))
  liwork = iwork(1)
  deallocate(work, iwork)
  allocate(work(lwork), iwork(liwork))

  ! Actual computation
  call dsyevd('V', 'U', n, A, n, W, work, lwork, iwork, liwork, info)

  if (info /= 0) then
     print *, "DSYEVD failed, INFO =", info
     stop
  end if

  print *, "Eigenvalues:"
  print *, W

  print *, "Eigenvectors (columns of A):"
  call print_matrix(A, n)

  deallocate(A)
  deallocate(W)
contains

  subroutine print_matrix(M, n)
    integer, intent(in) :: n
    double precision, intent(in) :: M(n,n)
    integer :: i
    do i = 1, n
       print '(3F10.5)', M(i,:)
    end do
  end subroutine print_matrix

end program dsyevd_example

Compiled for the host:

% nvfortran -Ofast -llapack dsyevd1.F90; a.out
 Input matrix A:
   1.00000   2.00000   3.00000
   2.00000   4.00000   5.00000
   3.00000   5.00000   6.00000
 Eigenvalues:
  -0.5157294715892563        0.1709151888271795         11.34481428276207
 Eigenvectors (columns of A):
   0.73698   0.59101   0.32799
   0.32799  -0.73698   0.59101
  -0.59101   0.32799   0.73698

Same code compiled for the GPU (on a GH100 system):

% nvfortran -stdpar -gpu=nvlamath -cudalib=nvlamath dsyevd1.F90; a.out
 Input matrix A:
   1.00000   2.00000   3.00000
   2.00000   4.00000   5.00000
   3.00000   5.00000   6.00000
 Eigenvalues:
  -0.5157294715892582        0.1709151888271763         11.34481428276208
 Eigenvectors (columns of A):
   0.73698   0.59101   0.32799
   0.32799  -0.73698   0.59101
  -0.59101   0.32799   0.73698

And the nsys profile from the command “nsys profile --stats=true a.out”

Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)  Name
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  ----------------------------------------------------------------
    76.2          230,752          1  230,752.0  230,752.0   230,752   230,752          0.0  __pgi_dev_cumemset_8n
     8.1           24,448          1   24,448.0   24,448.0    24,448    24,448          0.0  void lansy_M_stage1<double, double, (int)8>(long, const T1 *, unsigned long, T2 *, int)
     7.1           21,440          3    7,146.7    1,568.0     1,344    18,528      9,857.2  copy_info_kernel(int, int *)
     2.3            6,976          1    6,976.0    6,976.0     6,976     6,976          0.0  void steqr_ker<stedc_params_<double2, double, (int)8, (int)16, (int)128>, double, (int)0>(long, T1:…
     1.0            2,976          1    2,976.0    2,976.0     2,976     2,976          0.0  void sytrd4_cta<sytrd_params<double, (int)16, (int)32, (int)32, (int)32, (int)0, (int)1, (int)2>, (…
     0.7            2,240          1    2,240.0    2,240.0     2,240     2,240          0.0  void hermitianize_matrix_using_upper_triangular_part<double, (long)256>(long, T1 *, long)
     0.7            2,144          1    2,144.0    2,144.0     2,144     2,144          0.0  void epilogue<sytrd_params<double, (int)32, (int)8, (int)512, (int)32, (int)16, (int)1, (int)2>, (i…
     0.7            2,048          1    2,048.0    2,048.0     2,048     2,048          0.0  void ormqr_cta_kernel<double, (int)4, (int)1>(long, long, long, const T1 *, long, long, const T1 *,…
     0.6            1,920          1    1,920.0    1,920.0     1,920     1,920          0.0  void lansy_M_stage2<double, (int)8>(long, T1 *)
     0.6            1,856          1    1,856.0    1,856.0     1,856     1,856          0.0  void scale_max<double, (int)32, (int)32, (int)1>(long, long, T1 *, T1 *, T1 *, int *)
     0.6            1,728          1    1,728.0    1,728.0     1,728     1,728          0.0  void scale_max<double, (int)32, (int)128, (int)0>(long, long, T1 *, T1 *, T1 *, int *)
     0.5            1,472          1    1,472.0    1,472.0     1,472     1,472          0.0  __pgi_dev_cumemset_4n
     0.5            1,440          1    1,440.0    1,440.0     1,440     1,440          0.0  void lacpy_kernel<double, double, (int)5, (int)3>(long, long, const T1 *, long, long, T2 *, long, l…
     0.4            1,248          1    1,248.0    1,248.0     1,248     1,248          0.0  xx_set_info_ker(int, int *, int)

Hope this helps,
Mat


Thanks a lot for such a detailed reply, Mat!

I just wanted to ask one more question: I have written my code in the following way, but without using the -stdpar flag. Below is the code snippet as well as the flags and environment I set for compilation and execution. Please let me know whether this can also be an alternative way to do the same thing you did.

The code:

!---------------------------------------
SUBROUTINE ddiag( Avec, row, col, Aval )

!input
integer, intent(in) :: row        !leading dimension (lda)
integer, intent(in) :: col        !order (n)

!output
real(8), intent(inout) :: Aval(:) !length = col
real(8), intent(inout) :: Avec(:) !1D, length = row*col, column-major

!local
real(8), parameter   :: eps = 2.2204460492503131d-16
character(len=1)     :: jobz, uplo
real(8)              :: diff, scal, tol
integer              :: lwork, liwork, info, i, j, ij, ji
real(8), allocatable :: work (:)
integer, allocatable :: iwork(:)

#ifdef _CUDA
real(8), device, allocatable :: dwork(:)
integer, device, allocatable :: diwork(:)

real(8), device, allocatable :: dA(:)
real(8), device, allocatable :: dW(:)

real(8), allocatable :: hwork (:)
integer, allocatable :: hiwork(:)
integer              :: istat

#endif

!sanity check(s)
if( row <= 0 )then
  write(6,'(//2x,a//)') 'ERROR: Insufficient row dimension in ddiag => STOP'
  call flush(6); stop
endif
if( col <= 0 )then
  write(6,'(//2x,a//)') 'ERROR: Insufficient column dimension in dqmatrix.diag.ddiag => STOP'
  call flush(6); stop
endif
if( row /= col )then
  write(6,'(//2x,a//)') 'ERROR: Matrix has to be square in ddiag => STOP'
  call flush(6); stop
endif
if( size(Avec) < row*col )then
  write(6,'(//2x,a//)') 'ERROR: Avec is too small in ddiag => STOP'
  call flush(6); stop
endif
if( size(Aval) < col )then
  write(6,'(//2x,a//)') 'ERROR: Aval is too small in ddiag => STOP'
  call flush(6); stop
endif

tol = 100.0d0 * eps * dble(col)
wtime = GetTime()
!$omp parallel do collapse(2) default(none) private(i,j,ij,ji,diff,scal) shared(row,col,Avec,tol)
do j=1,col
  do i=1,row
    ij = index(i,j,row)
    ji = index(j,i,row)
    diff = dabs( Avec(ij) - Avec(ji) )
    scal = dmax1( dabs(Avec(ij)), dabs(Avec(ji)), done )
    if( diff/scal > tol )then
      write (6,'(//2x,a//)') 'ERROR: The matrix is not symmetric in ddiag => STOP'
      call flush(6); stop
    endif
  enddo
enddo
!$omp end parallel do
time(14) = time(14) + GetTime() - wtime

jobz = 'V'
uplo = 'L'

#ifdef _CUDA
wtime = GetTime()
allocate(dwork(1), diwork(1))
allocate(dA(row*col), dW(col))
time(4) = time(4) + GetTime() - wtime

wtime = GetTime()
allocate(hwork(1), hiwork(1))
time(1) = time(1) + GetTime() - wtime

wtime = GetTime()
dA = Avec
dW = Aval
istat = cudaDeviceSynchronize()!NEW - only for pageable host and H2D transfer, not for D2H.
time(5) = time(5) + GetTime() - wtime

wtime = GetTime()
call dsyevd(jobz, uplo, col, dA, row, dW, dwork, -1, diwork, -1, info)
istat = cudaDeviceSynchronize()
time(15) = time(15) + GetTime() - wtime

if(info /= 0)then
  write (6,'(//2x,a,i0,a//)') 'ERROR: Workspace query failed in ddiag, info = ', info, ' => STOP'
  call flush(6); stop
endif

wtime = GetTime()
hwork  = dwork
hiwork = diwork
time(7) = time(7) + GetTime() - wtime

lwork  = int(hwork(1))
liwork = hiwork(1)

wtime = GetTime()
deallocate(dwork, diwork)
time(8) = time(8) + GetTime() - wtime

wtime = GetTime()
deallocate(hwork, hiwork)
time(2) = time(2) + GetTime() - wtime

wtime = GetTime()
allocate(dwork(lwork), diwork(liwork))
time(4) = time(4) + GetTime() - wtime


!diagonalize
wtime = GetTime()
call dsyevd(jobz, uplo, col, dA, row, dW, dwork, lwork, diwork, liwork, info)
istat = cudaDeviceSynchronize()
time(16) = time(16) + GetTime() - wtime

if(info /= 0)then
  write (6,'(//2x,a,i0,a//)') 'ERROR: DSYEVD failed in ddiag, info = ', info, ' => STOP'
  call flush(6); stop
endif

wtime = GetTime()
Avec = dA
Aval = dW
time(7) = time(7) + GetTime() - wtime

wtime = GetTime()
deallocate(dwork, diwork, dA, dW)
time(8) = time(8) + GetTime() - wtime

#else
wtime = GetTime()
allocate(work(1), iwork(1))
time(1) = time(1) + GetTime() - wtime

wtime = GetTime()
call dsyevd(jobz, uplo, col, Avec, row, Aval, work, -1, iwork, -1, info)
time(17) = time(17) + GetTime() - wtime

if (info /= 0) then
  write (6,'(//2x,a,i0,a//)') 'ERROR: Workspace query failed in ddiag, info = ', info, ' => STOP'
  call flush(6); stop
endif

lwork  = int(work(1))
liwork = iwork(1)

wtime = GetTime()
deallocate(work, iwork)
time(2) = time(2) + GetTime() - wtime

wtime = GetTime()
allocate(work(lwork), iwork(liwork))
time(1) = time(1) + GetTime() - wtime


!diagonalize
wtime = GetTime()
call dsyevd(jobz, uplo, col, Avec, row, Aval, work, lwork, iwork, liwork, info)
time(18) = time(18) + GetTime() - wtime

if (info /= 0) then
  write (6,'(//2x,a,i0,a//)') 'ERROR: DSYEVD failed in ddiag, info = ', info, ' => STOP'
  call flush(6); stop
endif

wtime = GetTime()
deallocate(work, iwork)
time(2) = time(2) + GetTime() - wtime

#endif

END SUBROUTINE

The snippets from the script:

#!/bin/bash
#SBATCH --job-name=diag
#SBATCH --partition=gpu
#SBATCH --exclusive
#SBATCH --nodelist=comp1
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=64
#SBATCH --time=0-08:00
#SBATCH --output=diag2.txt

#-------------------------------------------------------------------
# Select build/run mode:
#
#   gfortran      = CPU, GCC BLAS/LAPACK, OpenMP
#   ifort         = CPU, Intel MKL, OpenMP (discontinued from late 2024)
#   ifx           = CPU, Intel MKL, OpenMP
#   nvfortran_cpu = CPU, NVHPC NVPL BLAS/LAPACK, OpenMP
#   nvfortran_gpu = GPU (cuBLAS/NVLAmath) + CPU OpenMP
#-------------------------------------------------------------------

MODE=nvfortran_gpu

case "$MODE" in

nvfortran_gpu) # machine-specific, since DEVANA didn't need LIBRARY_PATH and linking (-L)
module purge
module load NVHPC/25.3-CUDA-12.8.0
export LD_LIBRARY_PATH=/home/soft/easybuild/software/NVHPC/25.3-CUDA-12.8.0/Linux_x86_64/25.3/math_libs/12.8/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
ulimit -s unlimited
export OMP_NUM_THREADS=64
export OMP_STACKSIZE=4000m
FC=nvfortran
FCFLAGS="-O3 -mp -gpu=cc80,nvlamath -tp=host -Mlarge_arrays"
CPPFLAGS="-D_CUDA" # Enable CUDA sections (#ifdef _CUDA)
LIBS="-L/home/soft/easybuild/software/NVHPC/25.3-CUDA-12.8.0/Linux_x86_64/25.3/math_libs/12.8/targets/x86_64-linux/lib -cuda -cudalib=cublas,nvlamath"
;;

*)
echo "ERROR: Unknown MODE='$MODE'" >&2
exit 1
;;
esac

#-------------------------------------------------------------------
# Compile and run
#-------------------------------------------------------------------

echo "Mode: $MODE"
echo "Compiler: $FC"
echo "FCFLAGS: $FCFLAGS"
echo "LIBS: $LIBS"
echo

echo "Compiling…"
$FC $FCFLAGS $CPPFLAGS \
    matrix.F90 diag.f90 \
    $LIBS -o main || { echo "Compilation failed!"; exit 1; }

echo "Running…"
./main

echo "Cleaning…"
rm -f main *.o *.mod

(The subroutine ddiag is a part of matrix.F90 module.)

And I forgot to mention, the following is declared at the start of the module (right before the ‘contains’ keyword):

#ifdef _CUDA
use cublas
use cudafor
#endif

I believe all you need to do here is add the nvhpc_nvlamath module to get the generic interface.

The example in section 6.6.2 uses:

!@cuf use nvhpc_nvlamath
!@cuf use cutensorex

The “!@cuf” sentinel gets removed when the “-cuda” flag is added to the command line, which is basically the same as your code’s use of the “_CUDA” macro:

#ifdef _CUDA
use nvhpc_nvlamath
use cutensorex
#endif

The example also uses the “managed” attribute for one of the arrays, but I don’t believe that’s necessary here since you manage the data movement yourself.
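For reference, a minimal sketch of what the managed-attribute variant could look like (variable names are illustrative, not taken from your code): with managed allocations the same arrays are accessible from both host and device, so the explicit dA/dW copies and the separate host/device workspace arrays in your version would not be needed.

```fortran
! Sketch only (assumes CUDA Fortran with -cuda and the generic dsyevd
! interface in scope). CUDA migrates managed data on demand, so the
! arrays can be filled on the host and passed straight to the solver.
real(8), managed, allocatable :: A(:), W(:), work(:)
integer, managed, allocatable :: iwork(:)
integer :: lwork, liwork, info, istat

allocate(A(n*n), W(n), work(1), iwork(1))
! ... fill A on the host (column-major, n x n) ...
call dsyevd('V', 'L', n, A, n, W, work, -1, iwork, -1, info)   ! workspace query
istat = cudaDeviceSynchronize()
lwork = int(work(1)); liwork = iwork(1)
deallocate(work, iwork); allocate(work(lwork), iwork(liwork))
call dsyevd('V', 'L', n, A, n, W, work, lwork, iwork, liwork, info)
istat = cudaDeviceSynchronize()
! ... W and A (eigenvectors) are now readable directly on the host ...
```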
