cuBLAS in Fortran OpenMP offloading with Managed Memory

I’m testing numerical Fortran programs on the GPU using OpenACC and OpenMP target offloading, with HPC SDK 24.1 on an x86 workstation with an A100. Because my programs use BLAS and LAPACK routines, I want to use cuBLAS and cuSOLVER with OpenACC and OpenMP target offloading.
(To keep the option of using various kinds of accelerators, I would prefer OpenMP target offloading where possible.)

Focusing on cuBLAS, I have partially succeeded.

examples:

! OpenACC with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
!$acc data copyin(a(1:n,1:n),b(1:n,1:n)) copy(c(1:n,1:n))
!$acc host_data use_device(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc end data

! OpenMP target offloading with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
!$omp target data map(to:a(1:n,1:n),b(1:n,1:n)) map(tofrom:c(1:n,1:n))
!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp end target data

Because my target code has somewhat complex data structures, I also want to use Managed Memory to ease the memory management (compiling with “managed”, e.g. “-gpu=cc80,managed”).

In OpenACC, it was easy.

example:

!$acc host_data use_device(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc wait

or

call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc wait

(Both versions produced correct results. I have not compared performance yet.)
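
For reference, a compile line for these OpenACC variants would presumably look something like the following (an assumption on my part; the source file name is just a placeholder):

$ nvfortran -acc=gpu -gpu=cc80,managed -cudalib=cublas cublas_acc_m.f90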

However, OpenMP offloading does not produce correct results.
example:

!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp taskwait
t2 = omp_get_wtime()
(the resulting c is all 0.0)

I thought some kind of “_device_ptr/_device_addr” clause was required and tried that, but I obtained either a compiler error or an incorrect result (0.0).

Does anyone know how to use it correctly?
(Do I need some bothersome interface implementations?)

The “external” Fortran bindings to cuBLAS need to be compiled first, according to the information here: cuBLAS

On my system, the bindings can be found at /opt/nvidia/hpc_sdk/Linux_x86_64/2024/math_libs/12.3/src/fortran.c
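
As a rough sketch of how that build step might look (untested on my side; the exact name-mangling macro and any extra include paths are described in the cuBLAS documentation and may differ between releases, and my_code.f90 is just a placeholder):

$ nvc -c -DCUBLAS_GFORTRAN /opt/nvidia/hpc_sdk/Linux_x86_64/2024/math_libs/12.3/src/fortran.c
$ nvfortran -mp=gpu -cudalib=cublas my_code.f90 fortran.o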

It is also possible to use the cuBLAS legacy API (in C) directly, but then you need to write an interface block yourself:

program test_gemm
implicit none

! OpenMP target offloading with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
integer :: n 

interface
   subroutine cublasDgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc) &
         bind(c,name='cublasDgemm')
      use iso_c_binding
      integer(c_int), value :: m, n, k, lda, ldb, ldc
      real(c_double), intent(in), device :: a(lda,*), b(ldb,*)
      real(c_double), intent(inout), device :: c(ldc,*)
      real(c_double), value :: alpha, beta
      character(kind=c_char), value :: transa, transb
   end subroutine
   subroutine cublasInit() bind(c,name="cublasInit")
   end subroutine
   subroutine cublasShutdown() bind(c,name="cublasShutdown")
   end subroutine
end interface

call cublasInit

n = 3
allocate(a(n,n),b(n,n),c(n,n))

call random_number(a)
call random_number(b)
call random_number(c)

!$omp target data map(to:a,b) map(tofrom:c)

call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

!$omp end target data

print *, maxval(abs(c - matmul(a,b)))

call cublasShutdown

end program
$ nvfortran -cuda -mp=gpu -cudalib=cublas test_gemm.f90
$ ./a.out
    0.000000000000000     

Note: this is just what I discovered in my own experimentation with the NVHPC SDK.

Can you try with “use_device_ptr”? I talked with engineering and we might have an issue with “addr”.

Though since you’re using “managed” and the arrays are allocatable, the “use_device” region should be optional, given that the unified address is accessible on both host and device.

@ivan.pribec1, fyi we provide a Fortran interface module for cuBlas (i.e. ‘use cublas’) as well as most of the CUDA math libraries. The interfaces are documented at: NVIDIA Fortran CUDA Library Interfaces Version 24.3 for ARM, OpenPower, x86. You’re welcome to use your own, but it’s not necessary.
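
For concreteness, a minimal sketch of that combination (the ‘cublas’ module together with managed memory, with no mapping or use_device region at all) might look as follows; this is untested and only illustrates the suggestion above:

use cublas                                ! NVHPC-provided cuBLAS interface module
real*8, allocatable, dimension(:,:) :: a, b, c
integer :: ierr, n
n = 1024
allocate(a(n,n), b(n,n), c(n,n))          ! managed allocations under -gpu=managed
call random_number(a)
call random_number(b)
c = 0.0d0
ierr = cublasInit()
! with managed memory the same addresses are valid on host and device,
! so the call is made directly, without a target data / use_device region
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
ierr = cublasShutdown()
print *, maxval(abs(c - matmul(a,b)))

compiled with something like: nvfortran -mp=gpu -gpu=cc80,managed -cudalib=cublas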


@ivan.pribec1
Thanks for your advice. I succeeded in getting the right result by using your interface.

@MatColgrove
Thanks for your advice. I have tried “!$omp target data use_device_ptr(a,b,c)”, but I got the wrong result (0.0).

After that, I noticed that I hadn’t used cublasInit/cublasShutdown in my program.
By inserting them, I succeeded in getting the right result.

This is the main part of the source code, and the compile command, with which I got the right result.

use cublas
real*8, allocatable, dimension(:,:) :: a, b, c
allocate(a(n,n), b(n,n), c(n,n))
! set data (omit)
ierr = cublasInit()
!$omp target data use_device_ptr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
! output c (omit)
ierr = cublasShutdown()
deallocate(a, b, c)

$ nvfortran -cuda -mp=gpu -gpu=cc80,managed -cudalib=cublas -g -o cublas_omp_mf.out cublas_omp_m.f90

I didn’t use the cublasInit/cublasShutdown functions in another CUDA Fortran + cuBLAS program of mine, yet I got the right result there. That is why I forgot to use them here and didn’t suspect their absence as the reason I couldn’t get the right result.

Thank you very much!

I read the cuBLAS documentation again.
It says that using cublasInit is recommended but not required.
Therefore, the behavior of my code (not getting the right result without cublasInit) seems strange.
Is this a bug in cuBLAS with OpenMP offloading?

I’m not sure and would need a reproducing example to investigate.

My best guess is that the implicit cuBLAS initialization may not be inheriting the CUDA context created by the OpenMP runtime and is instead using its own.

Thank you for your comment.
I agree with your assessment.

I hope anyone trying to do the same will notice this thread and avoid the problem!
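
For anyone who finds this thread later, here is a consolidated, self-contained version of the working code above (a sketch assembled from the snippets in this thread; the matrix size and the verification line are illustrative):

program cublas_omp_m
   use cublas                                 ! NVHPC cuBLAS interface module
   implicit none
   integer, parameter :: n = 512
   real*8, allocatable, dimension(:,:) :: a, b, c
   integer :: ierr

   allocate(a(n,n), b(n,n), c(n,n))           ! managed allocations with -gpu=managed
   call random_number(a)
   call random_number(b)
   c = 0.0d0

   ierr = cublasInit()                        ! needed here, as discussed above

   !$omp target data use_device_ptr(a,b,c)
   call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
   !$omp end target data

   print *, maxval(abs(c - matmul(a,b)))      ! should print ~0

   ierr = cublasShutdown()
   deallocate(a, b, c)
end program

$ nvfortran -cuda -mp=gpu -gpu=cc80,managed -cudalib=cublas -o cublas_omp_mf.out cublas_omp_m.f90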