cuBLAS in Fortran OpenMP offloading with Managed Memory

I’m testing numerical Fortran programs on the GPU using OpenACC and OpenMP target offloading, with HPC SDK 24.1 on an x86 workstation with an A100. Because my programs use BLAS and LAPACK routines, I want to use cuBLAS and cuSOLVER with OpenACC and OpenMP target offloading.
(To keep the option of using various kinds of accelerators, I would prefer OpenMP target offloading where possible.)

Focusing on cuBLAS, I have partially succeeded.

examples:

! OpenACC with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
!$acc data copyin(a(1:n,1:n),b(1:n,1:n)) copy(c(1:n,1:n))
!$acc host_data use_device(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc end data

! OpenMP target offloading with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
!$omp target data map(to:a(1:n,1:n),b(1:n,1:n)) map(tofrom:c(1:n,1:n))
!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp end target data

Because my target code has somewhat complex data structures, I also want to use Managed Memory to ease the memory management (compiling with “managed”, e.g. “-gpu=cc80,managed”).

In OpenACC, it was easy.

example:

!$acc host_data use_device(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc wait

or

call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc wait

(Both versions produced correct results. I have not compared performance yet.)
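
For reference, a compile line for these OpenACC variants would presumably look something like the following (an assumption on my part; the source file name is just a placeholder):

$ nvfortran -acc=gpu -gpu=cc80,managed -cudalib=cublas cublas_acc_m.f90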

However, OpenMP offloading does not produce correct results.
example:

!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp taskwait
t2 = omp_get_wtime()
(the resulting c is all 0.0)

I thought some kind of “_device_ptr/_device_addr” clause was required and tried that, but I obtained either a compiler error or an incorrect result (0.0).

Does anyone know how to use it correctly?
(Do I need some bothersome interface implementations?)

The “external” Fortran bindings to cuBLAS need to be compiled first, according to the information here: cuBLAS

On my system, the bindings can be found at /opt/nvidia/hpc_sdk/Linux_x86_64/2024/math_libs/12.3/src/fortran.c
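
As a rough sketch of how that build step might look (untested on my side; the exact name-mangling macro and any extra include paths are described in the cuBLAS documentation and may differ between releases, and my_code.f90 is just a placeholder):

$ nvc -c -DCUBLAS_GFORTRAN /opt/nvidia/hpc_sdk/Linux_x86_64/2024/math_libs/12.3/src/fortran.c
$ nvfortran -mp=gpu -cudalib=cublas my_code.f90 fortran.o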

It is also possible to use the cuBLAS legacy API (in C) directly, but then you need to write an interface block yourself:

program test_gemm
implicit none

! OpenMP target offloading with cuBLAS
real*8, allocatable, dimension(:,:) :: a, b, c
integer :: n 

interface
   subroutine cublasDgemm(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc) &
         bind(c,name='cublasDgemm')
      use iso_c_binding
      integer(c_int), value :: m, n, k, lda, ldb, ldc
      real(c_double), intent(in), device :: a(lda,*), b(ldb,*)
      real(c_double), intent(inout), device :: c(ldc,*)
      real(c_double), value :: alpha, beta
      character(kind=c_char), value :: transa, transb
   end subroutine
   subroutine cublasInit() bind(c,name="cublasInit")
   end subroutine
   subroutine cublasShutdown() bind(c,name="cublasShutdown")
   end subroutine
end interface

call cublasInit

n = 3
allocate(a(n,n),b(n,n),c(n,n))

call random_number(a)
call random_number(b)
call random_number(c)

!$omp target data map(to:a,b) map(tofrom:c)

call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

!$omp end target data

print *, maxval(abs(c - matmul(a,b)))

call cublasShutdown

end program
$ nvfortran -cuda -mp=gpu -cudalib=cublas test_gemm.f90
$ ./a.out
    0.000000000000000     

Note: this is just what I discovered in my own experimentation with the NVHPC SDK.

Can you try with “use_device_ptr”? I talked with engineering and we might have an issue with “addr”.

Though since you’re using “managed” and the arrays are allocatable, the “use_device” region should be optional, given that the unified address is accessible on both host and device.

@ivan.pribec1, fyi we provide a Fortran interface module for cuBlas (i.e. ‘use cublas’) as well as most of the CUDA math libraries. The interfaces are documented at: NVIDIA Fortran CUDA Library Interfaces Version 24.3 for ARM, OpenPower, x86. You’re welcome to use your own, but it’s not necessary.
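
For concreteness, a minimal sketch of that combination (the ‘cublas’ module together with managed memory, with no mapping or use_device region at all) might look as follows; this is untested and only illustrates the suggestion above:

use cublas                                ! NVHPC-provided cuBLAS interface module
real*8, allocatable, dimension(:,:) :: a, b, c
integer :: ierr, n
n = 1024
allocate(a(n,n), b(n,n), c(n,n))          ! managed allocations under -gpu=managed
call random_number(a)
call random_number(b)
c = 0.0d0
ierr = cublasInit()
! with managed memory the same addresses are valid on host and device,
! so the call is made directly, without a target data / use_device region
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
ierr = cublasShutdown()
print *, maxval(abs(c - matmul(a,b)))

compiled with something like: nvfortran -mp=gpu -gpu=cc80,managed -cudalib=cublas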


@ivan.pribec1
Thanks for your advice. I succeeded in getting the right result by using your interface.

@MatColgrove
Thanks for your advice. I have tried “!$omp target data use_device_ptr(a,b,c)”, but I got the wrong result (0.0).

After that, I noticed that I hadn’t used cublasInit/cublasShutdown in my program.
By inserting them, I succeeded in getting the right result.

This is the main part of the source code, and the compile command, with which I got the right result.

use cublas
real*8, allocatable, dimension(:,:) :: a, b, c
allocate(a(n,n), b(n,n), c(n,n))
! set data (omit)
ierr = cublasInit()
!$omp target data use_device_ptr(a,b,c)
call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
! output c (omit)
ierr = cublasShutdown()
deallocate(a, b, c)

$ nvfortran -cuda -mp=gpu -gpu=cc80,managed -cudalib=cublas -g -o cublas_omp_mf.out cublas_omp_m.f90

I didn’t use the cublasInit/cublasShutdown functions in another CUDA Fortran + cuBLAS program of mine, yet I got the right result there. That is why I forgot to use them here and didn’t suspect their absence as the reason I couldn’t get the right result.

Thank you very much!

I read the cuBLAS documentation again.
It says that using cublasInit is recommended but not required.
Therefore, the behavior of my code (not getting the right result without cublasInit) seems strange.
Is this a bug in cuBLAS with OpenMP offloading?

I’m not sure and would need a reproducing example to investigate.

My best guess is that the implicit cuBLAS initialization may not be inheriting the CUDA context created by the OpenMP runtime and is instead using its own.

Thank you for your comment.
I agree with your assessment.

I hope anyone trying to do the same will notice this thread and avoid the problem!
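
For anyone who finds this thread later, here is a consolidated, self-contained version of the working code above (a sketch assembled from the snippets in this thread; the matrix size and the verification line are illustrative):

program cublas_omp_m
   use cublas                                 ! NVHPC cuBLAS interface module
   implicit none
   integer, parameter :: n = 512
   real*8, allocatable, dimension(:,:) :: a, b, c
   integer :: ierr

   allocate(a(n,n), b(n,n), c(n,n))           ! managed allocations with -gpu=managed
   call random_number(a)
   call random_number(b)
   c = 0.0d0

   ierr = cublasInit()                        ! needed here, as discussed above

   !$omp target data use_device_ptr(a,b,c)
   call cublasDgemm('N','N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
   !$omp end target data

   print *, maxval(abs(c - matmul(a,b)))      ! should print ~0

   ierr = cublasShutdown()
   deallocate(a, b, c)
end program

$ nvfortran -cuda -mp=gpu -gpu=cc80,managed -cudalib=cublas -o cublas_omp_mf.out cublas_omp_m.f90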