I'm testing numerical programs on the GPU in Fortran, using OpenACC and OpenMP target offloading. I'm using HPC SDK 24.1 on an x86 workstation with an A100. Because my programs use BLAS and LAPACK routines, I want to call cuBLAS and cuSOLVER together with OpenACC and OpenMP target offloading.
(To keep support for various kinds of accelerators in mind, I would like to use OpenMP target offloading if possible.)
Focusing on cuBLAS first, I have partially succeeded.
examples:
! OpenACC with cuBLAS
use cublas  ! NVHPC module providing Fortran interfaces to cublasDgemm etc.
real*8, allocatable, dimension(:,:) :: a, b, c
!$acc data copyin(a(1:n,1:n),b(1:n,1:n)) copy(c(1:n,1:n))
!$acc host_data use_device(a,b,c)
call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc end data
! OpenMP target offloading with cuBLAS
use cublas  ! NVHPC module providing Fortran interfaces to cublasDgemm etc.
real*8, allocatable, dimension(:,:) :: a, b, c
!$omp target data map(to:a(1:n,1:n),b(1:n,1:n)) map(tofrom:c(1:n,1:n))
!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp end target data
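For reference, these snippets are compiled along these lines (a sketch; assuming nvfortran from HPC SDK 24.1, with -cudalib=cublas to link cuBLAS; the file names are placeholders):

```shell
# OpenACC version
nvfortran -acc -gpu=cc80 -cudalib=cublas -o gemm_acc gemm_acc.f90
# OpenMP target offloading version
nvfortran -mp=gpu -gpu=cc80 -cudalib=cublas -o gemm_omp gemm_omp.f90
```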
Because my target code has a somewhat complex data structure, I also want to use Managed Memory to ease the memory management (compiling with "managed", e.g. "-gpu=cc80,managed").
In OpenACC, this was easy.
example:
!$acc host_data use_device(a,b,c)
call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc end host_data
!$acc wait
or
call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$acc wait
(Both versions give correct results. I have not compared their performance yet.)
However, OpenMP offloading does not produce correct results.
example:
!$omp target data use_device_addr(a,b,c)
call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
!$omp end target data
!$omp taskwait
t2 = omp_get_wtime()
(The obtained result c is all 0.0.)
I thought some kind of "_device_ptr/_device_addr" clause might be required and tried that, but I got either a compiler error or the same incorrect result (all 0.0).
Does anyone know how to use this correctly?
(Do I need to write some additional interface declarations?)
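For completeness, here is a self-contained version of the failing managed-memory OpenMP case, assembled from the snippets above (a sketch: the `use cublas` module, the problem size n, and the initialization are my additions; compiled with "nvfortran -mp=gpu -gpu=cc80,managed -cudalib=cublas"):

```fortran
program omp_managed_gemm
  use omp_lib
  use cublas            ! NVHPC Fortran interfaces to cuBLAS
  implicit none
  integer, parameter :: n = 512
  real*8, allocatable, dimension(:,:) :: a, b, c
  real*8 :: t1, t2

  allocate(a(n,n), b(n,n), c(n,n))   ! managed memory via -gpu=managed
  a = 1.0d0; b = 1.0d0; c = 0.0d0

  t1 = omp_get_wtime()
  !$omp target data use_device_addr(a,b,c)
  call cublasDgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  !$omp end target data
  !$omp taskwait
  t2 = omp_get_wtime()

  ! each element of c should be n (= 512.0), but I observe 0.0
  print *, 'c(1,1) =', c(1,1), '  time =', t2 - t1
end program omp_managed_gemm
```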