OpenACC with cuBLAS and cuSPARSE in Fortran code

Hi, I’m writing Fortran code that uses the cuBLAS and cuSPARSE libraries, and I need to perform the simple operation of multiplying each vector element by itself.
Since the Hadamard product is not supported in cuBLAS, I have to write the accelerator code using OpenACC. Suppose that the CPU vector X has the accelerator pointer XP; the following code gives me correct values:
.
.
.
i=cuda_memcpy_c2fort_real(X,XP,n*8,2)
!$acc parallel
X(:)=X(:)*X(:)
!$acc end parallel
i=cuda_memcpy_fort2c_real(XP,X,n*8,1)
.
.
.
Is it possible to avoid moving the vector’s values from the accelerator to the CPU, and instead perform the operation on vector X with OpenACC?

Hi emath,

Just to be clear, is “X” a device array created using cudaMalloc from CUDA C, or acc_malloc? If so, then yes: just add “deviceptr(X)” on the parallel directive to tell the compiler that X is a device array. Though “X” must be a dummy argument and can’t have the pointer, allocatable, or value attribute.

i=cuda_memcpy_c2fort_real(X,XP,n*8,2)
!$acc parallel deviceptr(X)
X(:)=X(:)*X(:)
!$acc end parallel
i=cuda_memcpy_fort2c_real(XP,X,n*8,1)
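
To illustrate the dummy-argument requirement, here is a minimal sketch (the subroutine name and caller fragment are illustrative, not from the original post). The routine itself is plain OpenACC; the caller must arrange for the actual argument’s base address to be a device address, for example by passing a CUDA Fortran “device” array:

      subroutine dev_square(x,n)
      integer n
      real*8 x(n)               ! plain dummy; its base address is on the device
!$acc parallel deviceptr(x)
      x(:)=x(:)*x(:)
!$acc end parallel
      end

! illustrative caller fragment:
!     real(8), device :: xD(n)
!     call dev_square(xD,n)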

If you’re using a CUDA Fortran “device” array, then you shouldn’t need to do anything. The compiler knows to use the device array in this case.
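
For example, a minimal sketch (program and names are illustrative; compile with -Mcuda -acc):

      program dev_demo
      use cudafor
      implicit none
      integer, parameter :: n = 1000
      real(8) :: x(n)
      real(8), device :: xD(n)   ! CUDA Fortran device array
      integer :: i
      do i=1,n
        x(i)=dble(i)
      enddo
      xD=x                       ! host-to-device copy via assignment
!$acc kernels                    ! no deviceptr clause needed for xD
      xD(:)=xD(:)*xD(:)
!$acc end kernels
      x=xD                       ! device-to-host copy
      print *,x(10)              ! expect 100.0
      end program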

Alternatively, you can use OpenACC to manage your device data. To pass in the device array to a CUDA routine, use the “host_data” directive:

!$acc host_data use_device(X)
i=cuda_memcpy_c2fort_real(X,XP,n*8,2) 
!$acc end host_data
  • Mat

Thank you, Mat, for your reply.
A demo version of the code, with the cuSPARSE calls removed, is:

        use cublas
        use cudafor
        use openacc
        implicit real*8 (a-h,o-z)
        parameter (n=1000)
        real*8 x(n)
        integer*8 xp
 
        do i=1,n
          x(i)=dfloat(i)
        enddo
        is=cuda_malloc(xp,n*8)
        is=cuda_memcpy_fort2c_real(xp,x,n*8,1)
        call cublas_dscal(n,5.0d0,xp,1)
        print*,cublas_dnrm2(n,xp,1)
C -------  OpenACC  Region -----------------------------
        is=cuda_memcpy_c2fort_real(x,xp,n*8,2)
!$acc parallel deviceptr(x)          
          x(:)=x(:)*x(:)
!$acc end parallel 
        is=cuda_memcpy_fort2c_real(xp,x,n*8,1)
C-----    End for OpenACC  Region  --------------------
        call cublas_dscal(n,3.0d0,xp,1)
        is=cuda_memcpy_c2fort_real(x,xp,n*8,2)
        stop
        end

As you advised, I added the deviceptr clause to the acc parallel directive, but I get the execution error

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

The compile options for the 16.1 release are
pgf90 -mcmodel=medium -mp -O4 -acc -Mlarge_arrays -D_CUDAFOR -Minfo=accel -Mcuda=7.0,cc2x
with libraries
-L/opt/pgi/linux86-64/2016/cuda/7.0/lib64 -lcudart -lcublas -lcusparse

Without the deviceptr clause I get the correct values, and I do not know why.
Do the two cuda_memcpy calls in the OpenACC region cause an execution delay? If so, is it possible to avoid them? Or is this the optimal way to use OpenACC with cuBLAS and cuSPARSE at the same time?

Hi emath,

This is because “x” isn’t a device pointer: “xp” is the device pointer, while “x” is a host array whose data you’re copying back and forth to the device copy. In this case you’d want to use the OpenACC “acc_map_data” routine to associate x with xp on the device.

Though using straight CUDA Fortran instead of these cuBLAS Fortran interface routines is going to be much simpler and more portable. The following code can be used for both GPUs and CPUs depending upon which flags you use. It could be cleaned up a bit so you don’t have the extra copies in the CPU version.

% cat cblas.F90

#ifdef _CUDA
         use cublas
         use cudafor
         use openacc
#define DEVICE ,device
#else
#define DEVICE
#endif
         implicit real(8) (a-h,o-z)
         parameter (n=1000)
         real(8) :: x(n)
         real(8) DEVICE :: xD(n)

         do i=1,n
           x(i)=dfloat(i)
         enddo
         xD=x
         call dscal(n,5.0d0,xD,1)
         print*,dnrm2(n,xD,1)
!$acc parallel
           xD(:)=xD(:)*xD(:)
!$acc end parallel
         call dscal(n,3.0d0,xD,1)
         x=xD
         print *, x(100)
         stop
         end

% pgfortran cblas.F90 -lblas -llapack -o cpu.out
% pgfortran cblas.F90 -Mcuda -acc -Mcudalib=cublas -o gpu.out
% ./cpu.out
    91355.55538663207
    750000.0000000000
Warning: ieee_inexact is signaling
FORTRAN STOP
% ./gpu.out
    91355.55538663207
    750000.0000000000
Warning: ieee_inexact is signaling
FORTRAN STOP

Hope this helps,
Mat

Mat, you are right about using straight CUDA Fortran in the case of the BLAS routines. But I also have to perform sparse matrix-vector multiplications; in other words, I also have to use the cuSPARSE library. This is the reason for using the cuBLAS and cuSPARSE Fortran interface.
It looks like the acc_map_data subroutine will give me the ability to use OpenACC directives. I have tried to use it in the following demo code:

        use cublas
        use cudafor
        use openacc
        implicit real*8 (a-h,o-z)
        parameter (n=1000)
        real*8 :: x(n)
        integer*8 xp
 
        do i=1,n
          x(i)=dfloat(i)
        enddo
        print*,dnrm2(n,x,1)
        is=cuda_malloc(xp,n*8)
        call acc_map_data(x,xp,n*8)
        is=cuda_memcpy_fort2c_real(xp,x,n*8,1)
        call cublas_dscal(n,5.0d0,xp,1)
        print*,cublas_dnrm2(n,xp,1)
CCCCCCC  OpenACC  Region  CCCCCCCCCCCCCCCCCCCCCC
c        is=cuda_memcpy_c2fort_real(x,xp,n*8,2)
!$acc parallel 
          x(:)=x(:)*x(:)
!$acc end parallel 
c       is= cuda_memcpy_fort2c_real(xp,x,n*8,1)
CCCCCCC  End for OpenACC  Region  CCCCCCCCCCCCCCCCCCCCCC
        call cublas_dscal(n,3.0d0,xp,1)
        print*,cublas_dnrm2(n,xp,1)
        is=cuda_memcpy_c2fort_real(x,xp,n*8,2)
        print*,x(100)
        stop
        end

but I get the compilation error

PGF90-S-0155-Could not resolve generic procedure acc_map_data

Is a special compiler option needed, beyond -acc, in order to include this API subroutine? Do you think this is the right use of it, and can I then remove the two cuda_memcpy calls from the OpenACC region?

My fault: acc_map_data is C only. Let’s try a different tack and use OpenACC to manage your data. Using the “host_data” directive, we can pass the device pointer to the cuBLAS routines. I’m still going to use the generic interfaces instead of “cublas_…”, since I would need to write my own interfaces for those.

On a side note, we do ship modules for cuSPARSE that you can use. The modules are included with the CUDA installation that ships with the compilers. The interface changed from release to release, but the compiler will use the correct one depending upon which CUDA version you’re using. I’d recommend using CUDA 7.5 (-Mcuda=cuda7.5) or later, since there were some bugs in cuSPARSE that have since been fixed.
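
For example, here is a minimal sketch of a sparse matrix-vector product (y = A*x) through that module. It is illustrative only: it uses the cusparseDcsrmv interface as it existed in the CUDA 7.x-era libraries, with a small 3x3 diagonal test matrix in CSR format; the file name and data are made up.

% cat sparse_demo.f90
         program sparse_demo
         use cudafor
         use cusparse
         implicit none
         integer, parameter :: m = 3, nnz = 3
         type(cusparseHandle)   :: h
         type(cusparseMatDescr) :: descr
         real(8), device :: csrVal(nnz), x(m), y(m)
         integer, device :: csrRowPtr(m+1), csrColInd(nnz)
         real(8) :: yh(m)
         integer :: istat

         ! A = diag(1,2,3) in CSR form, one-based indexing
         csrVal    = (/ 1.0d0, 2.0d0, 3.0d0 /)
         csrRowPtr = (/ 1, 2, 3, 4 /)
         csrColInd = (/ 1, 2, 3 /)
         x = 1.0d0

         istat = cusparseCreate(h)
         istat = cusparseCreateMatDescr(descr)
         istat = cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL)
         istat = cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE)

         ! y = 1.0*A*x + 0.0*y
         istat = cusparseDcsrmv(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                 m, m, nnz, 1.0d0, descr, csrVal, csrRowPtr,         &
                 csrColInd, x, 0.0d0, y)

         yh = y
         print *, yh              ! expect 1.0 2.0 3.0
         end program

% pgfortran -Mcuda=cuda7.5 -Mcudalib=cusparse sparse_demo.f90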

Also, for array syntax, use “kernels” instead of “parallel”, since “kernels” allows the compiler to discover parallelism and so will parallelize the implicit loop. With “parallel”, you’re telling the compiler which loops to parallelize; since there’s no explicit loop here, no parallelism is used.
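
As a side-by-side sketch of the difference (a minimal illustrative program, compiled with -acc; the full example below uses the same pattern):

      program pk_demo
      implicit none
      integer, parameter :: n = 1000
      real(8) :: x(n)
      integer :: i
!$acc data create(x)
! "parallel loop" parallelizes the loop you mark explicitly:
!$acc parallel loop
      do i=1,n
        x(i)=dble(i)
      enddo
! "kernels" discovers the parallelism in the implicit loop of the
! array assignment:
!$acc kernels
      x(:)=x(:)*x(:)
!$acc end kernels
!$acc update host(x)
!$acc end data
      print *,x(10)             ! expect 100.0
      end program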

% cat test.f
#ifdef _CUDA
         use cublas
         use cudafor
         use openacc
#endif
         implicit real*8 (a-h,o-z)
         parameter (n=1000)
         real*8 :: x(n)

!$acc enter data create(x)
!$acc kernels loop
         do i=1,n
           x(i)=dfloat(i)
         enddo
!$acc host_data use_device(x)
         print*,dnrm2(n,x,1)
         call dscal(n,5.0d0,x,1)
         print*,dnrm2(n,x,1)
!$acc end host_data

!$acc kernels present(x)
           x(:)=x(:)*x(:)
!$acc end kernels
!$acc host_data use_device(x)
         call dscal(n,3.0d0,x,1)
         print*,dnrm2(n,x,1)
!$acc end host_data
!$acc update host (x)
         print*,x(100)
!$acc exit data delete(x)
         stop
         end

Compile for the host:

% pgf90 test.f -Mpreprocess -lblas -llapack; a.out
    18271.11107732642
    91355.55538663207
    1061986052.168206
    750000.0000000000
Warning: ieee_inexact is signaling
FORTRAN STOP

Compile for the GPU:

% pgf90 -acc -Mcuda test.f -Mcudalib=cublas -Mpreprocess ; a.out
    18271.11107732642
    91355.55538663207
    1061986052.168206
    750000.0000000000
Warning: ieee_inexact is signaling
FORTRAN STOP

Hope this helps,
Mat

It would be a great help for scientists if PGI created a cuSPARSE module like the one for cuBLAS, giving everyone the opportunity to write code for CPU/GPU applications in Fortran.
Thank you again for your useful comments.

-Manolis

Hi Manolis,

It would be a great help for scientists if PGI created a cuSPARSE module like the one for cuBLAS

As I note above, we do ship a cuSPARSE module. It’s located with the CUDA libraries and tools that accompany the PGI compilers. Ideally we would put it in the PGI include directory alongside cuBLAS, but since the cuSPARSE interfaces have changed between CUDA releases, we had to ship different modules depending on the version of cuSPARSE.

  • Mat