scalar operation on device

Hi there,

I have started using CUDA Fortran for GPU programming and got stuck. In a simple program I have the following sequence of operations:

a_1) y=cusparse matrix M by vector y
a_2) x=cusparse matrix M by vector x
b_1) i=cublas dot product of y
b_2) j=cublas dot product of x
c) k=i/j

The program looks like this:

Program Test
  use cusparse
  use cublas
  use cudafor
  Implicit None
  type(cusparseHandle) :: handle
  type(cublashandle) :: handleblas
  type(cusparseMatDescr) :: descrA
  Integer, allocatable, dimension(:) :: colindex, rowindex
  real, allocatable, dimension(:) :: values, x,y
  real,device,allocatable,dimension(:) :: dvalues, dx0,dx1, dy0,dy1
  Integer, device, allocatable, dimension(:) :: dcolindex, drowindex
  real,device :: alpha, beta, di, dj, dk
  integer :: status
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !! |1.5; 0.0; 3.5 |
  !! |0.0; 1.3; 2.1 |
  !! |3.3; 0.0; 0.0 |
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  Allocate(values(5),colIndex(5),rowindex(4),x(3),y(3))
  Allocate(dvalues(5),dcolIndex(5),drowindex(4),dx0(3),dy0(3),dx1(3),dy1(3))
  values=(/1.5,3.5,1.3,2.1,3.3/)
  colindex=(/1,3,2,3,1/)
  rowindex=(/1,3,5,6/)
  x=1.0;y=1.0
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  status=cusparseCreate(handle)
  status=cublasCreate(handleblas)
  status = cublasSetPointerMode(handleblas, CUBLAS_POINTER_MODE_DEVICE)
  status=cusparseCreateMatDescr(descrA)
  status = cusparseSetMatType(descrA,0) !!general matrix
  status = cusparseSetMatIndexBase(descrA,1) !!indexing starts with 1
  status = cusparseSetMatDiagType(descrA,0) !!diagonal non-unity
  status = cusparseSetMatFillMode(descrA,0)
  !!
  status=cublasinit()
  dvalues=values
  drowindex=rowindex
  dcolindex=colindex
  dx0=x
  dy0=y
  !!multiply
  status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
       alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
       & csrColInd=dcolindex, x=dx0, beta=0.0, y=dx1)
  status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
       alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
       & csrColInd=dcolindex, x=dy0, beta=0.0, y=dy1)
  status=cublasSdot_v2(handleblas,3,dx1,1,dx1,1,di)
  status=cublasSdot_v2(handleblas,3,dy1,1,dy1,1,dj)
  dk=di/dj
End Program Test

I got steps a and b running (that is, the whole program except the last line), but not step c, which is the last line. di, dj and dk are all on the device. Compiling gives the error “More than one device-resident object in assignment”. Putting the last line in an OpenACC region (!$acc parallel present(di,dj,dk)) increased the runtime for a larger example from 10^(-5) seconds to 10^(-2) seconds.

What am I doing wrong here??

Thanks a lot

Hi vb86,

I got steps a and b running [...] but not step c, which is the last line. di, dj and dk are all on the device. Compiling gives the error “More than one device-resident object in assignment”

The error is correct: you can’t perform operations on device data when executing on the host. You need to either copy the values back to the host or perform the operation on the device. For the latter, you’ll need to either write a device kernel or put the operation in an OpenACC compute region.
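As a rough illustration of the two routes mentioned above, here is a minimal sketch. The module name scalar_ops, the kernel name divide_scalars, the host variables hi, hj, hk and the test values 6.0 and 3.0 are made up for this example and are not from the original program. It has not been run against the poster's setup, so treat it as a starting point only.

```fortran
! Sketch only: divide two device scalars, either via the host or in a kernel.
module scalar_ops
  use cudafor
  implicit none
contains
  ! Hypothetical one-thread kernel. Dummy arguments of a global subroutine
  ! that lack the "value" attribute are assumed to reside in device memory.
  attributes(global) subroutine divide_scalars(k, i, j)
    real :: k, i, j
    k = i / j
  end subroutine divide_scalars
end module scalar_ops

program divide_demo
  use cudafor
  use scalar_ops
  implicit none
  real, device :: di, dj, dk
  real :: hi, hj, hk
  integer :: istat

  ! Made-up test values; each assignment is a host-to-device copy.
  di = 6.0
  dj = 3.0

  ! Route 1: copy the operands to the host, divide there, copy the result back.
  ! Every statement has at most one device-resident object, so it compiles.
  hi = di
  hj = dj
  hk = hi / hj
  dk = hk

  ! Route 2: do the division on the device with a single-thread kernel launch.
  call divide_scalars<<<1,1>>>(dk, di, dj)
  istat = cudaDeviceSynchronize()

  ! Copy the result back only to inspect it (6.0 / 3.0 = 2.0).
  hk = dk
  print *, 'dk =', hk
end program divide_demo
```

Route 1 costs three small transfers per division, while route 2 keeps everything on the device at the price of one kernel launch. Which is cheaper in a real solver loop would need to be measured.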

Putting the last line in an OpenACC region (!$acc parallel present(di,dj,dk)) increased the runtime for a larger example from 10^(-5) seconds to 10^(-2) seconds.

Use “!$acc kernels” when there’s an implicit array-syntax loop. One major difference between “kernels” and “parallel” is that with “kernels” the compiler finds and exploits the available parallelism, while with “parallel” you’re telling the compiler where the parallelism is. Since there’s no explicit loop, there’s no place to tell the compiler which loop to parallelize, and hence you’re running this sequentially on the device.

!$acc kernels
dk=di/dj 
!$acc end kernels

Note that using “present” isn’t correct here since there’s no host copy (the compiler is ignoring it). “deviceptr” is closer but isn’t necessary, since the compiler knows that these variables have the device attribute and so does the correct thing.
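To try the “kernels” suggestion in isolation, a stripped-down test program along these lines might help. The program name, the host variable hk and the test values are invented for the example. The compile line in the comment assumes a recent NVIDIA HPC SDK compiler, with older PGI releases using pgfortran -Mcuda -acc instead.

```fortran
! Sketch only: OpenACC kernels region operating on CUDA Fortran device scalars.
! Assumed build command: nvfortran -cuda -acc kernels_scalar_demo.cuf
program kernels_scalar_demo
  use cudafor
  implicit none
  real, device :: di, dj, dk
  real :: hk

  ! Made-up test values, copied host-to-device by assignment.
  di = 6.0
  dj = 3.0

  ! No data clauses: the variables carry the device attribute, so the
  ! compiler is expected to use the device copies directly.
  !$acc kernels
  dk = di / dj
  !$acc end kernels

  ! Copy the result back to the host to check it (6.0 / 3.0 = 2.0).
  hk = dk
  print *, 'dk =', hk
end program kernels_scalar_demo
```

Because this version contains no cuSPARSE or cuBLAS calls, it only checks the OpenACC part. If it behaves as expected, the library calls can be added back one at a time.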

Hope this helps,
Mat

Hi there,

thanks for the help. It worked. However, I now experience a segmentation fault when cublas functions write into variables for which an accelerator kernel has been generated. See the code below:

Program Test
  use cusparse
  use cublas
  use cudafor
  Implicit None
  type(cusparseHandle) :: handle
  type(cublashandle) :: handleblas
  type(cusparseMatDescr) :: descrA
  Integer, allocatable, dimension(:) :: colindex, rowindex
  real, allocatable, dimension(:) :: values, x,y
  real,device,allocatable,dimension(:) :: dvalues, dx0,dx1, dy0,dy1
  Integer, device, allocatable, dimension(:) :: dcolindex, drowindex
  real,device :: alpha, beta, di, dj, dk
  real :: t1, t0
  integer :: status
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  !! |1.5; 0.0; 3.5 |
  !! |0.0; 1.3; 2.1 |
  !! |3.3; 0.0; 0.0 |
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  Allocate(values(5),colIndex(5),rowindex(4),x(3),y(3))
  Allocate(dvalues(5),dcolIndex(5),drowindex(4),dx0(3),dy0(3),dx1(3),dy1(3))
  values=(/1.5,3.5,1.3,2.1,3.3/)
  colindex=(/1,3,2,3,1/)
  rowindex=(/1,3,5,6/)
  x=1.0;y=1.0
  !!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
  status=cusparseCreate(handle)
  status=cublasCreate(handleblas)
  status = cublasSetPointerMode(handleblas, CUBLAS_POINTER_MODE_DEVICE)
  status=cusparseCreateMatDescr(descrA)
  status = cusparseSetMatType(descrA,0) !!general matrix
  status = cusparseSetMatIndexBase(descrA,1) !!indexing starts with 1
  status = cusparseSetMatDiagType(descrA,0) !!diagonal non-unity
  status = cusparseSetMatFillMode(descrA,0)
  !!
  status=cublasinit()
  dvalues=values
  drowindex=rowindex
  dcolindex=colindex
  dx0=x
  dy0=y
  !!multiply
  status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
       alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
       & csrColInd=dcolindex, x=dx0, beta=0.0, y=dx1)
  status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
       alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
       & csrColInd=dcolindex, x=dy0, beta=0.0, y=dy1)
  status=cublasSdot_v2(handleblas,3,dx1,1,dx1,1,di)
  status=cublasSdot_v2(handleblas,3,dy1,1,dy1,1,dj)
  status=cublasSdot_v2(handleblas,3,dy1,1,dy1,1,dk)
  !$acc kernels
  dk=di/dj
  !$acc end kernels
End Program Test

The issue is in the last line before the !$acc region. This line simply repeats the operation above it, but writes into dk, which is also used in the acc region. Running the executable results in a segmentation fault. Either commenting out the last cublasSdot call or adding another variable “dl” to replace “dk” in the acc region makes the executable run without faults.

Note that the seg fault also occurs when the last cublas call is placed below the acc region.

This seems illogical to me. Am I getting something wrong here?

Thank you very much.

Hi vb86,

I’m thinking that the compiler might be missing that these are device scalars and trying to copy them in. Let’s add a “present” clause to tell the compiler that they are already there.

   !$acc kernels present(dk,di,dj)
   dk=di/dj
   !$acc end kernels
Mat