Hi there,
I have started using CUDA Fortran for GPU programming and got stuck. In a simple program I have the following sequence of operations:
a_1) y = M*y (cuSPARSE sparse matrix M times vector y)
a_2) x = M*x (cuSPARSE sparse matrix M times vector x)
b_1) i = cuBLAS dot product of y with itself
b_2) j = cuBLAS dot product of x with itself
c) k = i/j
The program looks like this:
Program Test
use cusparse
use cublas
use cudafor
Implicit None
type(cusparseHandle) :: handle
type(cublashandle) :: handleblas
type(cusparseMatDescr) :: descrA
Integer, allocatable, dimension(:) :: colindex, rowindex
real, allocatable, dimension(:) :: values, x,y
real,device,allocatable,dimension(:) :: dvalues, dx0,dx1, dy0,dy1
Integer, device, allocatable, dimension(:) :: dcolindex, drowindex
real,device :: alpha, beta, di, dj, dk
integer :: status
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
!! |1.5; 0.0; 3.5 |
!! |0.0; 1.3; 2.1 |
!! |3.3; 0.0; 0.0 |
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Allocate(values(5),colIndex(5),rowindex(4),x(3),y(3))
Allocate(dvalues(5),dcolIndex(5),drowindex(4),dx0(3),dy0(3),dx1(3),dy1(3))
values=(/1.5,3.5,1.3,2.1,3.3/)
colindex=(/1,3,2,3,1/)
rowindex=(/1,3,5,6/)
x=1.0;y=1.0
!!@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
status=cusparseCreate(handle)
status=cublasCreate(handleblas)
status = cublasSetPointerMode(handleblas, CUBLAS_POINTER_MODE_DEVICE)
status=cusparseCreateMatDescr(descrA)
status = cusparseSetMatType(descrA,0) !!general matrix
status = cusparseSetMatIndexBase(descrA,1) !!indexing starts with 1
status = cusparseSetMatDiagType(descrA,0) !!diagonal non-unity
status = cusparseSetMatFillMode(descrA,0)
!!
status=cublasinit()
dvalues=values
drowindex=rowindex
dcolindex=colindex
dx0=x
dy0=y
!!multiply
status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
& csrColInd=dcolindex, x=dx0, beta=0.0, y=dx1)
status = cusparseScsrmv(handle=handle, trans=0, m=3, n=3, nnz=5, &
alpha=1.0, descr=descrA, csrVal=dvalues, csrRowPtr=drowindex,&
& csrColInd=dcolindex, x=dy0, beta=0.0, y=dy1)
status=cublasSdot_v2(handleblas,3,dx1,1,dx1,1,di)
status=cublasSdot_v2(handleblas,3,dy1,1,dy1,1,dj)
dk=di/dj
End Program Test
Steps a and b run fine, which is the whole program except for the last line. Step c, the last line, does not compile. di, dj and dk are all on the device. Compiling gives the error “More than one device-resident object in assignment”. Putting the last line in an OpenACC region (!$acc parallel present(di,dj,dk)) made it compile, but increased the runtime for a larger example from 10^(-5) seconds to 10^(-2) seconds.
What am I doing wrong here??
Thanks a lot
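One possible way around the compile error, as a minimal sketch rather than a verified fix for the program above: the error message suggests that a host-side assignment may involve only one device-resident object, so the division in step c) could be moved into a tiny one-thread kernel. The module name scalar_ops and the kernel name divide_scalars are hypothetical, and the inputs 6.0 and 3.0 are placeholders for the results of the two dot products. Whether this is faster than the OpenACC variant for the larger example has not been measured here.

```fortran
module scalar_ops
  use cudafor
  implicit none
contains
  ! Hypothetical helper: computes k = i / j entirely on the device.
  ! Scalar dummy arguments without the VALUE attribute are expected
  ! to live in device memory, so device scalars can be passed directly.
  attributes(global) subroutine divide_scalars(k, i, j)
    real :: k, i, j
    k = i / j
  end subroutine divide_scalars
end module scalar_ops

program divide_demo
  use cudafor
  use scalar_ops
  implicit none
  real, device :: di, dj, dk
  real :: k
  integer :: istat

  ! Placeholder values standing in for the two cublasSdot_v2 results.
  di = 6.0
  dj = 3.0

  ! A single thread is enough for one scalar division.
  call divide_scalars<<<1,1>>>(dk, di, dj)
  istat = cudaDeviceSynchronize()

  ! Copy back only to inspect the result; a real code could keep dk on the device.
  k = dk
  print *, 'k = ', k
end program divide_demo
```

In the original program the call would replace the line dk=di/dj, with di and dj still filled by cublasSdot_v2 in device pointer mode.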