I am trying to compute a dot product on the GPU (the transpose of one vector multiplied by another vector) and then divide a scalar by the result.
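In symbols, this is roughly what the routines below are meant to compute. The names alpha, eCG, P and Q come from the code; N is only my shorthand for the number of entries summed (the loops below run to nTot), not a variable in the code:

```latex
P^{\mathsf{T}} Q = \sum_{i=1}^{N} P_i \, Q_i ,
\qquad
\alpha = \frac{e_{CG}}{P^{\mathsf{T}} Q}
```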
I first tried a CUF kernel:
Subroutine ScalarDivVecDotVec(alpha,P,Q,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j
  Real(fp_kind), Device:: PdotQ
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
  Real(fp_kind), Device, Intent(INOUT):: alpha
  !$cuf kernel do(1) <<<*,*>>>
  Do i = 1, nTot
    PdotQ = PdotQ + P(i)*Q(i)
  End Do
  alpha = eCG/PdotQ
End Subroutine
This gave me an error: “more than one resident device variable”.
Then I tried using atomicadd in a global kernel:
Attributes(Global) Subroutine ScalarDivVecDotVecGPU(alpha,P,Q,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j, istat
  Real(fp_kind), Device:: PdotQ
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
  Real(fp_kind), Device, Intent(INOUT):: alpha
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  If (i >= 1 .and. i <= nTot) Then
    istat = atomicadd(PdotQ,P(i)*Q(i)) !PdotQ = PdotQ + P(i)*Q(i)
    alpha = eCG/PdotQ
  End If
End Subroutine
This would not compile (some undefined error).
I would have thought that the CUF kernel could do this, since I can successfully do:
Subroutine VecdotVec(V,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: V(3*nTot)
  Real(fp_kind), Device, Intent(INOUT):: eCG
  !$cuf kernel do(1) <<<*,*>>>
  Do i = 1, nTot
    eCG = eCG + V(i)*V(i)
  End Do
End Subroutine
Is there a way to do this easily on the GPU? I would prefer to stay away from cuBLAS and so on for now…
Any help is greatly appreciated,
Kirk