I am trying to compute a dot product on the GPU, i.e. the transpose of one vector multiplied by another vector, and then divide a scalar by the result.
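To be explicit about the operation, here is a sketch of what I want to compute, using the variable names from my code (`alpha`, `eCG`, `P`, `Q`). The sum runs over whatever index range the loop covers:

```latex
\alpha = \frac{e_{CG}}{P^{\mathsf{T}} Q},
\qquad
P^{\mathsf{T}} Q = \sum_{i} P_i \, Q_i
```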

I have tried with CUF kernels:

```
Subroutine ScalarDivVecDotVec(alpha,P,Q,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j
  Real(fp_kind), Device:: PdotQ
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
  Real(fp_kind), Device, Intent(INOUT):: alpha
  !$cuf kernel do(1) <<<*,*>>>
  Do i = 1, nTot
    PdotQ = PdotQ + P(i)*Q(i)
  End Do
  alpha = eCG/PdotQ
End Subroutine
```

This gave me an error saying "more than one resident device variable".

Then I tried to use an atomicadd:

```
Attributes(Global) Subroutine ScalarDivVecDotVecGPU(alpha,P,Q,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j, istat
  Real(fp_kind), Device:: PdotQ
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
  Real(fp_kind), Device, Intent(INOUT):: alpha
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  If (i >= 1 .and. i <= nTot) Then
    istat = atomicadd(PdotQ,P(i)*Q(i)) !PdotQ = PdotQ + P(i)*Q(i)
    alpha = eCG/PdotQ
  End If
End Subroutine
```

This would not compile (it failed with some undefined error).

I would have thought that a CUF kernel could do this, since I can successfully do:

```
Subroutine VecdotVec(V,eCG,nTot)
  Implicit None
  ! Local Vars
  Integer:: i, j
  ! Passed Vars
  Integer, Value:: nTot
  Real(fp_kind), Device, Intent(IN):: V(3*nTot)
  Real(fp_kind), Device, Intent(INOUT):: eCG
  !$cuf kernel do(1) <<<*,*>>>
  Do i = 1, nTot
    eCG = eCG + V(i)*V(i)
  End Do
End Subroutine
```

Is there a way to do this easily on the GPU? I would prefer to stay away from cuBLAS and so on for now…

Any help is greatly appreciated,

Kirk