Vector transpose times a vector

I am trying to perform a vector operation where the transpose of a vector is multiplied by another vector.

I have tried with CUF kernels:

Subroutine ScalarDivVecDotVec(alpha,P,Q,eCG,nTot)

	Implicit None

	! Local Vars
	Integer:: i, j
	Real(fp_kind), Device:: PdotQ

	! Passed Vars
	Integer, Value:: nTot
	Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
	Real(fp_kind), Device, Intent(INOUT):: alpha

	!$cuf kernel do(1) <<<*,*>>>
	Do i = 1, nTot
		PdotQ = PdotQ + P(i)*Q(i)
	End Do
	alpha = eCG/PdotQ
		
End Subroutine

This gave me a compile error saying “more than one resident device variable”.

Then I tried to use an atomicadd:

Attributes(Global) Subroutine ScalarDivVecDotVecGPU(alpha,P,Q,eCG,nTot)

	Implicit None

	! Local Vars
	Integer:: i, j, istat
	Real(fp_kind), Device:: PdotQ

	! Passed Vars
	Integer, Value:: nTot
	Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
	Real(fp_kind), Device, Intent(INOUT):: alpha

	i = (blockIdx%x-1)*blockDim%x + threadIdx%x

	If (i >= 1 .and. i <= nTot) Then
		istat = atomicadd(PdotQ,P(i)*Q(i))   !PdotQ = PdotQ + P(i)*Q(i)
		alpha = eCG/PdotQ
	End If
		
End Subroutine

This would not compile either (some undefined error).

I would have thought that the CUF kernel could do this, since I can successfully do:

Subroutine VecdotVec(V,eCG,nTot)

	Implicit None

	! Local Vars
	Integer:: i, j

	! Passed Vars
	Integer, Value:: nTot
	Real(fp_kind), Device, Intent(IN):: V(3*nTot)
	Real(fp_kind), Device, Intent(INOUT):: eCG

	!$cuf kernel do(1) <<<*,*>>>
	Do i = 1, nTot
		eCG = eCG + V(i)*V(i)
	End Do
		
End Subroutine

Is there a way to do this easily on the GPU? I would prefer to stay away from cuBLAS and so on for now…

Any help is greatly appreciated,

Kirk

Well…

Simple enough to fix. The problem was that I was trying to perform

alpha = eCG/PdotQ

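on the host, where all three are device variables. Host code cannot do arithmetic on more than one device-resident variable in a single assignment, which is what the error was complaining about. Nothing to do with the CUF kernel after all.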

Easy enough to fix.
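
For anyone who finds this later, here is a minimal sketch of one way to do it. It is not necessarily the only or best fix. It assumes fp_kind is double precision (the module name cg_ops and the host temporaries eCG_h and alpha_h are made up for illustration), and it keeps the reduction scalar PdotQ on the host, so the CUF kernel handles the sum as a reduction. The division is then done on host copies and the result is copied back to the device with a plain assignment.

```fortran
Module cg_ops
  Use cudafor
  Implicit None
  ! Assumption: fp_kind is double precision; the original post does not show it.
  Integer, Parameter :: fp_kind = kind(0.0d0)
Contains

  Subroutine ScalarDivVecDotVec(alpha,P,Q,eCG,nTot)

    ! Passed Vars
    Integer, Value:: nTot
    Real(fp_kind), Device, Intent(IN):: P(3*nTot), Q(3*nTot), eCG
    Real(fp_kind), Device, Intent(INOUT):: alpha

    ! Local Vars (all on the host)
    Integer:: i
    Real(fp_kind):: PdotQ, eCG_h, alpha_h

    ! The reduction variable is a host scalar and must be initialized.
    PdotQ = 0.0_fp_kind

    ! Loop bound kept as in the original post.
    !$cuf kernel do(1) <<<*,*>>>
    Do i = 1, nTot
      PdotQ = PdotQ + P(i)*Q(i)
    End Do

    ! One device variable per assignment: copy eCG to the host first,
    ! do the division on the host, then copy the result to the device.
    eCG_h   = eCG
    alpha_h = eCG_h/PdotQ
    alpha   = alpha_h

  End Subroutine

End Module
```

The three assignments at the end each involve at most one device variable, which is what avoids the “more than one resident device variable” complaint. If alpha is only ever needed on the host, the last copy can be dropped and alpha declared without the Device attribute.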