Thanks, that is very helpful. Let me clarify a bit. I initially replaced all computational calls in step 2a by kernel launches (either explicitly with !$cuf kernel do or implicitly by calling cublas/cufft) with explicit memcopies between them (so I replaced "do on cpu" by "memcopy to device; do on gpu; memcopy to host"). This works (gave the correct result) but is very slow. Profiling immediately makes it clear that the memcopies are the issue. Then I started thinking about eliminating those, using a structure like "memcopy to device; do computation A on GPU; do computation B on GPU; ...; memcopy to host" - so removing all except the first and last memcopy. This brings up the question: if I use "!$cuf kernel do" and only device variables appear in the loop, does the compiler "know" this and skip the steps of allocating and transferring variables on the device? Same for the call to cufft.

Also, is there a benefit to writing my own device subroutines as opposed to using “!$cuf kernel do” directives in terms of speed?
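By a hand-written device subroutine I mean something like the following sketch (just to make the comparison concrete; it mirrors the first "!$cuf kernel do" loop from the snippets below, and the block/thread sizes are placeholders I picked, not tuned values):

attributes(global) subroutine mult_kernel(bd, w1d, ad, ndat)
  ! one thread per element: bd(j) = w1d(j) * ad(j)
  integer, value :: ndat
  complex(8) :: bd(ndat), w1d(ndat), ad(ndat)
  integer :: j
  j = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  if (j <= ndat) bd(j) = w1d(j)*ad(j)
end subroutine mult_kernel

! launched e.g. as:
! call mult_kernel<<<(ndat + 255)/256, 256>>>(bd, w1d, ad, ndat)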

The problem I work on is one of manipulating (DFTing, summing, element-wise multiplying) arrays of size 10^5-10^7 of complex double-precision type.

To make things a bit more concrete, consider a cpu code like this:

subroutine NLT(a, ndat, n)
  use parameters  ! module that contains the constant vectors w1, w2 (complex double of length n)
  use fourier     ! module that contains the details of the DFT, like the "hand" descriptor
  integer :: i, ndat, n, fft_status
  complex(8) :: a(ndat), b(ndat), c(ndat)

  do i = 1, n
    b = w1*a
    fft_status = DftiComputeBackward(hand, b, c)
    b = c**2
    fft_status = DftiComputeForward(hand, b, c)
    a = a + w2*c
  enddo
end subroutine NLT

My first attempt looked like this:

subroutine NLT(a, ndat, n)
  use parameters
  use fourier
  integer :: i, j, ndat, n, fft_status
  complex(8) :: a(ndat), b(ndat), c(ndat)
  complex(8), device :: ad(ndat), bd(ndat), cd(ndat)

  do i = 1, n
    !$cuf kernel do
    do j = 1, ndat
      b(j) = w1(j)*a(j)
    enddo

    bd = b
    fft_status = cufftExecZ2Z(plan, bd, cd, CUFFT_INVERSE)
    c = cd

    !$cuf kernel do
    do j = 1, ndat
      b(j) = c(j)**2
    enddo

    bd = b
    fft_status = cufftExecZ2Z(plan, bd, cd, CUFFT_FORWARD)
    c = cd

    !$cuf kernel do
    do j = 1, ndat
      a(j) = a(j) + w2(j)*c(j)
    enddo
  enddo
end subroutine NLT

(note that this is not actual code - at least not stand-alone)

The module "fourier" has been changed to use the cufft library instead of the MKL library, and there are memcopies before and after each computation on the GPU. That slows things down with respect to the cpu code. The result is correct.

My best guess is that the following should eliminate all except two memcopies:

subroutine NLT(a, ndat, n)
  use parameters
  use fourier
  integer :: i, j, ndat, n, fft_status
  complex(8) :: a(ndat)
  complex(8), device :: ad(ndat), bd(ndat), cd(ndat)

  ad = a  ! initial memcopy H2D

  do i = 1, n
    !$cuf kernel do
    do j = 1, ndat
      bd(j) = w1d(j)*ad(j)
    enddo

    fft_status = cufftExecZ2Z(plan, bd, cd, CUFFT_INVERSE)

    !$cuf kernel do
    do j = 1, ndat
      bd(j) = cd(j)**2
    enddo

    fft_status = cufftExecZ2Z(plan, bd, cd, CUFFT_FORWARD)

    !$cuf kernel do
    do j = 1, ndat
      ad(j) = ad(j) + w2d(j)*cd(j)
    enddo
  enddo

  a = ad  ! final memcopy D2H
end subroutine NLT

(here w1d and w2d are device copies of w1 and w2 computed elsewhere)

Would that work and avoid all memcopies inside the loop over index i? Or is the design flawed?

PS. While I am at it, let me add one more question. The "kernel do" directive will result in execution of the loop iterations in an unspecified order, but is the order of the statements inside each iteration guaranteed? For instance, is the following safe, or can the order in which the two lines inside the loop are executed be interchanged for some i?

!$cuf kernel do
do i = 1, ndat
  a(i) = a(i)**2
  b(i) = a(i)*w1d(i)
enddo