Cupy and loops

I am wondering if I can stick to cupy or if there is a better way (which I suspect):

I wrote an iterative algorithm that involves a couple of FFTs in every step:

def singleReconstructionStep(diffMag, fPhases, support):
    fguess    = diffMag*cp.exp(1j*fPhases)
    realguess = cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep  = cp.multiply(realguess,support)
    fstep     = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep , realstep , realguess

Using cupy instead of numpy already gave me a speedup of ~5x.
I repeat this step ~100k times:

for i in range(200000):
    phases = cp.angle(dStep)
    dStep , realStep , realGuess = singleReconstructionStep(magnitudeFromDiffraction,phases,support)

I would suspect it should be much more efficient to also perform this loop on the GPU to avoid moving data between GPU and host. Is this possible just with cupy, and if not, what would be the best way?

If the work items in your for loop are independent, then you could proceed in a data-parallel way, by creating a single array of the extent of the loop (so ~100K entries) and issuing the FFT work on that array.
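
For illustration only, a rough sketch of that data-parallel idea (made-up sizes; this assumes the per-iteration inputs really are independent, which may not match your algorithm):

import cupy as cp

N, H, W = 1000, 200, 200              # hypothetical batch size and image size
fPhases = cp.random.rand(N, H, W)     # N independent phase maps, stacked
diffMag = cp.random.rand(H, W)        # shared magnitude, broadcast over the batch
support = cp.ones((H, W))

# one batched inverse/forward 2D FFT over the last two axes instead of N separate calls
fguess    = diffMag * cp.exp(1j * fPhases)
realguess = cp.fft.ifft2(fguess, axes=(-2, -1))
realstep  = realguess * support
fstep     = cp.fft.fft2(realstep, axes=(-2, -1))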

Your for loop body doesn’t seem to depend on the i variable, and I’m not able to do much with just a snippet of code, so that is about as much as I can say. It appears that you may have a loop-carried dependence in dStep, which would violate the first sentence in my post here. In that case, I have no suggestions. It’s much harder to handle loop-carried dependencies.

Thanks for this answer. As you noted, each step depends on the result of the previous one (but not on earlier ones) and on some constant data structures. What would the right approach be to handle these “loop-carried dependencies”?
Sorry, I did not see the option to have a code block before; let me try again. The snippet included everything except for initialization.


def singleReconstructionStep(diffMag, fPhases, support):
    fguess    = diffMag*cp.exp(1j*fPhases)
    realguess = cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep  = cp.multiply(realguess,support)
    fstep     = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep , realstep , realguess
    

for i in range(200000):
    phases = cp.angle(dStep)
    dStep , realStep , realGuess = singleReconstructionStep(magnitudeFromDiffraction,phases,support)

I don’t know of a single approach or heuristic to handle a loop-carried dependency. As I mentioned, it makes parallelization more difficult, and if any parallelism is still possible, the parallelization method is almost certainly algorithm and possibly data dependent. Usually the approach to handle a loop-carried dependence is to study the algorithm, see if it fits a known pattern, and use a methodology which somebody has already figured out for that type of algorithm/pattern. There is no “one-size-fits-all” approach that I am aware of. If you want to see a completely different, irrelevant example of a loop-carried dependence and the methodology to sort it out in parallel, you can read this paper by someone much smarter than me, and, if you want, this problem that I worked on recently, which is related. But it’s just an irrelevant example. I don’t think it will help you.

Is there any literature or well-known methods to solve whatever problem you are solving, in parallel?

Maybe there is a misunderstanding here: I am not trying to parallelize the iteration steps of the loop. The power of the GPU applied to the FFTs in each step is enough. However, I suspected that, the way it is done so far, data needs to be transferred to the GPU and then transferred back from its memory in each iteration of the loop. I would like to avoid that and keep the data there all the time if possible. Maybe that is just not what is happening; I had hoped someone with more experience with cupy could tell me…

I do not know cuPy, and I can merely guess what the non-FFT parts of this code are doing. To avoid copying data between host and device each time through the loop, you would want to move all computation inside the loop to the GPU.

I am guessing that cp.multiply, cp.exp, and cp.angle are simple element-wise scaling, complex exponential, and atan2 operations on 1D arrays? If so, and according to the cuPy documentation, you should be able to write element-wise kernels using cupy.ElementwiseKernel that perform this processing.
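
As an illustration only (not code from your snippet), a minimal cupy.ElementwiseKernel that fuses the magnitude/phase recombination diffMag*exp(1j*fPhases) into one device kernel might look roughly like this:

import cupy as cp

# hypothetical fused kernel: out = mag * exp(i*phase), written out via cos/sin
make_fguess = cp.ElementwiseKernel(
    'float64 mag, float64 phase',      # inputs
    'complex128 out',                  # output
    'out = complex<double>(mag * cos(phase), mag * sin(phase))',
    'make_fguess')

mag    = cp.random.rand(200, 200)
phase  = cp.random.rand(200, 200)
fguess = make_fguess(mag, phase)       # executes entirely on the device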

I don’t have the motivation to guess or deduce which arguments are numpy arrays, cupy arrays, or python scalars, and that certainly matters to answer the question. If everything is cupy arrays, then for the most part a sequence of cupy operations should involve no significant transfer of data between host and device.
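
Based purely on my reading of the documentation (so treat this as a sketch), the explicit host/device transfer points in cupy would be something like:

import numpy as np
import cupy as cp

host_arr = np.random.rand(200, 200)

dev_arr = cp.asarray(host_arr)         # host -> device copy (explicit)
dev_res = cp.exp(1j * dev_arr) * 2.0   # kernels only, data stays on the device
back    = dev_res.get()                # device -> host copy (explicit; same as cp.asnumpy)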

With a complete example you could make this determination, and/or you could run a very simple profiler test to observe what is happening exactly at each loop iteration. Even if I had a complete example, and I was concerned about correctness, I would probably still run the profiler test. I make mistakes. For the most part, the profiler does not.

Sorry, I had hoped leaving the trivial lines out would make the example more readable and understandable. (A working example will follow.)

Does that mean as soon as a cupy object is initialized, its content will remain on the graphics card? Probably the issue here is that I do not quite understand how cupy objects handle data.

As far as I can see all operations are either cupy ( cp.fft/.multiply etc) operations or multiplications between cupy arrays.
Everything is cupy arrays.

I tried to use nvprof, but that did not give me any information whatsoever; what did you have in mind for profiling (in coarse terms)?

As I said, I don’t know cuPy. I have never used cuPy. The most I can do is read documentation and speculate. I would suggest asking about this in a forum focused on cuPy, not a forum focused on CUDA.

I wonder if “Everything” actually means everything.

Then the only thing I would be concerned about is this:

diffMag*cp.exp(1j*fPhases)

I don’t remember exactly what cupy does with that. I think it is smart enough to figure out how to do that on the device without copy traffic.
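
One quick sanity check (a standalone sketch with made-up inputs) is that the result of that expression is itself a cupy array resident on the device:

import cupy as cp

diffMag = cp.random.rand(200, 200)
fPhases = cp.random.rand(200, 200)

fguess = diffMag * cp.exp(1j * fPhases)   # 1j is a Python scalar, not a host array
print(type(fguess))                       # <class 'cupy.ndarray'>
print(fguess.device)                      # e.g. <CUDA Device 0> -- the data stayed on the GPU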

I had in mind nvprof for GPUs that are pascal or prior, and nsight systems for GPUs that are of the volta generation or newer. nvprof actually works well on volta also, and my “workhorse” is a V100, so I often use nvprof with it. For Turing or Ampere, use nsight systems.

Well, it’s only 4 arrays, so I am quite confident that even I did not overlook one ;)

Ok, I will read more documentation and try to get a grip on how to use the profiler. I came to the more generic forum in the hope that, if something more generic is needed for this task, the chances are higher that someone might know.

Thank you both. (The next post will be a complete working example, just for completeness’ sake.)


import numpy as np
import cupy as cp
import scipy as sp
import scipy.ndimage

import matplotlib.pyplot as plt
import matplotlib.colors as colors



def plot2SideBySide(im1, im2, log1=False):
  fig = plt.figure(figsize = (11,4))
  ax1 = fig.add_subplot(121)
  if log1:
    ax1.imshow(im1,norm=colors.LogNorm(vmin=0.00000001, vmax=30000), cmap='viridis')
  else:
    ax1.imshow(im1,cmap='viridis')
  ax2 = fig.add_subplot(122)
  im_out = ax2.imshow(im2,norm=colors.LogNorm(vmin=0.00000001,vmax=30000),cmap='plasma')
  plt.show()


def plotSideBySide(im1, im2, log1=False):

  fig , (ax1,ax2) = plt.subplots(1,2)

  ax1.imshow(im1, cmap='viridis')
  ax2.imshow(im2,cmap='viridis')

  plt.show()

def getDoubleSquare():
    shape = np.zeros((200,200))
    shape[90:100,90:100] = 1
    shape[100:110,100:110] = 1
    shape_smooth = sp.ndimage.gaussian_filter(shape,2)
    return cp.array(shape_smooth)

def zeropadding(image,paddingFactor):

  if paddingFactor > 1:
    originalSizes = np.array(image.shape)
    paddingFactor = paddingFactor
    paddedSizes = np.array((paddingFactor*(originalSizes)),dtype=int)+np.array([1,1])


    offsets = np.array((paddedSizes-originalSizes)/2, dtype=int)
    zeropaddedImage = cp.zeros(paddedSizes)

    zeropaddedImage[offsets[0]:offsets[0]+originalSizes[0],offsets[1]:offsets[1]+originalSizes[1] ] = image
    return zeropaddedImage

  return image


def getRadiiMap(shape, center):
  x , y = cp.indices(shape)
  return cp.sqrt((x-center[-1])**2 + (y-center[1])**2)

def getRadialMask(shape, radius):
  mask           = cp.ones(shape)
  center         = (cp.floor(shape[0]/2),cp.floor(shape[1]/2))
  radiiMap       = getRadiiMap(shape,center)

  mask[radiiMap > radius] = 0

  # note: gaussian_filter is a host-side SciPy call (hence mask.get()); it runs once at setup, not inside the loop
  return cp.array(sp.ndimage.gaussian_filter(mask.get(),2))



def singleReconstructionStep(diffMag, fPhases, support):
    fguess    = diffMag*cp.exp(1j*fPhases)
    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep  = cp.multiply(realguess,support)
    fstep     = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep , realstep , realguess


def singleMaskedReconstructionStep(diffMag, fMags,fPhases, support, mask ):
    fguess = (cp.multiply(diffMag, mask)+ cp.multiply(fMags, -(mask-1) )) *cp.exp(1j*fPhases)

    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep = cp.multiply(realguess,support)
    fstep = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep, fguess, realstep , realguess



shape = getDoubleSquare()

diffraction = cp.abs(np.fft.fftshift(cp.fft.fftn(shape)))**2

half_gap_px = 6
diffraction_gap = diffraction.copy()
center = (int(diffraction.shape[0]/2),int(diffraction.shape[1]/2))
diffraction_gap[center[0]-half_gap_px:center[0]+half_gap_px, :] =0.0000000000000001


mask = cp.ones(diffraction_gap.shape)
mask[center[0]-half_gap_px:center[0]+half_gap_px, :] =0


autocorr = cp.fft.fftn(diffraction_gap)


radius = 25
support  = getRadialMask(diffraction_gap.shape,radius)



magnitudeFromDiffraction = cp.sqrt(diffraction_gap)

dStep , realStep , realGuess = singleReconstructionStep(magnitudeFromDiffraction,cp.zeros(diffraction.shape),support)


for i in range(100000):
    phases = cp.angle(dStep)
    estimatedMagitude = cp.abs(dStep)
    dStep , dGuess, realStep , realGuess = singleMaskedReconstructionStep(magnitudeFromDiffraction,estimatedMagitude,phases,support,mask)


plot2SideBySide(np.abs(realStep.get()),np.abs(dStep.get())**2)

This forum is not generic. It is specific to CUDA. I take that back, it is even more specific than that. It is specific to CUDA programming and performance. Your question is specific. It is specific to cuPy. cuPy ≠ CUDA.

I do not know of a generic GPU computing forum, but it is highly likely there is one somewhere. You might want to consult an internet search engine of your choice.

^^ Well, since my issue is one of performance when using an NVIDIA GPU with the help of CUDA (which, I may be completely wrong here, is what cupy does…), especially the last part of “General discussion area for algorithms, optimizations, and approaches to GPU Computing with CUDA C, C++, Thrust, Fortran, Python (pyCUDA), etc.” apparently misled me to think this was the right place to come. Apologies.

Let’s use a modified version of your code. I’ve ripped out some whitespace, removed all plotting, and reduced the final loop count from 100000 to 3. To ascertain whether we have undesired H->D or D->H copy traffic going on during the loop iterations, we don’t need to do 100000 loops. 3 will suffice. Here’s the modified code and a profiler run. The profiler output is lengthy (more than 100 lines) so I opted to capture it to a file:

$ cat t85.py
import numpy as np
import cupy as cp
import scipy as sp
import scipy.ndimage

def getDoubleSquare():
    shape = np.zeros((200,200))
    shape[90:100,90:100] = 1
    shape[100:110,100:110] = 1
    shape_smooth = sp.ndimage.gaussian_filter(shape,2)
    return cp.array(shape_smooth)

def zeropadding(image,paddingFactor):

  if paddingFactor > 1:
    originalSizes = np.array(image.shape)
    paddingFactor = paddingFactor
    paddedSizes = np.array((paddingFactor*(originalSizes)),dtype=int)+np.array([1,1])

    offsets = np.array((paddedSizes-originalSizes)/2, dtype=int)
    zeropaddedImage = cp.zeros(paddedSizes)

    zeropaddedImage[offsets[0]:offsets[0]+originalSizes[0],offsets[1]:offsets[1]+originalSizes[1] ] = image
    return zeropaddedImage

  return image

def getRadiiMap(shape, center):
  x , y = cp.indices(shape)
  return cp.sqrt((x-center[-1])**2 + (y-center[1])**2)

def getRadialMask(shape, radius):
  mask           = cp.ones(shape)
  center         = (cp.floor(shape[0]/2),cp.floor(shape[1]/2))
  radiiMap       = getRadiiMap(shape,center)

  mask[radiiMap > radius] = 0

  return cp.array(sp.ndimage.gaussian_filter(mask.get(),2))

def singleReconstructionStep(diffMag, fPhases, support):
    fguess    = diffMag*cp.exp(1j*fPhases)
    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep  = cp.multiply(realguess,support)
    fstep     = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep , realstep , realguess


def singleMaskedReconstructionStep(diffMag, fMags,fPhases, support, mask ):
    fguess = (cp.multiply(diffMag, mask)+ cp.multiply(fMags, -(mask-1) )) *cp.exp(1j*fPhases)

    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep = cp.multiply(realguess,support)
    fstep = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep, fguess, realstep , realguess

shape = getDoubleSquare()

diffraction = cp.abs(np.fft.fftshift(cp.fft.fftn(shape)))**2

half_gap_px = 6
diffraction_gap = diffraction.copy()
center = (int(diffraction.shape[0]/2),int(diffraction.shape[1]/2))
diffraction_gap[center[0]-half_gap_px:center[0]+half_gap_px, :] =0.0000000000000001

mask = cp.ones(diffraction_gap.shape)
mask[center[0]-half_gap_px:center[0]+half_gap_px, :] =0

autocorr = cp.fft.fftn(diffraction_gap)

radius = 25
support  = getRadialMask(diffraction_gap.shape,radius)

magnitudeFromDiffraction = cp.sqrt(diffraction_gap)

dStep , realStep , realGuess = singleReconstructionStep(magnitudeFromDiffraction,cp.zeros(diffraction.shape),support)

for i in range(3):
    phases = cp.angle(dStep)
    estimatedMagitude = cp.abs(dStep)
    dStep , dGuess, realStep , realGuess = singleMaskedReconstructionStep(magnitudeFromDiffraction,estimatedMagitude,phases,support,mask)

print(dStep)
$ nvprof --print-gpu-trace --log-file profout.txt python t85.py
[[-5.97315970e-11+1.40631006e-16j -2.59929037e-11-7.76418938e-17j
   8.50966749e-11-2.38055633e-16j ...  2.98014524e-10+3.67665441e-16j
   1.38294391e-10+4.20832158e-16j  1.53517077e-12+3.31693888e-16j]
 [ 9.98116624e-12-6.20064813e-16j  2.23432659e-11-7.99785680e-16j
   9.97066484e-11-9.39857432e-16j ...  3.36414701e-10-4.48025333e-16j
   2.02616782e-10-3.98604894e-16j  7.68111307e-11-4.66628508e-16j]
 [ 1.46782715e-10+8.73292816e-16j  1.23035276e-10+7.84899895e-16j
   1.30885306e-10+6.99079972e-16j ...  3.56507922e-10+6.87495650e-16j
   2.93999326e-10+8.63713457e-16j  2.09922571e-10+9.15482018e-16j]
 ...
 [ 2.53047243e-10+8.82063702e-16j  2.97185532e-10+6.59105247e-16j
   3.28637075e-10+4.03450078e-16j ...  1.23400872e-10+5.98930668e-16j
   1.73388224e-10+8.70692048e-16j  2.11914235e-10+9.70604757e-16j]
 [ 1.04025126e-10-8.77326689e-17j  1.47955714e-10-4.41165466e-16j
   2.26180271e-10-7.85323836e-16j ...  2.20242950e-10+2.19411071e-16j
   1.62199433e-10+2.82019930e-16j  1.12076001e-10+1.75744369e-16j]
 [-2.28254718e-11-3.18512081e-16j  1.95687967e-11-4.93504607e-16j
   1.28431498e-10-6.37567104e-16j ...  2.64247830e-10-2.83154438e-16j
   1.31162617e-10-1.78750865e-16j  1.97898065e-11-1.94397252e-16j]]
$

(I added a print at the end, just for grins). This was run on CUDA 11.5, and on a GeForce GTX 960 (maxwell), so nvprof is the right tool to use.

Here are the “tail end” contents of the file:

810.20ms  6.8160us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [672]
810.29ms  66.529us             (25 1 1)        (8 40 1)        72        0B  12.500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [675]
810.36ms  76.225us            (100 1 1)        (40 2 1)        72        0B  6.2500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [677]
810.43ms  7.2320us                    -               -         -         -         -  312.50KB  41.209GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
810.44ms  7.2640us                    -               -         -         -         -  312.50KB  41.027GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
810.63ms  7.4880us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [688]
810.67ms  6.7520us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [691]
810.73ms  47.297us            (313 1 1)       (128 1 1)        26        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_angle__complex128_float64 [695]
810.78ms  20.064us            (313 1 1)       (128 1 1)        23        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_absolute__complex128_float64 [699]
810.80ms  11.136us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_float64_float64 [703]
810.86ms  7.6800us            (313 1 1)       (128 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_subtract__float64_float_float64 [707]
810.89ms  5.4720us            (313 1 1)       (128 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_negative__float64_float64 [711]
810.92ms  7.6800us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_float64_float64 [715]
810.96ms  7.6800us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_add__float64_float64_float64 [719]
811.01ms  9.8240us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__complex_float64_complex128 [723]
811.04ms  39.489us            (313 1 1)       (128 1 1)        38        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_exp__complex128_complex128 [727]
811.08ms  14.016us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_complex128_complex128 [731]
811.16ms  8.1920us                    -               -         -         -         -  312.50KB  36.380GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.19ms  8.3840us                    -               -         -         -         -  312.50KB  35.547GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.28ms  5.8880us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [739]
811.32ms  6.7200us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [742]
811.41ms  66.785us             (25 1 1)        (8 40 1)        72        0B  12.500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [745]
811.48ms  77.377us            (100 1 1)        (40 2 1)        72        0B  6.2500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [747]
811.55ms  28.544us            (313 1 1)       (128 1 1)        23        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_true_divide__complex128_complex_complex128 [751]
811.58ms  6.5280us                    -               -         -         -         -  312.50KB  45.653GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.59ms  7.3930us                    -               -         -         -         -  312.50KB  40.312GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.68ms  6.1440us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [759]
811.73ms  7.7760us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [762]
811.77ms  15.456us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__complex128_float64_complex128 [766]
811.85ms  7.9680us                    -               -         -         -         -  312.50KB  37.403GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.88ms  8.3520us                    -               -         -         -         -  312.50KB  35.683GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
811.96ms  6.0800us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [774]
812.01ms  7.0080us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [777]
812.10ms  64.064us             (25 1 1)        (8 40 1)        72        0B  12.500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [780]
812.16ms  79.873us            (100 1 1)        (40 2 1)        72        0B  6.2500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [782]
812.24ms  6.8800us                    -               -         -         -         -  312.50KB  43.317GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
812.25ms  7.4560us                    -               -         -         -         -  312.50KB  39.971GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
812.43ms  6.8170us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [793]
812.48ms  6.7840us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [796]
812.53ms  47.296us            (313 1 1)       (128 1 1)        26        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_angle__complex128_float64 [800]
812.57ms  20.288us            (313 1 1)       (128 1 1)        23        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_absolute__complex128_float64 [804]
812.60ms  10.432us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_float64_float64 [808]
812.65ms  7.4240us            (313 1 1)       (128 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_subtract__float64_float_float64 [812]
812.69ms  5.3750us            (313 1 1)       (128 1 1)        10        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_negative__float64_float64 [816]
812.73ms  7.1360us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_float64_float64 [820]
812.76ms  7.3920us            (313 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_add__float64_float64_float64 [824]
812.82ms  10.432us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__complex_float64_complex128 [828]
812.85ms  39.841us            (313 1 1)       (128 1 1)        38        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_exp__complex128_complex128 [832]
812.89ms  13.664us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__float64_complex128_complex128 [836]
812.97ms  8.6080us                    -               -         -         -         -  312.50KB  34.622GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.00ms  8.5760us                    -               -         -         -         -  312.50KB  34.751GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.08ms  6.2080us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [844]
813.13ms  7.9370us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [847]
813.22ms  65.794us             (25 1 1)        (8 40 1)        72        0B  12.500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [850]
813.28ms  76.640us            (100 1 1)        (40 2 1)        72        0B  6.2500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [852]
813.36ms  28.032us            (313 1 1)       (128 1 1)        23        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_true_divide__complex128_complex_complex128 [856]
813.39ms  6.3360us                    -               -         -         -         -  312.50KB  47.036GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.40ms  7.6480us                    -               -         -         -         -  312.50KB  38.967GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.48ms  6.4320us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [864]
813.53ms  7.1040us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [867]
813.57ms  14.432us            (313 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_multiply__complex128_float64_complex128 [871]
813.65ms  7.9680us                    -               -         -         -         -  312.50KB  37.403GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.68ms  8.5760us                    -               -         -         -         -  312.50KB  34.751GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
813.79ms  5.9210us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [879]
813.86ms  7.1040us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [882]
813.94ms  65.537us             (25 1 1)        (8 40 1)        72        0B  12.500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [885]
814.01ms  73.793us            (100 1 1)        (40 2 1)        72        0B  6.2500KB         -           -           -           -  NVIDIA GeForce          1         7  void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [887]
814.09ms  6.8480us                    -               -         -         -         -  312.50KB  43.520GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
814.09ms  7.5520us                    -               -         -         -         -  312.50KB  39.463GB/s      Device      Device  NVIDIA GeForce          1         7  [CUDA memcpy DtoD]
814.16ms  6.6240us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [896]
814.20ms  7.2640us            (157 1 1)       (128 1 1)        17        0B        0B         -           -           -           -  NVIDIA GeForce          1         7  cupy_copy__complex128_complex128 [899]
814.26ms  97.057us                    -               -         -         -         -  625.00KB  6.1412GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

That final print statement I added is what created this final trace entry:

814.26ms  97.057us                    -               -         -         -         -  625.00KB  6.1412GB/s      Device    Pageable  NVIDIA GeForce          1         7  [CUDA memcpy DtoH]

Other than that, there are no instances of DtoH or HtoD in the above “tail end”. With a bit of effort you can map the profiler activity in the above “tail end” back to your source code (*), and conclude that we are covering at least 1 whole loop iteration. As a result, I conclude that a loop iteration involves no “unexpected” HtoD or DtoH activity.

(*) Knowing that the 3rd loop iteration ends at the end of the profiler output, we can study the sequence of steps leading up to that, then scan backward through the file to find the next example of that repeating sequence. This allows us to scope a loop iteration within the profiler output. Using this methodology, I conclude (mistakes are possible) that the last entry corresponding to the second loop iteration is at timestamp 810.67ms, and the first entry of the third and final loop begins at 810.73ms.

Oh wow, thanks a lot Robert! I will try to reproduce this; nvprof gives me

cat profout.txt 
==15967== NVPROF is profiling process 15967, command: python3 cudaCDI_profExp.py
==15967== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM 
==15967== Profiling application: python3 cudaCDI_profExp.py
==15967== Profiling result:
No kernels were profiled.
==15967== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

Thanks a lot for your help; since apparently the code already does what I wanted it to, I hope I can take it from here. But first I need to sort this profiler issue out… (btw, I got a Quadro M2000, so rather old…)
Cheers

This info, already included in the error output, could be your starting point to get more information:

Here is that link directly: NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters | NVIDIA Developer

Thanks, I saw that, and now, after installing cupy also in the root environment, using sudo works.