Let’s use a modified version of your code. I’ve ripped out some whitespace, removed all plotting, and reduced the final loop count from 100000 to 3. To ascertain whether we have undesired H->D or D->H copy traffic going on during the loop iterations, we don’t need to do 100000 loops; 3 will suffice. Here’s the modified code and a profiler run. The profiler output is lengthy (more than 100 lines), so I opted to capture it to a file:

```
$ cat t85.py
import numpy as np
import cupy as cp
import scipy as sp
import scipy.ndimage
def getDoubleSquare():
    shape = np.zeros((200,200))
    shape[90:100,90:100] = 1
    shape[100:110,100:110] = 1
    shape_smooth = sp.ndimage.gaussian_filter(shape,2)
    return cp.array(shape_smooth)
def zeropadding(image,paddingFactor):
    if paddingFactor > 1:
        originalSizes = np.array(image.shape)
        paddingFactor = paddingFactor
        paddedSizes = np.array((paddingFactor*(originalSizes)),dtype=int)+np.array([1,1])
        offsets = np.array((paddedSizes-originalSizes)/2, dtype=int)
        zeropaddedImage = cp.zeros(paddedSizes)
        zeropaddedImage[offsets[0]:offsets[0]+originalSizes[0],offsets[1]:offsets[1]+originalSizes[1] ] = image
        return zeropaddedImage
    return image
def getRadiiMap(shape, center):
    x , y = cp.indices(shape)
    return cp.sqrt((x-center[-1])**2 + (y-center[1])**2)
def getRadialMask(shape, radius):
    mask = cp.ones(shape)
    center = (cp.floor(shape[0]/2),cp.floor(shape[1]/2))
    radiiMap = getRadiiMap(shape,center)
    mask[radiiMap > radius] = 0
    return cp.array(sp.ndimage.gaussian_filter(mask.get(),2))
def singleReconstructionStep(diffMag, fPhases, support):
    fguess = diffMag*cp.exp(1j*fPhases)
    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep = cp.multiply(realguess,support)
    fstep = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep , realstep , realguess
def singleMaskedReconstructionStep(diffMag, fMags,fPhases, support, mask ):
    fguess = (cp.multiply(diffMag, mask)+ cp.multiply(fMags, -(mask-1) )) *cp.exp(1j*fPhases)
    realguess =cp.fft.ifftshift(cp.fft.ifft2(cp.fft.fftshift(fguess)))
    realstep = cp.multiply(realguess,support)
    fstep = cp.fft.fftshift(cp.fft.fft2(cp.fft.ifftshift(realstep)))
    return fstep, fguess, realstep , realguess
shape = getDoubleSquare()
diffraction = cp.abs(np.fft.fftshift(cp.fft.fftn(shape)))**2
half_gap_px = 6
diffraction_gap = diffraction.copy()
center = (int(diffraction.shape[0]/2),int(diffraction.shape[1]/2))
diffraction_gap[center[0]-half_gap_px:center[0]+half_gap_px, :] =0.0000000000000001
mask = cp.ones(diffraction_gap.shape)
mask[center[0]-half_gap_px:center[0]+half_gap_px, :] =0
autocorr = cp.fft.fftn(diffraction_gap)
radius = 25
support = getRadialMask(diffraction_gap.shape,radius)
magnitudeFromDiffraction = cp.sqrt(diffraction_gap)
dStep , realStep , realGuess = singleReconstructionStep(magnitudeFromDiffraction,cp.zeros(diffraction.shape),support)
for i in range(3):
    phases = cp.angle(dStep)
    estimatedMagitude = cp.abs(dStep)
    dStep , dGuess, realStep , realGuess = singleMaskedReconstructionStep(magnitudeFromDiffraction,estimatedMagitude,phases,support,mask)
print(dStep)
$ nvprof --print-gpu-trace --log-file profout.txt python t85.py
[[-5.97315970e-11+1.40631006e-16j -2.59929037e-11-7.76418938e-17j
8.50966749e-11-2.38055633e-16j ... 2.98014524e-10+3.67665441e-16j
1.38294391e-10+4.20832158e-16j 1.53517077e-12+3.31693888e-16j]
[ 9.98116624e-12-6.20064813e-16j 2.23432659e-11-7.99785680e-16j
9.97066484e-11-9.39857432e-16j ... 3.36414701e-10-4.48025333e-16j
2.02616782e-10-3.98604894e-16j 7.68111307e-11-4.66628508e-16j]
[ 1.46782715e-10+8.73292816e-16j 1.23035276e-10+7.84899895e-16j
1.30885306e-10+6.99079972e-16j ... 3.56507922e-10+6.87495650e-16j
2.93999326e-10+8.63713457e-16j 2.09922571e-10+9.15482018e-16j]
...
[ 2.53047243e-10+8.82063702e-16j 2.97185532e-10+6.59105247e-16j
3.28637075e-10+4.03450078e-16j ... 1.23400872e-10+5.98930668e-16j
1.73388224e-10+8.70692048e-16j 2.11914235e-10+9.70604757e-16j]
[ 1.04025126e-10-8.77326689e-17j 1.47955714e-10-4.41165466e-16j
2.26180271e-10-7.85323836e-16j ... 2.20242950e-10+2.19411071e-16j
1.62199433e-10+2.82019930e-16j 1.12076001e-10+1.75744369e-16j]
[-2.28254718e-11-3.18512081e-16j 1.95687967e-11-4.93504607e-16j
1.28431498e-10-6.37567104e-16j ... 2.64247830e-10-2.83154438e-16j
1.31162617e-10-1.78750865e-16j 1.97898065e-11-1.94397252e-16j]]
$
```

(I added a print at the end, just for grins). This was run on CUDA 11.5, on a GeForce GTX 960 (Maxwell), so nvprof is the right tool to use.

Here are the “tail end” contents of the file:

```
810.20ms 6.8160us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [672]
810.29ms 66.529us (25 1 1) (8 40 1) 72 0B 12.500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [675]
810.36ms 76.225us (100 1 1) (40 2 1) 72 0B 6.2500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [677]
810.43ms 7.2320us - - - - - 312.50KB 41.209GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
810.44ms 7.2640us - - - - - 312.50KB 41.027GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
810.63ms 7.4880us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [688]
810.67ms 6.7520us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [691]
810.73ms 47.297us (313 1 1) (128 1 1) 26 0B 0B - - - - NVIDIA GeForce 1 7 cupy_angle__complex128_float64 [695]
810.78ms 20.064us (313 1 1) (128 1 1) 23 0B 0B - - - - NVIDIA GeForce 1 7 cupy_absolute__complex128_float64 [699]
810.80ms 11.136us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_float64_float64 [703]
810.86ms 7.6800us (313 1 1) (128 1 1) 10 0B 0B - - - - NVIDIA GeForce 1 7 cupy_subtract__float64_float_float64 [707]
810.89ms 5.4720us (313 1 1) (128 1 1) 10 0B 0B - - - - NVIDIA GeForce 1 7 cupy_negative__float64_float64 [711]
810.92ms 7.6800us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_float64_float64 [715]
810.96ms 7.6800us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_add__float64_float64_float64 [719]
811.01ms 9.8240us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__complex_float64_complex128 [723]
811.04ms 39.489us (313 1 1) (128 1 1) 38 0B 0B - - - - NVIDIA GeForce 1 7 cupy_exp__complex128_complex128 [727]
811.08ms 14.016us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_complex128_complex128 [731]
811.16ms 8.1920us - - - - - 312.50KB 36.380GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.19ms 8.3840us - - - - - 312.50KB 35.547GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.28ms 5.8880us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [739]
811.32ms 6.7200us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [742]
811.41ms 66.785us (25 1 1) (8 40 1) 72 0B 12.500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [745]
811.48ms 77.377us (100 1 1) (40 2 1) 72 0B 6.2500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [747]
811.55ms 28.544us (313 1 1) (128 1 1) 23 0B 0B - - - - NVIDIA GeForce 1 7 cupy_true_divide__complex128_complex_complex128 [751]
811.58ms 6.5280us - - - - - 312.50KB 45.653GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.59ms 7.3930us - - - - - 312.50KB 40.312GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.68ms 6.1440us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [759]
811.73ms 7.7760us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [762]
811.77ms 15.456us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__complex128_float64_complex128 [766]
811.85ms 7.9680us - - - - - 312.50KB 37.403GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.88ms 8.3520us - - - - - 312.50KB 35.683GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
811.96ms 6.0800us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [774]
812.01ms 7.0080us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [777]
812.10ms 64.064us (25 1 1) (8 40 1) 72 0B 12.500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [780]
812.16ms 79.873us (100 1 1) (40 2 1) 72 0B 6.2500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [782]
812.24ms 6.8800us - - - - - 312.50KB 43.317GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
812.25ms 7.4560us - - - - - 312.50KB 39.971GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
812.43ms 6.8170us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [793]
812.48ms 6.7840us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [796]
812.53ms 47.296us (313 1 1) (128 1 1) 26 0B 0B - - - - NVIDIA GeForce 1 7 cupy_angle__complex128_float64 [800]
812.57ms 20.288us (313 1 1) (128 1 1) 23 0B 0B - - - - NVIDIA GeForce 1 7 cupy_absolute__complex128_float64 [804]
812.60ms 10.432us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_float64_float64 [808]
812.65ms 7.4240us (313 1 1) (128 1 1) 10 0B 0B - - - - NVIDIA GeForce 1 7 cupy_subtract__float64_float_float64 [812]
812.69ms 5.3750us (313 1 1) (128 1 1) 10 0B 0B - - - - NVIDIA GeForce 1 7 cupy_negative__float64_float64 [816]
812.73ms 7.1360us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_float64_float64 [820]
812.76ms 7.3920us (313 1 1) (128 1 1) 13 0B 0B - - - - NVIDIA GeForce 1 7 cupy_add__float64_float64_float64 [824]
812.82ms 10.432us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__complex_float64_complex128 [828]
812.85ms 39.841us (313 1 1) (128 1 1) 38 0B 0B - - - - NVIDIA GeForce 1 7 cupy_exp__complex128_complex128 [832]
812.89ms 13.664us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__float64_complex128_complex128 [836]
812.97ms 8.6080us - - - - - 312.50KB 34.622GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.00ms 8.5760us - - - - - 312.50KB 34.751GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.08ms 6.2080us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [844]
813.13ms 7.9370us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [847]
813.22ms 65.794us (25 1 1) (8 40 1) 72 0B 12.500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [850]
813.28ms 76.640us (100 1 1) (40 2 1) 72 0B 6.2500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [852]
813.36ms 28.032us (313 1 1) (128 1 1) 23 0B 0B - - - - NVIDIA GeForce 1 7 cupy_true_divide__complex128_complex_complex128 [856]
813.39ms 6.3360us - - - - - 312.50KB 47.036GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.40ms 7.6480us - - - - - 312.50KB 38.967GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.48ms 6.4320us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [864]
813.53ms 7.1040us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [867]
813.57ms 14.432us (313 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_multiply__complex128_float64_complex128 [871]
813.65ms 7.9680us - - - - - 312.50KB 37.403GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.68ms 8.5760us - - - - - 312.50KB 34.751GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
813.79ms 5.9210us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [879]
813.86ms 7.1040us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [882]
813.94ms 65.537us (25 1 1) (8 40 1) 72 0B 12.500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=8, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=1, unsigned int, double>(kernel_arguments_t<unsigned int>) [885]
814.01ms 73.793us (100 1 1) (40 2 1) 72 0B 6.2500KB - - - - NVIDIA GeForce 1 7 void composite_2way_fft<unsigned int=200, unsigned int=5, unsigned int=2, padding_t=0, twiddle_t=0, loadstore_modifier_t=2, unsigned int=8, layout_t=0, unsigned int, double>(kernel_arguments_t<unsigned int>) [887]
814.09ms 6.8480us - - - - - 312.50KB 43.520GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
814.09ms 7.5520us - - - - - 312.50KB 39.463GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]
814.16ms 6.6240us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [896]
814.20ms 7.2640us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [899]
814.26ms 97.057us - - - - - 625.00KB 6.1412GB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$
```

That final print statement I added is what created this final trace entry:

```
814.26ms 97.057us - - - - - 625.00KB 6.1412GB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]
```

Other than that, there are no instances of `DtoH` or `HtoD` in the above “tail end”. With a bit of effort you can map the profiler activity in the above “tail end” back to your source code (`*`), and conclude that we are covering at least 1 whole loop iteration. As a result, I conclude that a loop iteration involves no “unexpected” HtoD or DtoH activity.
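If you would rather not eyeball the whole trace, the check can be scripted. The following is a minimal sketch, assuming the trace was captured with `--log-file profout.txt` as above; the function name `count_memcpy_kinds` is my own, not part of any tool. It simply tallies the `[CUDA memcpy ...]` entries by direction:

```python
import re
from collections import Counter

# Matches trace entries such as "[CUDA memcpy DtoH]" and captures the direction.
MEMCPY_RE = re.compile(r"\[CUDA memcpy (\w+)\]")

def count_memcpy_kinds(lines):
    """Tally memcpy directions (HtoD, DtoH, DtoD, ...) in nvprof trace lines."""
    counts = Counter()
    for line in lines:
        match = MEMCPY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    # Three entries taken from the trace above; for a real run, pass the open
    # log file instead, e.g. count_memcpy_kinds(open("profout.txt")).
    sample = [
        "814.09ms 7.5520us - - - - - 312.50KB 39.463GB/s Device Device NVIDIA GeForce 1 7 [CUDA memcpy DtoD]",
        "814.20ms 7.2640us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [899]",
        "814.26ms 97.057us - - - - - 625.00KB 6.1412GB/s Device Pageable NVIDIA GeForce 1 7 [CUDA memcpy DtoH]",
    ]
    print(count_memcpy_kinds(sample))
```

Run over the whole file, this will also count the HtoD/DtoH copies from the setup code before the loop, so restrict it to the lines belonging to the loop iterations if that is the part you care about.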

(`*`) Knowing that the 3rd loop iteration ends at the end of the profiler output, we can study the sequence of steps leading up to that, then scan backward through the file to find the next example of that repeating sequence. This allows us to scope a loop iteration within the profiler output. Each iteration begins with a `cupy_angle` kernel, since `cp.angle(dStep)` is the first statement in the loop body. Using this methodology, I conclude (mistakes are possible) that the last entry corresponding to the second loop iteration is at timestamp 812.48ms, and the first entry of the third and final loop begins at 812.53ms. The second iteration itself begins at 810.73ms, so the “tail end” covers two complete iterations.
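The backward scan can be scripted too. Here is a rough sketch, assuming (from the source code above) that `cp.angle(dStep)` is the first statement in the loop body, so each `cupy_angle` kernel launch in the trace should mark the start of one iteration. The helper name `iteration_starts` is hypothetical:

```python
def iteration_starts(lines, marker="cupy_angle"):
    """Return the start timestamp of every trace line whose kernel name contains marker."""
    starts = []
    for line in lines:
        fields = line.split()
        # The first whitespace-separated field of a trace entry is its start time.
        if fields and marker in line:
            starts.append(fields[0])
    return starts

if __name__ == "__main__":
    # A few entries copied from the "tail end" above.
    sample = [
        "810.67ms 6.7520us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [691]",
        "810.73ms 47.297us (313 1 1) (128 1 1) 26 0B 0B - - - - NVIDIA GeForce 1 7 cupy_angle__complex128_float64 [695]",
        "812.48ms 6.7840us (157 1 1) (128 1 1) 17 0B 0B - - - - NVIDIA GeForce 1 7 cupy_copy__complex128_complex128 [796]",
        "812.53ms 47.296us (313 1 1) (128 1 1) 26 0B 0B - - - - NVIDIA GeForce 1 7 cupy_angle__complex128_float64 [800]",
    ]
    print(iteration_starts(sample))  # ['810.73ms', '812.53ms']
```

Everything between two consecutive timestamps it returns belongs to a single loop iteration, which is the span to check for `HtoD` or `DtoH` entries.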