I’m doing some in-place real-to-complex and complex-to-real transforms. To fill the array I will be FFT’ing, I copy from a real data array (array1) using a simple 1D kernel that stores fft_array[i].x = array1[2 * i]. This reads every other element of array1, so the reads are not contiguous, but the kernel calls are still fairly fast.

However, the kernel that copies the real parts of the FFT array into a different (real) array (call it array2) is about 6-7x slower. This kernel does copies of the form array2[8 * i] = fft_array[i].x.
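
For reference, here is a minimal sketch of the two copy kernels described above. The kernel names, the use of single-precision float data, float2 standing in for cufftComplex (which has the same layout), the launch configuration, and the host helper roundtrip_matches are all assumptions made for illustration, not the exact code. The FFT itself is left out, so this only exercises the two strided copies.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Copy into the FFT array: reads array1 with a stride of 2 floats,
// writes one float per 8-byte complex element.
// float2 is used here as a stand-in for cufftComplex (assumption).
__global__ void gather_real(float2* fft_array, const float* array1, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) fft_array[i].x = array1[2 * i];
}

// Copy out of the FFT array: reads the real part of each complex element,
// writes array2 with a stride of 8 floats.
__global__ void scatter_real(float* array2, const float2* fft_array, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) array2[8 * i] = fft_array[i].x;
}

// Hypothetical host helper: runs both kernels back to back (no FFT in
// between) and checks that array2[8 * i] ends up equal to array1[2 * i].
bool roundtrip_matches(int n)
{
    std::vector<float> h_in(2 * static_cast<size_t>(n));
    std::vector<float> h_out(8 * static_cast<size_t>(n), 0.0f);
    for (size_t k = 0; k < h_in.size(); ++k) h_in[k] = static_cast<float>(k);

    float* array1 = nullptr;
    float* array2 = nullptr;
    float2* fft_array = nullptr;
    cudaMalloc(&array1, h_in.size() * sizeof(float));
    cudaMalloc(&array2, h_out.size() * sizeof(float));
    cudaMalloc(&fft_array, static_cast<size_t>(n) * sizeof(float2));

    cudaMemcpy(array1, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(array2, 0, h_out.size() * sizeof(float));
    cudaMemset(fft_array, 0, static_cast<size_t>(n) * sizeof(float2));

    // Launch configuration is an arbitrary choice for this sketch.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    gather_real<<<blocks, threads>>>(fft_array, array1, n);
    scatter_real<<<blocks, threads>>>(array2, fft_array, n);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), array2, h_out.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(array1);
    cudaFree(array2);
    cudaFree(fft_array);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[8 * static_cast<size_t>(i)] == h_in[2 * static_cast<size_t>(i)]);
    return ok;
}
```

With 4-byte floats, consecutive threads in the first kernel read addresses 8 bytes apart, while consecutive threads in the second kernel write addresses 32 bytes apart.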

I’m unsure why the second kernel is so much slower when both kernels transfer the same amount of data and neither access pattern is contiguous. Any general advice on optimizing such transfers would be greatly appreciated!