Why does scaling in the frequency domain throw off the scale of the final spatial-domain result when nx != ny?

I’m using cuFFT to perform a C2C transform of purely real data into the frequency domain and then back to the spatial domain. I create the plan with cufftMakePlan2d, choosing nx and ny to be powers of two (i.e. 2^n), and run the transforms with cufftExecC2C(). I scale the result of the forward transform by sqrt( nx * ny ), call cufftExecC2C() again for the inverse transform, and then scale the spatial result a second time by sqrt( nx * ny ). This works fine as long as nx == ny. When they are not equal, however, the final answer is sometimes mis-scaled, depending on their ratio. For example, if nx = 4 and ny = 2, my spatial answer at the end is too large by a factor of 2. Strangely, if nx = 8 and ny = 2, my spatial answer is scaled correctly, but if nx = 16 and ny = 2, it is scaled up by approximately 1.28.
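For reference, cuFFT transforms are unnormalized, so a forward transform followed by an inverse transform should return the input multiplied by nx * ny, whether or not nx == ny. Below is a minimal C++ sketch of that property. It uses a naive DFT as a stand-in for cufftExecC2C() so that it runs without a GPU; the helper names dft2d and roundtrip_gain are hypothetical and not part of cuFFT.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive, unnormalized 2D DFT on a row-major nx-by-ny array.
// sign = -1 is the forward transform, sign = +1 the inverse.
// Hypothetical stand-in for cufftExecC2C(); O(n^2), for small sizes only.
std::vector<cplx> dft2d(const std::vector<cplx>& in, int nx, int ny, int sign)
{
    assert(nx > 0 && ny > 0 && in.size() == std::size_t(nx) * ny);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(in.size());
    for (int u = 0; u < nx; ++u) {
        for (int v = 0; v < ny; ++v) {
            cplx acc(0.0, 0.0);
            for (int x = 0; x < nx; ++x) {
                for (int y = 0; y < ny; ++y) {
                    const double phase =
                        sign * two_pi * (double(u * x) / nx + double(v * y) / ny);
                    acc += in[x * ny + y] * std::polar(1.0, phase);
                }
            }
            out[u * ny + v] = acc;
        }
    }
    return out;
}

// Mean ratio output/input after forward + inverse with no scaling applied.
double roundtrip_gain(int nx, int ny)
{
    const int n = nx * ny;
    std::vector<cplx> a(n);
    for (int i = 0; i < n; ++i)
        a[i] = cplx(1.0 + (i % 7), 0.0);  // purely real, nonzero test data
    const std::vector<cplx> b = dft2d(dft2d(a, nx, ny, -1), nx, ny, +1);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += b[i].real() / a[i].real();
    return sum / n;
}
```

With this sketch, roundtrip_gain(4, 2) comes out as 8 and roundtrip_gain(16, 2) as 32 (up to rounding), i.e. nx * ny in both cases. Dividing by sqrt( nx * ny ) twice should therefore be mathematically equivalent to dividing by nx * ny once.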

All of this strangeness goes away if instead of scaling by sqrt( nx * ny ) on the frequency result and then again on the final spatial result, I defer scaling until the end when I come back out into the spatial result and instead scale by nx * ny. So for now I appear to have a workaround, but I’d like to understand what is going on here.
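Since deferring the scaling avoids the square root entirely, one hypothesis worth checking is whether sqrt( nx * ny ) ends up truncated to an integer somewhere before it is applied. This is only an assumption; nothing above confirms it. The C++ sketch below (helper names isqrt and residual_gain_if_truncated are hypothetical) computes the leftover factor that such a truncation would produce after scaling twice.

```cpp
#include <cassert>

// Truncated integer square root, floor(sqrt(n)), computed without floating point.
int isqrt(int n)
{
    assert(n >= 0);
    int r = 0;
    while ((r + 1) * (r + 1) <= n)
        ++r;
    return r;
}

// ASSUMPTION (not confirmed by the post): the scale factor sqrt(nx * ny)
// is truncated to an integer and then divided out twice.
// The unnormalized round trip multiplies the data by nx * ny, so the
// leftover factor would be (nx * ny) / floor(sqrt(nx * ny))^2.
double residual_gain_if_truncated(int nx, int ny)
{
    const int n = nx * ny;
    const int s = isqrt(n);
    return double(n) / (double(s) * s);
}
```

Under that assumption the leftover factor is 8 / 4 = 2 for nx = 4, ny = 2; 16 / 16 = 1 for nx = 8, ny = 2; and 32 / 25 = 1.28 for nx = 16, ny = 2. When nx == ny and both are powers of two, nx * ny is a perfect square, so truncation would change nothing.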

Can anybody help explain this behavior? The code is compiled with CUDA 9 and runs on a V100 GPU on Linux x64.