The cuFFT library

I’m running a forward 2D transform with cuFFT on a grid of 4096x4096 points. Executing the cufftXtExecDescriptorZ2Z function takes 0.016 s with 4 GPUs, 0.012 s with 2 GPUs, and 0.008 s with 1 GPU. Why does the transform take longer as more GPUs are used?

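It is simply because a 4096x4096 grid is too small to benefit from multiple GPUs. A multi-GPU transform has to exchange data between the devices, and at this size the communication among the GPUs takes more time than the actual computation. Try a larger grid, for example 16384x16384, and you should see much better results.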
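To make the trade-off concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions, not a description of how cuFFT is implemented: it assumes the common 5 * N * log2(N) operation count for a complex FFT, ideal compute scaling across GPUs, and a single all-to-all exchange between the two 1D passes. The function name `estimate_fft2d_time` and all throughput, bandwidth, and overhead values are hypothetical placeholders, not measurements; plug in numbers for your own hardware.

```python
import math

BYTES_PER_ELEMENT = 16  # one double-precision complex value (Z2Z)


def estimate_fft2d_time(n, num_gpus, flops_per_gpu, link_bytes_per_s,
                        exchange_overhead_s):
    """Return (compute_s, comm_s) for one n x n transform.

    Hypothetical model: every parameter is an assumption supplied by
    the caller, and the real library may behave differently.
    """
    points = n * n
    # Common operation-count estimate for a complex FFT of N points.
    flops = 5.0 * points * math.log2(points)
    # Assume the arithmetic is split evenly across the GPUs.
    compute_s = flops / (flops_per_gpu * num_gpus)
    if num_gpus == 1:
        return compute_s, 0.0  # nothing to exchange on a single GPU
    # Each GPU holds points / num_gpus elements and sends all but
    # 1 / num_gpus of them to its peers in the all-to-all exchange.
    bytes_sent_per_gpu = (points * BYTES_PER_ELEMENT / num_gpus
                          * (num_gpus - 1) / num_gpus)
    comm_s = exchange_overhead_s + bytes_sent_per_gpu / link_bytes_per_s
    return compute_s, comm_s


if __name__ == "__main__":
    # Placeholder hardware numbers, chosen only for illustration.
    FLOPS_PER_GPU = 1e12      # sustained FLOP/s per GPU (assumed)
    LINK_BYTES_PER_S = 1e10   # GPU-to-GPU bandwidth in bytes/s (assumed)
    OVERHEAD_S = 1e-3         # fixed cost per exchange in seconds (assumed)
    for n in (4096, 16384):
        for gpus in (1, 2, 4):
            compute_s, comm_s = estimate_fft2d_time(
                n, gpus, FLOPS_PER_GPU, LINK_BYTES_PER_S, OVERHEAD_S)
            print(f"n={n:5d} gpus={gpus} compute={compute_s:.4f}s "
                  f"comm={comm_s:.4f}s")
```

In this model the ratio of communication to computation shrinks as the grid grows, because the fixed exchange overhead is amortized and the arithmetic grows slightly faster than the data volume. Whether several GPUs actually beat one at a given size still depends on the real interconnect bandwidth, so timing both configurations on your own machine is the only reliable check.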