There is no such parameter in fftw or mkl’s fft2dfti wrapper. Why is it needed?

Because fftw and mkl are written for scalar processors and cufft is written for massively parallel gpus.

Thanks! Does that mean I have to manually divide the data for parallelism?

Or how do I select the optimal batch value?

You can certainly do single transforms (batch=1). But you can get better speed on average if you can rephrase your algorithm to perform lots of transforms together.
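As a rough sketch of what "lots of transforms together" can look like, the snippet below creates one plan with cufftPlan1d and passes the number of transforms as the batch argument, so a single cufftExecC2C call processes all of them. The sizes NX and BATCH are made-up illustrative values, the input is just zero-filled, and the data is assumed to be stored as BATCH contiguous signals of NX complex samples each.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int NX = 256;      // length of each 1D transform (illustrative value)
    const int BATCH = 1024;  // transforms executed per call (illustrative value)

    // One contiguous device buffer holding BATCH signals of NX samples each.
    cufftComplex *data = nullptr;
    size_t bytes = sizeof(cufftComplex) * NX * BATCH;
    if (cudaMalloc((void **)&data, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(data, 0, bytes);  // placeholder input: all zeros

    // The last argument is the batch count; batch=1 would be a single transform.
    cufftHandle plan;
    if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftPlan1d failed\n");
        cudaFree(data);
        return 1;
    }

    // One call runs all BATCH forward transforms in place.
    if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        fprintf(stderr, "cufftExecC2C failed\n");
    }
    cudaDeviceSynchronize();  // wait for the GPU to finish before cleanup

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```

Something like `nvcc batched_fft.cu -lcufft` should build it (the file name is hypothetical). Which batch size works best depends on the GPU and the transform length, so it is worth timing a few values on your own hardware.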

In this case, do the transforms all have to be the same length?

Does cufft show higher efficiency than a cpu fft program when batch=1?