I am using FFT with a store callback operation in batch mode. My requirement is for the store callback to crop the result and save only the cropped data, making the output buffer 1/8th the size of the FFT plan.
However, when I run the FFT with my callback, I encounter error number 6, despite not accessing the output buffer. I suspect the output buffer might be used as a work buffer and not only as an output buffer, requiring it to be the size of the FFT plan multiplied by the batch size.
Could you please confirm if this is correct and advise on how I can resolve this issue?
I also tried cufftXtMakePlanMany (cufftHandle plan, int rank, long long int *n, long long int *inembed, long long int istride, long long int idist, cudaDataType inputtype, long long int *onembed, long long int ostride, long long int odist, cudaDataType outputtype, long long int batch, size_t *workSize, cudaDataType executiontype);
with onembed[0:2] dimensions smaller than the n dimensions, but it still did not work.
In addition, I tried configuring another plan with inembed[0:2] dimensions smaller than the n dimensions. The full FFT plan was still executed, not only the sub-plan…
I should mention that there was no CUDA error when I used cufftXtMakePlanMany with those attributes.
Hi, I think you are right that the output array is used for temporary results. You cannot make it smaller than n[0] * n[1] * n[2] * batch elements, even the documentation says (link: cuFFT):
Note that the size of each dimension of the transform should be less than or equal to the inembed and onembed values for the corresponding dimension, that is n[i] ≤ inembed[i], n[i] ≤ onembed[i], where i ∈ {0,…,rank−1}.
I tried to do something similar myself without success.
As a general hint: if you want to crop the FFT output by a lot, also consider a cropped DFT (matrix multiplication) as an alternative. It lends itself better to cropping, because only the output bins you keep are ever computed, but it has higher computational complexity (with simpler math).
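To make the cropped DFT idea concrete, here is a minimal host-side C++ sketch for the 1D case. It is only an illustration of the math, not a tuned GPU implementation: the helper name `cropped_dft` and its parameters are made up for this example, and a real version would more likely build the (num_bins x N) twiddle matrix once and use a batched GEMM.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical helper (not a cuFFT function): computes only the output bins
// [first_bin, first_bin + num_bins) of an N-point forward DFT, where N = x.size().
// This is a (num_bins x N) matrix-vector product, so the cost is O(num_bins * N).
std::vector<cplx> cropped_dft(const std::vector<cplx>& x,
                              std::size_t first_bin,
                              std::size_t num_bins) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> out(num_bins);
    for (std::size_t r = 0; r < num_bins; ++r) {
        const std::size_t k = first_bin + r;
        cplx acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce k*j modulo n so the angle stays small and accurate.
            const double angle = -2.0 * pi * static_cast<double>((k * j) % n)
                                 / static_cast<double>(n);
            acc += x[j] * std::polar(1.0, angle);  // x[j] * exp(-2*pi*i*k*j/n)
        }
        out[r] = acc;
    }
    return out;
}
```

The output vector here has exactly `num_bins` elements, which is the property the store callback approach could not provide. Whether this pays off depends on how small the kept fraction is compared with log N.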
Thanks for your response about the work buffer. Unfortunately, matrix multiplication is too costly a solution.
Are there any restrictions on using my own callback to read input data and perform zero padding before the FFT transformation?
This would eliminate the need for zero padding in the global memory.
Additionally, I would like to confirm that the input data for the FFT transformation is read-only (when FFT is not in-place) and will not be affected by the FFT process.
Hi, the documentation says that all out-of-place transformations except for C2R preserve the input data. So, I think there should be no restrictions on the load callbacks. However, I am pretty sure that the output buffer must be at least the size of the full FFT.
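For the zero-padding load callback, the index logic could look roughly like the sketch below. This is plain host-side C++ that only models the mapping; it is not a cuFFT callback. In an actual plan the same logic would go into a `__device__` load callback registered with cufftXtSetCallback, with the compact buffer passed through the caller info pointer. The struct `PadInfo`, the function `load_zero_padded`, and the assumption of a contiguous 1D batched layout are all made up for this example.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// Hypothetical description of the batch layout: each signal stores only
// `valid_len` samples, while the FFT plan sees `padded_len` samples per signal.
struct PadInfo {
    const cplx* data;        // compact input, valid_len samples per signal
    std::size_t valid_len;   // samples actually stored per signal
    std::size_t padded_len;  // FFT length per signal (>= valid_len)
};

// Host-side model of a load callback body. `offset` is the element index in
// the padded layout that the FFT believes it is reading from.
cplx load_zero_padded(std::size_t offset, const PadInfo& info) {
    const std::size_t signal = offset / info.padded_len;  // which batch entry
    const std::size_t sample = offset % info.padded_len;  // index within it
    if (sample >= info.valid_len) {
        return cplx(0.0f, 0.0f);  // padding region: no memory is read
    }
    return info.data[signal * info.valid_len + sample];
}
```

With this mapping the zeros never exist in global memory; only the compact input does. The output side is unaffected, so the output buffer still has to hold the full transform.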