I need to create cuFFT plans dynamically in the main loop of my application, and I noticed that creating them causes a device synchronization, presumably because of underlying calls to cudaMalloc.
This behaviour is undesirable for me, and since stream-ordered memory allocators (cudaMallocAsync / cudaFreeAsync) have been introduced in CUDA, I was wondering if you could provide a streamed cuFFT plan allocator.
You may wish to investigate caller-allocated work areas to see if they can be adapted to your use case. Depending on how you use them, they may provide some benefit.
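For reference, a minimal sketch of the caller-allocated work area pattern, assuming a single-precision C2C transform; error checking is omitted and nx, batch, and the buffer names are illustrative, not taken from your code:

```
#include <cufft.h>
#include <cuda_runtime.h>

// Illustrative sketch only: nx and batch are placeholders.
void fft_with_user_workarea(cufftComplex *d_data, int nx, int batch) {
    cufftHandle plan;
    cufftCreate(&plan);

    // Disable cuFFT's internal work-area allocation.
    cufftSetAutoAllocation(plan, 0);

    // Plan creation reports how large a work area the transform needs.
    size_t workSize = 0;
    cufftMakePlan1d(plan, nx, CUFFT_C2C, batch, &workSize);

    // The caller supplies the work area from any allocation it controls.
    void *workArea = nullptr;
    cudaMalloc(&workArea, workSize);
    cufftSetWorkArea(plan, workArea);

    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(workArea);
}
```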
Thank you for your answer. As a matter of fact, I already allocate the work area myself using the stream-ordered memory allocators. My problem here is with the initial plan creation (cufftCreate), before any size is given.
The documentation states that cufftMakePlan1d can be called only once per plan. Because I need various and unpredictable FFT lengths and batch sizes, this forces me to destroy the plan and create a new one at each iteration.
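For concreteness, here is a sketch of the pattern I end up with, assuming single-precision C2C transforms; the lengths, batch sizes, and buffer are placeholders for my application's own values, and error checking is omitted:

```
#include <cufft.h>
#include <cuda_runtime.h>

// Sketch of the forced create/destroy cycle; d_data is assumed large
// enough for the biggest transform.
void main_loop(cufftComplex *d_data, cudaStream_t stream) {
    const int lengths[] = {256, 1024, 513, 4096};  // stand-ins for unpredictable sizes
    const int batches[] = {8, 1, 32, 4};

    for (int iter = 0; iter < 1000; ++iter) {
        int nx    = lengths[iter % 4];
        int batch = batches[iter % 4];

        cufftHandle plan;
        cufftCreate(&plan);              // plan creation is where the synchronization shows up
        cufftSetAutoAllocation(plan, 0); // work area is handled by the caller

        size_t workSize = 0;
        cufftMakePlan1d(plan, nx, CUFFT_C2C, batch, &workSize);

        // Work area comes from the stream-ordered allocator.
        void *workArea = nullptr;
        cudaMallocAsync(&workArea, workSize, stream);
        cufftSetWorkArea(plan, workArea);
        cufftSetStream(plan, stream);

        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);

        cudaFreeAsync(workArea, stream);
        cudaStreamSynchronize(stream);   // ensure the transform finished before tearing down the plan
        cufftDestroy(plan);              // the plan cannot be re-sized, so it must be recreated
    }
}
```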
I know that the cuFFT team is aware of the desire to improve create/destroy cycle performance. However, you may still wish to file a bug if you have specific requests or a specific example to consider.