Why are CUFFT plans so large, and what can be done about this?

GPU DRAM is precious enough as it is, so why are CUFFT plans half a gigabyte in size? What can be done to limit this? To clarify, I have several batched plans with different batch lengths. Is there a way to specify the batch length when the plan is executed, rather than creating a separate plan for each?

After playing around with the work size estimation functions, it seems that CUFFT requires extra work space equal to the size of the input/output arrays for the transform. Should this really be the case? Is there no way to minimize this footprint if I want to execute several otherwise identical plans with different batch lengths?

One aspect of CUFFT library overhead is initialization overhead, which includes memory. You will usually incur this on the first call to a CUFFT planning function. Any memory consumed by plans created after that first call is due to plan-specific requirements.
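To separate the one-time initialization cost from the per-plan cost on your own system, one rough approach is to compare free device memory around the first and second plan creation. This is only a sketch: the transform length (4096) and batch count (64) are arbitrary values chosen for illustration, `freeBytes` is a hypothetical helper, and the numbers it prints will vary with GPU, driver, and CUFFT version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: free device memory in bytes as reported by the runtime.
static size_t freeBytes() {
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    return freeMem;
}

int main() {
    cudaFree(0);  // force CUDA context creation so it is not counted below

    const size_t before = freeBytes();

    cufftHandle first, second;
    // First planning call: library initialization plus this plan's work area.
    if (cufftPlan1d(&first, 4096, CUFFT_C2C, 64) != CUFFT_SUCCESS) return 1;
    const size_t afterFirst = freeBytes();

    // Second, identical plan: mostly plan-specific memory.
    if (cufftPlan1d(&second, 4096, CUFFT_C2C, 64) != CUFFT_SUCCESS) return 1;
    const size_t afterSecond = freeBytes();

    printf("first plan  consumed %lld bytes\n",
           (long long)before - (long long)afterFirst);
    printf("second plan consumed %lld bytes\n",
           (long long)afterFirst - (long long)afterSecond);

    cufftDestroy(first);
    cufftDestroy(second);
    return 0;
}
```

Build with something like `nvcc probe.cu -lcufft`. The difference between the two printed values gives a rough idea of how much of the footprint is initialization rather than the plan itself.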

For the plan-specific memory, CUFFT gives you the option of managing the work area yourself, which would probably reduce the impact you are seeing. For example, if you wanted to have 6 plans set up and ready to go, but only needed to execute 1 plan at a time, you could create all 6 plans but tell CUFFT (via cufftSetAutoAllocation) that you will manage the work area yourself. You would then keep track of the largest work area requirement among the 6 plans, make a single allocation of that size, and attach it to each plan with cufftSetWorkArea before executing it. This should reduce the memory cost of creating each additional plan.
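A minimal sketch of that approach follows, assuming 1D single-precision complex transforms executed in place, one after another, on the default stream. The function name `runWithSharedWorkspace`, the transform length, and the batch counts are made up for illustration, and for brevity the error paths return without freeing anything.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

#define CHECK_CUFFT(call) do { if ((call) != CUFFT_SUCCESS) { \
    fprintf(stderr, "CUFFT error at line %d\n", __LINE__); return 1; } } while (0)
#define CHECK_CUDA(call) do { if ((call) != cudaSuccess) { \
    fprintf(stderr, "CUDA error at line %d\n", __LINE__); return 1; } } while (0)

// Hypothetical helper: one plan per batch count, all sharing one work area.
int runWithSharedWorkspace(int n, const std::vector<int>& batches, size_t* maxWorkOut) {
    std::vector<cufftHandle> plans(batches.size());
    size_t maxWork = 0;
    int maxBatch = 0;
    int dims[1] = {n};

    for (size_t i = 0; i < batches.size(); ++i) {
        size_t workSize = 0;
        CHECK_CUFFT(cufftCreate(&plans[i]));
        // Must be called before the plan is made: no automatic work area.
        CHECK_CUFFT(cufftSetAutoAllocation(plans[i], 0));
        CHECK_CUFFT(cufftMakePlanMany(plans[i], 1, dims, nullptr, 1, n,
                                      nullptr, 1, n, CUFFT_C2C, batches[i], &workSize));
        maxWork = std::max(maxWork, workSize);
        maxBatch = std::max(maxBatch, batches[i]);
    }

    // One allocation sized for the most demanding plan (at least 1 byte so
    // the pointer is never null), attached to every plan.
    void* workArea = nullptr;
    CHECK_CUDA(cudaMalloc(&workArea, std::max<size_t>(maxWork, 1)));
    for (cufftHandle plan : plans) CHECK_CUFFT(cufftSetWorkArea(plan, workArea));

    // Data buffer large enough for the largest batch; contents are zeros here.
    cufftComplex* data = nullptr;
    const size_t bytes = sizeof(cufftComplex) * (size_t)n * (size_t)maxBatch;
    CHECK_CUDA(cudaMalloc((void**)&data, bytes));
    CHECK_CUDA(cudaMemset(data, 0, bytes));

    // The plans run sequentially on the same stream, so sharing is safe.
    for (cufftHandle plan : plans)
        CHECK_CUFFT(cufftExecC2C(plan, data, data, CUFFT_FORWARD));
    CHECK_CUDA(cudaDeviceSynchronize());

    for (cufftHandle plan : plans) cufftDestroy(plan);
    cudaFree(data);
    cudaFree(workArea);
    *maxWorkOut = maxWork;
    return 0;
}

int main() {
    size_t maxWork = 0;
    if (runWithSharedWorkspace(4096, {16, 64, 256}, &maxWork) != 0) return 1;
    printf("shared work area: %zu bytes for 3 plans\n", maxWork);
    return 0;
}
```

With auto-allocation disabled, the work area costs one allocation of the largest size rather than one allocation per plan. The sharing is only valid here because no two plans ever execute at the same time.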

Of course, if you need to execute two or more plans simultaneously, then you’ll need two or more work areas. There are other caveats, but it's all spelled out in the CUFFT documentation.

Thanks txbob. I will look into this.