The CUFFT library's behavior is not uniform across transform sizes. You can get some idea of this here. Evidently, certain transform sizes cause CUFFT to decompose the problem in a way that uses more memory, so CUFFT memory usage is not perfectly proportional to transform size.
If you can pad the size up to the next size that fits the definition given for “better path”:
size = 2^a * 3^b * 5^c * 7^d
then you will likely have a better experience. In your case, the factors of your chosen size are:
1, 13, 19, 211, 247, 967, 2743, 4009, 12571, 18373, 52117, 204037, 238849, 2652481, 3876703, 50397139
Since the smallest prime factor is 13, we can tell it doesn't fit the better path. This doesn't really say anything about memory utilization by itself, but I think it is quite likely to be related. You've already indicated a possible workaround: find a larger transform size that doesn't run into the memory issue, and pad to it (a sketch of one way to compute such a size is below).
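Here is a minimal host-side sketch of how you might compute such a padded size. The helper names are mine (not part of the CUFFT API), and it assumes your length is the 50397139 implied by the factor list above; the optional `multiple` argument additionally rounds up to a multiple of a chosen increment, which is used in the next paragraph.

```cpp
#include <cstdint>
#include <cstdio>

// Returns true if n has no prime factors other than 2, 3, 5, 7.
static bool only_factors_2357(uint64_t n) {
    const uint64_t primes[] = {2, 3, 5, 7};
    for (uint64_t p : primes)
        while (n % p == 0) n /= p;
    return n == 1;
}

// Smallest multiple of 'multiple' that is >= n and fits 2^a * 3^b * 5^c * 7^d.
// With multiple = 1 this is simply the next "better path" size.
static uint64_t next_better_path_size(uint64_t n, uint64_t multiple = 1) {
    uint64_t s = ((n + multiple - 1) / multiple) * multiple;  // round up to the increment
    while (!only_factors_2357(s)) s += multiple;              // linear scan is fine at this scale
    return s;
}

int main() {
    uint64_t n = 50397139;  // = 13 * 19 * 211 * 967, per the factor list above
    printf("%llu -> %llu (next better-path size)\n",
           (unsigned long long)n, (unsigned long long)next_better_path_size(n));
    printf("%llu -> %llu (next better-path multiple of 1048576)\n",
           (unsigned long long)n, (unsigned long long)next_better_path_size(n, 1048576));
    return 0;
}
```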
As an example, I think if you pad your transforms up to the next multiple of 1048576 that fits the "pattern", you will have a better experience. Using the number you indicated, choosing 51380224 (= 49 * 1048576 = 2^20 * 7^2, which fits the pattern) should work on your 6GB GPU.
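If you want to gauge the difference before committing to a size, one option (a minimal sketch, assuming a single-precision complex-to-complex transform with batch 1, which may not match your actual setup) is to query CUFFT's estimated work-area requirement at both lengths:

```cpp
// cufftEstimate1d reports only the scratch work area (not the input/output
// buffers) and is a rough estimate; the figure at plan-creation time can differ.
// Build (roughly): nvcc estimate.cu -lcufft
#include <cufft.h>
#include <cstdio>

int main() {
    size_t ws_orig = 0, ws_padded = 0;
    cufftResult r1 = cufftEstimate1d(50397139, CUFFT_C2C, 1, &ws_orig);   // 13*19*211*967
    cufftResult r2 = cufftEstimate1d(51380224, CUFFT_C2C, 1, &ws_padded); // 2^20 * 7^2
    if (r1 != CUFFT_SUCCESS || r2 != CUFFT_SUCCESS) {
        printf("cufftEstimate1d failed (%d, %d)\n", r1, r2);
        return 1;
    }
    printf("estimated work area: %zu bytes (original), %zu bytes (padded)\n",
           ws_orig, ws_padded);
    return 0;
}
```

Keep in mind the estimate covers only CUFFT's scratch space; your input/output buffers add to the total device-memory footprint.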
And as already indicated, you will probably have better luck with your transform size on a GPU with 8GB or more of memory.
You can also file a bug, although I’m fairly confident the CUFFT library designers are aware of this phenomenon.
It's OK if you don't believe me. From what I can see, your test case runs properly on a GPU with more than 6GB of memory, and furthermore we can observe that it requires ~6GB of memory when run on a 32GB GPU. Those data points are quite convincing to me.