cufftEstimate*() memory consumption

Hello everyone,
I'm new to the cuFFT library. I'm working on a project in which I need to estimate the size of the work area needed before computing the FFT of an array. The documentation says:

“During plan execution, cuFFT requires a work area for temporary storage of intermediate results. The cufftEstimate*() calls return an estimate for the size of the work area required, given the specified parameters, and assuming default plan settings.”

But it doesn't mention anything about the memory consumption of these calls themselves. I wrote this simple piece of code:

#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <stdio.h>

int main(){
    size_t free_mem;
    size_t work_area;
    cudaMemGetInfo(&free_mem, NULL);                        // free device memory before the estimate call
    printf("free_mem: %zu\n", free_mem);
    cufftEstimate1d(98000000, CUFFT_R2C, 1, &work_area);    // estimated work area for a 98000000-point R2C transform
    printf("work_area: %zu\n", work_area);
    cudaMemGetInfo(&free_mem, NULL);                        // free device memory after the estimate call
    printf("free_mem: %zu\n", free_mem);
    return 0;
}

The output I get is:

free_mem: 1780219904
work_area: 784000512
free_mem: 1751908352

My question is: why is free memory reduced by about 30MB after calling cufftEstimate1d()? I have another program, and putting this piece of code into it and running it causes a 1GB reduction in free memory! Am I missing something? Does this really have something to do with cufftEstimate1d()? For example, when I change nx from 98000000 to 98000001, the result I get is:

free_mem: 1762394112
work_area: 8
free_mem: 167247872

which is a very strange result.

The 30MB reduction is probably due to CUFFT library initialization.

The difference in work area sizes for the two cases may be due to the fact that cuFFT uses different algorithms depending on the size of the transform, in particular on the prime factorization of the size. If the largest prime factor of the size is relatively small (say, 7 or less), then cuFFT has fast algorithms to handle those kinds of transforms, and those algorithms may require substantial extra memory to run fast.

If the largest prime factor is large (which might be the case with 98000001), then cuFFT must use a slower algorithm, and it's possible this algorithm requires little if any extra memory.

This is just a guess, however. Anyway, these don't look like problems to me.
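For what it's worth, the factorizations themselves are easy to check on the host. A minimal sketch (plain trial division, nothing cuFFT-specific, and the function name is just mine):

#include <stdio.h>

// Largest prime factor by trial division; fine for sizes in this range.
static long largest_prime_factor(long n){
    long largest = 1;
    for (long p = 2; p * p <= n; ++p){
        while (n % p == 0){ largest = p; n /= p; }
    }
    return (n > 1) ? n : largest;
}

int main(){
    printf("98000000 -> %ld\n", largest_prime_factor(98000000L));  // 7    (98000000 = 2^7 * 5^6 * 7^2)
    printf("98000001 -> %ld\n", largest_prime_factor(98000001L));  // 6073 (98000001 = 3^2 * 11 * 163 * 6073)
    return 0;
}

So 98000000 is made entirely of small primes, while 98000001 contains the factor 6073, which fits the guess above about the two sizes taking different code paths inside cuFFT.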

I'm not trying to communicate any heuristic for the memory required, and there is no point in trying to infer one. Use the estimation method provided by the API. The assumption that a pattern in a particular input parameter should give you an expectation about the memory required is not documented anywhere and may be incorrect.
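In code, "use the estimation method provided by the API" just means something like the following sketch. The feasibility check and the buffer-size arithmetic are my own illustration (standard R2C sizes: nx real inputs, nx/2+1 complex outputs), not a documented recipe:

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

int main(){
    const int nx = 98000000;             // example size, chosen only for illustration
    size_t estimate = 0, free_mem = 0, total_mem = 0;

    // Ask the API for the work-area estimate and check the status code.
    cufftResult r = cufftEstimate1d(nx, CUFFT_R2C, 1, &estimate);
    if (r != CUFFT_SUCCESS){
        fprintf(stderr, "cufftEstimate1d failed: %d\n", (int)r);
        return 1;
    }
    cudaMemGetInfo(&free_mem, &total_mem);

    // Rough feasibility check: the work area plus the input/output buffers
    // for an nx-point R2C transform must fit in what is currently free.
    size_t io_bytes = (size_t)nx * sizeof(cufftReal)
                    + ((size_t)nx / 2 + 1) * sizeof(cufftComplex);
    printf("estimate: %zu, buffers: %zu, free: %zu\n", estimate, io_bytes, free_mem);
    if (estimate + io_bytes > free_mem)
        printf("this transform probably won't fit on the device\n");
    return 0;
}

The point is simply to take whatever number the API hands back, check the cufftResult, and compare it against what the device currently has free, rather than reasoning from the transform size.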

Regarding the 1GB statement, I wouldn't try to explain the behavior of code you haven't shown. It may still be some aspect of initialization. The first use of CUDA or of library functions triggers various kinds of initialization, and that initialization carries overhead with it.
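If you want to see how much of the drop is one-time initialization rather than anything specific to cufftEstimate1d(), a sketch like this separates the two (the helper is mine, and the actual overhead is undocumented and varies with CUDA/cuFFT version):

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>

// Print free device memory with a label (helper for illustration only).
static void report(const char *label){
    size_t free_mem = 0, total_mem = 0;
    cudaMemGetInfo(&free_mem, &total_mem);
    printf("%-26s free_mem: %zu\n", label, free_mem);
}

int main(){
    size_t work_area = 0;

    cudaFree(0);                                            // force CUDA context creation
    report("after context init");

    cufftEstimate1d(98000000, CUFFT_R2C, 1, &work_area);    // first call: may include one-time cuFFT init
    report("after 1st cufftEstimate1d");

    cufftEstimate1d(98000000, CUFFT_R2C, 1, &work_area);    // second call: per-call cost only
    report("after 2nd cufftEstimate1d");

    printf("work_area: %zu\n", work_area);
    return 0;
}

Whatever disappears between the first two readings but not between the last two is initialization overhead, not something the estimate call needs each time.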

Thanks for the answer.
About the 1GB statement: the second result, which shows a reduction of about 1GB in free memory, is for the same code I provided. The only difference, as mentioned, is that I incremented the input 98000000 by one and ran the code again. With your answer in mind, it is clear that the largest prime factor of 98000001 is greater than 7, so it makes sense that the work area size would be much larger than in the first case. Two problems, though: first, the result returned in work_area is obviously incorrect; second, why is that much memory consumed just by calling cufftEstimate1d()? 30MB for library initialization makes sense, but 1GB?

EDIT: Having read my post again, I understand why one wouldn't realize that the second result is for the same code. I apologize for the ambiguity.

That’s not a 1GB reduction in free memory. That’s 100MB.

And when I run your program on CUDA 10.0, I see a 30MB reduction in free memory whether I use 98000000 or 98000001.

$ cat t2.cu
#include<cuda_runtime.h>
#include<cufft.h>
#include<cufftXt.h>
#include<stdio.h>

int main(){
    size_t free_mem;
    size_t work_area;
    cudaMemGetInfo(&free_mem, NULL);
    printf("free_mem: %zu\n", free_mem);
    cufftResult r = cufftEstimate1d(980000001, CUFFT_R2C, 1, &work_area);
    printf("r = %d\n", (int)r);
    printf("work_area: %zu\n", work_area);
    cudaMemGetInfo(&free_mem, NULL);
    printf("free_mem: %zu\n", free_mem);
    return 0;
}
$ nvcc -o t2 t2.cu -lcufft
$ ./t2
free_mem: 31569018880
r = 0
work_area: 0
free_mem: 31535464448
$

And regarding the work area size, I didn't say it would be larger for the 98000001 case. Re-read my answer.

My suggestion would be that you not worry too much about trying to understand the logic of the memory allocations, since the details are entirely unpublished, and instead just do the work you’d like to do.