cufftGetSize1d fails with a CUFFT_ALLOC_FAILED error

I have a caching scheme to manage the FFT workspace myself, because a large number of different FFTs are applied and sharing the workspace substantially cuts down on memory usage. As part of this, I use cufftGetSize1d(...) to determine the workspace size. However, if I am low on GPU memory, it returns a CUFFT_ALLOC_FAILED error.

A Stack Overflow question on the same topic, "What is the meaning of CUFFT_ALLOC_FAILED return value when calling cufftGetSize*()?", concluded that the error means the allocation would fail, since cufftGetSize1d doesn't actually allocate any memory.

In the debugger, I can estimate the required workspace to be about 1 GB, based on the workspace sizes of similar FFTs already created. There is less than 1 GB available, which I suspect is the reason for the error. My problem is that I need to know that number even when there isn't enough memory available: my application can free GPU memory to make room, it just needs to know how much to free. In this specific case, the workspace already allocated is actually big enough, so I don't even need to allocate more, but the program can't know that since cufftGetSize1d errors out rather than returning the answer. Is there any way to get this workspace size, even when the GPU is low on memory?

Even if the method returned 16 PB as an answer (which is absolutely ridiculous), that would still be useful even though I obviously don't have that much memory, because at least there would be a value to report in the error message, telling the user they need a GPU with xxx GB of memory to process the supplied data/configuration.
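For what it's worth, one fallback I'm considering is cufftEstimate1d, which takes no plan handle and (as I understand it) performs no device allocation, so it should return an upper-bound estimate even when memory is exhausted. A sketch, not yet verified on a low-memory GPU:

```cpp
#include <cstdio>
#include <cufft.h>

int main() {
    const int nx = 1048576 * 32 + 1;  // transform length
    const int batch = 16;             // number of transforms
    size_t estimate = 0;
    // cufftEstimate1d needs no handle, so it cannot depend on plan state
    // or (presumably) on how much device memory is currently free.
    cufftResult r = cufftEstimate1d(nx, CUFFT_C2C, batch, &estimate);
    if (r == CUFFT_SUCCESS)
        printf("estimated work area: %zu bytes\n", estimate);
    else
        printf("cufftEstimate1d returned %d\n", (int)r);
    return 0;
}
```

The documented caveat is that the estimate can be larger than the size the actual plan ends up needing, but an upper bound would be enough for my error reporting.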

Can you provide a short, complete example of what fails?

When I think of user-managed workspace allocations in the 1D case, the API sequence I would expect is:

cufftHandle p;
cufftCreate(&p);
cufftSetAutoAllocation(p, 0);
size_t ws;
unsigned char *wsp;
cufftMakePlan1d(p, ..., &ws);
cudaMalloc(&wsp, ws);

Is that what you are doing?

Correct. For completeness, here is the bulk of the code. cudaCheck and cufftCheck are just macros that throw exceptions when the respective success code is not returned.

cufftHandle handle;
size_t size = 0;
void* workarea = nullptr;
cufftCheck(cufftCreate(&handle));
cufftCheck(cufftSetAutoAllocation(handle, 0));
cufftCheck(cufftPlan1d(&handle, cols, type, rows));
cufftCheck(cufftGetSize1d(handle, cols, type, rows, &size));
cufftCheck(cufftSetStream(handle, stream));
cudaCheck(cudaMalloc(&workarea, size));
cufftCheck(cufftSetWorkArea(handle, workarea));

The full code is a bit more complicated as it caches the plans, and resets the workarea for all cached plans in the event the workarea must grow for a newly created plan. I think this issue only occurs when nearly all available memory on the GPU is consumed. In my current case, I am using ~11.6 GB of GPU memory when the error occurs, and the CUDA properties report ~31 MB free memory.
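As a sketch of the growth logic described above (hypothetical structure and names; error checking via cudaCheck/cufftCheck omitted for brevity):

```cpp
#include <vector>
#include <cufft.h>
#include <cuda_runtime.h>

// All cached plans share one work area that grows as needed.
struct WorkAreaCache {
    std::vector<cufftHandle> plans;
    void* workarea = nullptr;
    size_t capacity = 0;

    // Called after cufftGetSize1d reports the size a new plan needs.
    void growTo(size_t needed) {
        if (needed <= capacity) return;
        cudaFree(workarea);             // release the old, smaller area
        cudaMalloc(&workarea, needed);  // allocate the larger one
        capacity = needed;
        for (cufftHandle p : plans)     // re-point every cached plan
            cufftSetWorkArea(p, workarea);
    }
};
```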

Your sequence doesn’t match mine.

cufftCreate initializes a handle.
cufftSetAutoAllocation sets a parameter of that handle
cufftPlan1d initializes a handle.

Do you see the issue?

My sequence:

cufftCreate(&p);    //initializes handle
cufftSetAutoAllocation(p, 0);  //updates existing handle
size_t ws;
unsigned char *wsp;
cufftMakePlan1d(p, ..., &ws); //updates existing handle

Here’s an example demonstrating the difference:

$ cat t2248.cu
#include <iostream>
#include <cufft.h>
#include <unistd.h>
#include <cassert>

int main(){
  const int nx = 1048576*32+1;
  const int ny = 16;
  size_t ws = 0;
  cufftHandle p;
  cufftResult r;
  r = cufftCreate(&p);
  assert(r == CUFFT_SUCCESS);
  r = cufftSetAutoAllocation(p, 0);
  assert(r == CUFFT_SUCCESS);
#ifdef USE_MY_METHOD
  r = cufftMakePlan1d(p, nx, CUFFT_C2C, ny, &ws);
#else
  r = cufftPlan1d(&p, nx, CUFFT_C2C, ny);
#endif
  assert(r == CUFFT_SUCCESS);
  std::cout << "ws = " << ws << std::endl;
  size_t mfree, mtot;
  cudaMemGetInfo(&mfree, &mtot);
  std::cout << "free memory: " << mfree << std::endl;
  //sleep(32);
}
$ nvcc -o t2248 t2248.cu -lcufft
$ ./t2248
ws = 0
free memory: 15478554624
$ nvcc -o t2248 t2248.cu -lcufft -DUSE_MY_METHOD
$ ./t2248
ws = 17199267840
free memory: 32679395328
$

When we use your sequence, the call to cufftSetAutoAllocation(..., 0) doesn't have the desired effect: plan creation still allocates ~17 GB for this particular transform (the GPU above is a V100 32 GB). When we use my sequence, it does have the desired effect: plan creation doesn't allocate space for the work area.

It is true that cufftSetWorkArea() should override the initial allocation, but in a memory-constrained setting you may still be making your life difficult, because the override in your example does not happen until after you have already allocated additional space:

cudaCheck(cudaMalloc(&workarea, size));
cufftCheck(cufftSetWorkArea(handle, workarea));
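Putting it together, a sequence I'd suggest (mirroring your variable names, but using cufftMakePlan1d so that no interim allocation happens — and note cufftGetSize1d is no longer needed, since the make call reports the size directly):

```cpp
cufftHandle handle;
size_t size = 0;
void* workarea = nullptr;
cufftCheck(cufftCreate(&handle));
cufftCheck(cufftSetAutoAllocation(handle, 0));
// cufftMakePlan1d (unlike cufftPlan1d) operates on the existing handle,
// so the auto-allocation setting is respected and the work size is
// returned without allocating anything:
cufftCheck(cufftMakePlan1d(handle, cols, type, rows, &size));
cufftCheck(cufftSetStream(handle, stream));
cudaCheck(cudaMalloc(&workarea, size));
cufftCheck(cufftSetWorkArea(handle, workarea));
```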

FWIW, I didn’t have any luck reproducing “cufftGetSize1d fails with a CUFFT_ALLOC_FAILED error” using my method, even after cutting the free memory down to ~128 MB:

$ cat t2248.cu
#include <iostream>
#include <cufft.h>
#include <unistd.h>
#include <cassert>

int main(){
  const int nx = 1048576*32+1;
  const int ny = 32;
  size_t ws = 0;
  size_t *wsp;
  cufftHandle p;
  cufftResult r;
  r = cufftCreate(&p);
  assert(r == CUFFT_SUCCESS);
  r = cufftSetAutoAllocation(p, 0);
  assert(r == CUFFT_SUCCESS);
#ifdef USE_MY_METHOD
  r = cufftMakePlan1d(p, nx, CUFFT_C2C, ny, &ws);
#else
  r = cufftPlan1d(&p, nx, CUFFT_C2C, ny);
#endif
  assert(r == CUFFT_SUCCESS);
  std::cout << "ws = " << ws << std::endl;
  size_t mfree, mtot;
  cudaMemGetInfo(&mfree, &mtot);
  std::cout << "free memory: " << mfree << std::endl;
  cudaError_t cr = cudaMalloc(&wsp, mfree - 1048576*128);
  assert(cr == cudaSuccess);
  cudaMemGetInfo(&mfree, &mtot);
  std::cout << "free memory: " << mfree << std::endl;
  r = cufftGetSize1d(p, nx, CUFFT_C2C, ny, &ws);
  std::cout << " r = " << (int)r << std::endl;
  std::cout << "ws = " << ws << std::endl;
  //sleep(32);
}
$ nvcc -o t2248 t2248.cu -lcufft -DUSE_MY_METHOD
$ ./t2248
ws = 34398535680
free memory: 32679395328
free memory: 133693440
 r = 0
ws = 256
$

I probably wouldn’t be able to comment further without a complete test case. That test case cannot be your whole code. It needs to be crafted as a directed test, like my example above, that demonstrates the issue.

You are correct. I did not notice that subtle difference, nor did I know about the difference between cufftPlan1d and cufftMakePlan1d. This improved the design of my FFT wrapper, and there is no need to call cufftGetSize1d now. I am guessing this will yield a speedup as well, since those extra allocations no longer happen during plan generation. Thanks for the assistance!