[SOLVED] cuFFT not liking a given length (error 2), but will accept larger work

Good morning, all.

I am working on a dataset that needs FT and its dimensions are:

  • Transform length: 1252
  • Batch: 210945

If we try these numbers in this simple program, we get cuFFT error 2 (CUFFT_ALLOC_FAILED), even though the cudaMalloc for the data itself returns 0 (cudaSuccess):

#include <cufft.h>
#include <cuda_runtime.h>
#include <iostream>

using namespace std;

int main(void)
    {
    int NX = 1252,
        BATCH = 210945;
    cufftHandle c2c_handle;
    cufftComplex *dev_complex;

    // plan creation fails with 2 (CUFFT_ALLOC_FAILED)
    cout << "cuFFT return value: " << cufftPlan1d(&c2c_handle, NX, CUFFT_C2C, BATCH) << endl;

    // yet the data allocation itself succeeds (returns 0, cudaSuccess)
    cout << "CUDA return value: " << cudaMalloc((void **) &dev_complex, (size_t) NX * BATCH * sizeof(cufftComplex)) << endl;

    cudaFree(dev_complex);

    return 0;
    }
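
(For reference, I'm compiling with something along the lines of: nvcc test.cu -o test -lcufft)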

If we try 1252 with a batch of 160945, that is, 50000 fewer in the batch, it works.
4096 samples with a batch of 65536 works (a larger total than 1252 × 210945).
4096 samples with a batch of 98304 also works (larger still).
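
For reference (sizeof(cufftComplex) is 8 bytes), the raw data allocations work out to:

  • 1252 × 210945 × 8 bytes ≈ 1.97 GiB (fails)
  • 4096 × 65536 × 8 bytes = 2 GiB (works)
  • 4096 × 98304 × 8 bytes = 3 GiB (works)

So the failing case actually needs the smallest data allocation of the three, which is what confuses me.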

Because it works with larger sizes, my guess is that cuFFT is not liking 1252 x 210945. Reading this post:
https://devtalk.nvidia.com/default/topic/1026698/gpu-accelerated-libraries/large-data-size-for-cufft/

That post suggests the dimensions should factor into small primes (2, 3, 5, 7…) for optimal performance, not as a requirement. I also couldn’t find any explicit limitation mentioned in the documentation, and the 64M/128M conclusion of the OP there probably comes from:
https://devtalk.nvidia.com/default/topic/520201/cufft-size/
https://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and-batches-for-fft-with-scikits-cuda

That limit doesn’t seem to apply anymore, given that the larger allocations above pass fine.
Do you guys know what is happening?

System is:
CUDA 9.1, Ubuntu 16.04, GTX 1080Ti, driver 390.81
CUDA 9.1, RHEL 6.10, GRID P40-8Q, driver 390.75

Don’t know what you mean by “the larger allocations passing fine”

Are you assuming you can infer what the necessary temporary allocation size would be based on the transform size and batch size?

I would suggest:

  1. You can’t, most likely.
  2. It should be measured for any credibility/certainty. It’s possible to get an idea of what is going on: read the CUFFT docs completely, especially the sections about work size and the API variants that allow you to take control of the work size (see the sketch after this list).
  3. It is almost certainly not directly correlated to something as simple as transform size.
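
To illustrate what I mean by taking control of the work size, here is a rough sketch (untested, error checking omitted) using the plan-creation path that separates planning from work-area allocation:

cufftHandle plan;
size_t work_size = 0;
void *work_area = NULL;

cufftCreate(&plan);
cufftSetAutoAllocation(plan, 0);                             // don't let CUFFT allocate the work area itself
cufftMakePlan1d(plan, 1252, CUFFT_C2C, 210945, &work_size);  // planning now just reports the required size
cout << "work area needed: " << work_size << " bytes" << endl;
cudaMalloc(&work_area, work_size);                           // allocate it ourselves
cufftSetWorkArea(plan, work_area);                           // hand it to the plan
// ... cufftExecC2C(plan, ...) as usual, then cufftDestroy(plan) and cudaFree(work_area)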

Hint: one of the prime factors of 1252 is 313.

The CUFFT API may need considerably more temporary space, depending on the specific factorization pattern of the transform size. A larger transform size does not necessarily mean more temp space will be needed.

On my Tesla V100 32GB GPU, your code seems to run correctly, and watching nvidia-smi it appears to need about 15GB of space on the GPU. About 2GB of that is the cudaMalloc allocation; most of the remainder is attributable to the cufft plan's needs. The error reported by CUFFT (an allocation failure) is precisely what is happening: your GPUs simply don't have that much memory.

Here is a recent thread demonstrating the use of cufftEstimate1D:

https://devtalk.nvidia.com/default/topic/1056420/gpu-accelerated-libraries/cufftestimate-memory-consumption/

Try adding this to the beginning of your program:

size_t s = 0;
cufftEstimate1d(1252, CUFFT_C2C, 210945, &s);
cout << "cuFFT estimated size: " << s << endl;
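
If you want a number tied to an actual plan rather than the heuristic, cufftGetSize1d should give a more refined estimate (sketch, return values unchecked):

cufftHandle h;
size_t refined = 0;
cufftCreate(&h);
cufftGetSize1d(h, 1252, CUFFT_C2C, 210945, &refined);
cout << "cuFFT refined size: " << refined << endl;
cufftDestroy(h);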

I was indeed assuming that the transform’s allocation size was directly determined by NX * BATCH, so in my head, if 4096 * 98304 went through without error, why wouldn’t 1252 * 210945…
I definitely see the impact of the factorization mentioned in your first reply.
Changing to 2048 will require an estimated 3GB (versus the ~12GB estimated for 1252).
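
In case it helps someone else, I'm now scanning candidate lengths with a loop like this (sketch, return values unchecked):

for (int nx = 1252; nx <= 2048; ++nx)
    {
    size_t s = 0;
    cufftEstimate1d(nx, CUFFT_C2C, 210945, &s);
    if (s < ((size_t) 4 << 30))        // only print candidates needing less than ~4GB of work area
        cout << nx << ": " << s << " bytes" << endl;
    }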

Thanks a lot for your help, sir. I’m back on track now.

NX = 1254 also results in a relatively low allocation size (~2GB)

prime factors of 1254: 2,3,11,19
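
A quick way to check a candidate length's factorization (plain trial division, nothing CUFFT-specific):

void print_factors(int n)
    {
    for (int p = 2; p <= n; ++p)
        while (n % p == 0)
            {
            cout << p << " ";
            n /= p;
            }
    cout << endl;
    }

// print_factors(1252) prints: 2 2 313
// print_factors(1254) prints: 2 3 11 19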

If you’re OK with 2048 it may still run faster than NX=1254

That’s what I am probably going to do, yes. If the number of samples is more than 1024 but less than 2048, pad to 2048.
This information was critical because it also means that the way I was predicting the work size (input, output, intermediate data…) was wrong. Better late than never…
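
For the record, the padding rule amounts to rounding each signal length up to the next power of two and zero-filling the tail; a minimal sketch (next_pow2 is just a name I made up):

int next_pow2(int n)
    {
    int p = 1;
    while (p < n)
        p <<= 1;
    return p;
    }

int NX = next_pow2(1252);   // 2048
// allocate NX * BATCH elements, cudaMemset the whole buffer to zero,
// then copy each 1252-sample signal into the front of its 2048-sample slot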