CUFFT_INTERNAL_ERROR during creation of a 1D Plan in CUFFT

Hi everyone!

I’m trying to develop a parallel version of Toeplitz Hashing using FFT on GPU, in CUFFT/CUDA.

And when I try to create a CUFFT 1D plan, I get an error that is not very explicit (CUFFT_INTERNAL_ERROR)…

The way I create the plan with cufftPlan1d is the following:

cufftResult cufft_result;
cufftHandle plan;

cufft_result = cufftPlan1d( &plan, data_block_length, CUFFT_Z2Z, 1 );

if( cufft_result != CUFFT_SUCCESS ) {
   printf( "CUFFT Error (%s)\n", cufftGetErrorString( cufft_result ) );
   exit(-1);
}

The data_block_length (an unsigned long int) is equal to 50397139 and I’m using a BATCH of 1 in CUFFT. I verified that for some larger sizes I tried, I do not get any errors, so I’m almost sure it is not a memory problem. In particular, an array of complex numbers with double precision of this size should need 16 (2 x 8) x 50397139 = 806354224 bytes, which is approximately 0.75 gigabytes, and I have a GPU with 6 gigabytes. I also performed all the steps required beforehand: cudaMalloc, cudaMemcpy, etc.
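
For context, a simplified sketch of that setup (not the actual prototype code, just an illustration of the allocation and copy done before the plan creation) looks roughly like this:

// Illustrative sketch only: allocate and upload a double-complex signal of the
// same length before creating the plan (the real host data comes from the
// Toeplitz hashing pipeline; here it is just zero-initialized).
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
	const size_t data_block_length = 50397139;
	const size_t num_bytes = data_block_length * sizeof(cufftDoubleComplex); // ~0.75 GB

	std::vector<cufftDoubleComplex> host_signal(data_block_length, cufftDoubleComplex{0.0, 0.0});

	cufftDoubleComplex *device_signal = nullptr;
	if (cudaMalloc(&device_signal, num_bytes) != cudaSuccess) return -1;
	if (cudaMemcpy(device_signal, host_signal.data(), num_bytes,
	               cudaMemcpyHostToDevice) != cudaSuccess) return -1;

	cufftHandle plan;
	if (cufftPlan1d(&plan, (int) data_block_length, CUFFT_Z2Z, 1) != CUFFT_SUCCESS) return -1;

	cufftDestroy(plan);
	cudaFree(device_signal);
	return 0;
}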

However, after reading some forums and documentation, I still have some doubts about a few aspects:

  1. It is often recommended to use batches greater than 1 in CUFFT. In that case I would need to use a transform size equal to (data_block_length / BATCH) when I call cufftPlan1d. Is it always worthwhile to use batches for CUFFT plans?
  2. In some documentation and information I read, I noticed a lot of people mention using power-of-2 sizes. Is this a requirement? Can I never use a CUFFT plan with a size that is not a power of 2?

I also tried these two options: zero-padding the remaining positions of the array to obtain a size that is a power of 2, and using batches, both with the original size and with the zero-padded power-of-2 sizes.

In these trials, I do not get any error, but I get wrong results. I also developed other versions of my program (in Python using SciPy functions to compute the Toeplitz matrices and the FFT, a serial C++ version using FFTW, and a parallel C++ version using FFTW with OpenMP), and I got different results from those versions (with an error around 50%, which suggests that the computation and results may be essentially random)…

Can someone help me? I’m feeling a little lost, and the documentation does not always help much… :(

Thank you in advance!

I don’t get any error when doing cufftPlan1d(&plan, 50397139, CUFFT_Z2Z, 1 ); on CUDA 11.4.

Speaking for myself, if I had an FFT of length n that I needed to do, I would never try to break it up into smaller-length FFTs just so I could increase the batch parameter. The main objective with CUFFT should be to launch as much work as possible with each CUFFT exec call. If you have multiple FFTs to do, it is better to batch them up, if possible, than to do multiple independent CUFFT exec calls, one for each FFT.
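
As a purely illustrative sketch (the length and count below are made-up placeholders, not taken from your code), batching several equal-length transforms into one plan and one exec call would look something like this:

// Hedged sketch: 8 independent Z2Z transforms of length 65536 each, laid out
// contiguously in one device buffer and executed with a single plan/exec call.
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
	const int signal_length = 65536;   // placeholder transform length
	const int num_signals   = 8;       // placeholder batch count

	cufftDoubleComplex *data = nullptr;
	cudaMalloc(&data, sizeof(cufftDoubleComplex) * signal_length * num_signals);

	cufftHandle plan;
	// One plan covering all num_signals transforms (batch parameter > 1).
	cufftPlan1d(&plan, signal_length, CUFFT_Z2Z, num_signals);

	// One exec call transforms all batched signals in place.
	cufftExecZ2Z(plan, data, data, CUFFT_FORWARD);

	cufftDestroy(plan);
	cudaFree(data);
	return 0;
}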

Again, speaking for myself, I don’t think I would attempt to somehow subdivide an FFT. If you have an FFT of length that is not a power of 2, you should probably just do it. If a small amount of padding will bring it up to a power of 2, that might be OK.

In other respects, it’s usually enough to be aware that CUFFT will often give the best performance when the FFT size is a power of 2. For other combinations of small integer factors of the size, it probably is not going to make enough difference to be worth considering how to refactor it. If you have an FFT whose length is, for example, a large prime number or a product of two large prime numbers, that might be the worst case performance-wise, but it will still work. And without considering what is involved in refactoring to use another approach, it’s not possible to say that refactoring it (somehow) would be any better.

It’s not possible to say what the problem may be, based on what you have provided/shown.

I tried installing CUDA 11.8 and NVIDIA driver 520.61.01 (basically the latest versions), and I got the same error.


I cannot share too much of the original code because it is for a prototype, but I developed a much simpler version where I just test the creation of the plan with the same parameters, and I still get the same error.

The code is the following:

#include <cstdio>      // printf
#include <cstdlib>     // exit
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cuda_device_runtime_api.h>
#include <cufft.h>

#ifdef _CUFFT_H_

	static const char *cufftGetErrorString( cufftResult cufft_error_type ) {

		switch( cufft_error_type ) {

			case CUFFT_SUCCESS:
				return "CUFFT_SUCCESS: The CUFFT operation was performed";

			case CUFFT_INVALID_PLAN:
				return "CUFFT_INVALID_PLAN: The CUFFT plan to execute is invalid";

			case CUFFT_ALLOC_FAILED:
				return "CUFFT_ALLOC_FAILED: The allocation of data for CUFFT in memory failed";

			case CUFFT_INVALID_TYPE:
				return "CUFFT_INVALID_TYPE: The data type used by CUFFT is invalid";

			case CUFFT_INVALID_VALUE:
				return "CUFFT_INVALID_VALUE: The data value used by CUFFT is invalid";

			case CUFFT_INTERNAL_ERROR:
				return "CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT";

			case CUFFT_EXEC_FAILED:
				return "CUFFT_EXEC_FAILED: The execution of a plan by CUFFT failed";

			case CUFFT_SETUP_FAILED:
				return "CUFFT_SETUP_FAILED: The setup of CUFFT failed";

			case CUFFT_INVALID_SIZE:
				return "CUFFT_INVALID_SIZE: The size of the data to be used by CUFFT is invalid";

			case CUFFT_UNALIGNED_DATA:
				return "CUFFT_UNALIGNED_DATA: The data to be used by CUFFT is unaligned in memory";

		}

		return "Unknown CUFFT Error";

	}

#endif

#define BATCH 1


int main() {

	unsigned long int data_block_length = 50397139;

	cufftResult cufft_result;

	cufftHandle plan;

	cufft_result = cufftPlan1d(&plan, data_block_length, CUFFT_Z2Z, BATCH );


	if( cufft_result != CUFFT_SUCCESS ) {

	   printf( "CUFFT Error (%s)\n", cufftGetErrorString( cufft_result ) );

	   exit(-1);

	}

	return 0;

}

The error I got is the following one:

CUFFT Error (CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT)

Thanks, and sorry to bother you!

I have no issue with 11.8 & 520.61.05 on Kubuntu 22.04.

Maybe you’re linking the wrong libs???

(base) belt@orion:~$ ldd cufft_exe 
        linux-vdso.so.1 (0x00007ffda85df000)
        libcufft.so.10 => /usr/local/cuda/lib64/libcufft.so.10 (0x00007f400fe00000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f400fbd6000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f400f9ae000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f4020dc0000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4020ce4000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f400f8c7000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4020cdd000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f400f8c2000)

I compiled the above example in Ubuntu 20.04 with the following command:

nvcc test.cu -o test -lcufft

I also ran the command:

ldd test

And I got the following output:

linux-vdso.so.1 (0x00007ffc3b0fe000)
libcufft.so.10 => /usr/local/cuda-11.8/lib64/libcufft.so.10 (0x00007fe81565c000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fe815621000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fe8155fe000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fe8155f8000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fe815416000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe815224000)
/lib64/ld-linux-x86-64.so.2 (0x00007fe8265e5000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe8150d3000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fe8150b8000)

Am I doing anything wrong? :(

I also saw some people with the same problem, saying that CUFFT only runs well with nx up to 1024. Is this true?

No, it’s not true. That thread is 14 years old. What GPU are you running this on? Do any other CUDA codes run normally in that setup?

Yes, I have already run a lot of CUDA code, and it has always run well.

For example, I have this simple code:

#include <iostream>
#include <unistd.h>
#include <cuda.h>
using namespace std;
#define GPU_DEVICE_ID 0
#define GRID_SIZE 2
#define BLOCK_SIZE 16


__global__ void hello_world() {

	int cuda_thread_id = (blockIdx.x * blockDim.x) + threadIdx.x;

	# if (__CUDA_ARCH__ >= 200)
    	printf("Hello World in thread no. %d!\n", cuda_thread_id);
	# endif

}


int main() {

	int num_gpu_devices_available = 0;

	cudaGetDeviceCount(&num_gpu_devices_available);

	cout << "\n*** There are " << num_gpu_devices_available <<
			" GPU Device(s) available in this local machine! ***\n" << endl;

	if (num_gpu_devices_available > 0) {

		cudaSetDevice(GPU_DEVICE_ID);

	}
	else {

		cout << "\n*** No GPU Device(s) available in this local machine! ***\n"
			 << endl;

		return -1;

	}

	cout << "\n*** It will be launched, in parallel, a GPU Grid with " <<
			GRID_SIZE << " Block(s),\n    where each one contains " << BLOCK_SIZE <<
			" local CUDA Thread(s), what will result in a total of " <<
			(GRID_SIZE * BLOCK_SIZE) << " CUDA Thread(s)! ***\n" << endl;

	hello_world<<<GRID_SIZE, BLOCK_SIZE>>>();

	cudaDeviceReset();

	cout << "\n" << endl;

	return 0;

}

And I got the following output which shows the code runs as expected:

The output shows a grid of 2 blocks with 16 CUDA threads each, a total of 32 CUDA threads, each one printing a message with its thread index.

Regarding the GPU, it is a GeForce RTX 3060 (NVIDIA Corporation Device 2560) in a laptop (Lenovo Legion 5 15ACH6H).

This particular plan configuration seems to require more than the 6GB of memory that is available on your GPU.

It will work if you run it on a GPU with for example 32GB of memory.

On a V100 GPU with 32GB, and CUDA 11.4, the plan creation here results in a used memory of 6497MiB as reported by nvidia-smi. So it may work in an 8GB GPU and should work in a 16GB GPU.
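
If you want to see this on your own machine, one rough way to check (a sketch; cufftEstimate1d only returns an estimate of the work area, and the actual allocation can differ) is to compare the estimated work size against the free memory reported by the runtime:

// Sketch: estimate the cuFFT work area for this transform and compare it with
// the currently free device memory before attempting to create the plan.
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
	const int nx = 50397139;

	size_t work_size = 0;
	cufftEstimate1d(nx, CUFFT_Z2Z, 1, &work_size);   // estimated work area in bytes

	size_t free_bytes = 0, total_bytes = 0;
	cudaMemGetInfo(&free_bytes, &total_bytes);

	printf("estimated work area: %zu bytes\n", work_size);
	printf("free / total device memory: %zu / %zu bytes\n", free_bytes, total_bytes);
	return 0;
}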

And what is the justification for that?

For example, why do I get no error when I try the same simple code, where I just create the plan, after changing the value of data_block_length to 67108864? 67108864 is greater than the initial 50397139, right? I would expect to need more GPU resources in that case, and yet it runs without error. I tried other values like 48828125, and it also ran well. To be honest, I think there is a more mathematical justification for this than the resource-based one you pointed out…

And more than that, I’m not even specifying which type of data I’m using at this point of the code. So, I do not agree with your justification…

And even supposing the worst case, with cufftDoubleComplex, which is basically 16 bytes (8 bytes for each double-precision component, i.e., the real and imaginary parts), 16 x 50397139 = 806354224 bytes (around 0.75 GB), much lower than the 32GB of memory you mentioned…

CUFFT library behavior is not completely “uniform” independent of transform size. You can get some idea of this here. Evidently, certain transform sizes cause CUFFT to decompose the problem in a way that uses more memory. The end result is that CUFFT memory usage is not perfectly proportional to transform size.

If you can pad the size up to the next size that fits the definition given for “better path”:

size = 2^a*3^b*5^c*7^d

then you will likely have a better experience. In your case, the factors of your chosen size are:

1, 13, 19, 211, 247, 967, 2743, 4009, 12571, 18373, 52117, 204037, 238849, 2652481, 3876703, 50397139

Since the smallest prime factor is 13, we can tell it doesn’t fit the best path. This doesn’t directly say anything about memory utilization, but I think it is quite likely to be related. You’ve already indicated a possible workaround - find a larger transform size that doesn’t run into the memory issue, and pad it.

As an example, I think if you padded your transforms up to the next increment of 1048576 that fits the “pattern”, you will have a better experience. For example, using the number you indicated, if we choose 51380224 (= 49 * 1048576), then that should work on your 6GB GPU.
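
A small helper to find that padded size (just a sketch of the arithmetic, not a cuFFT API) could look like this:

// Hedged sketch: find the next multiple of 1048576 (2^20) whose remaining
// multiplier is 7-smooth, so the padded size has only factors 2, 3, 5 and 7.
#include <cstdio>

static bool is_7_smooth(long long m) {
	const long long primes[] = {2, 3, 5, 7};
	for (long long p : primes)
		while (m % p == 0) m /= p;
	return m == 1;
}

static long long next_padded_size(long long n) {
	long long k = (n + 1048575) / 1048576;   // round up to a multiple of 2^20
	while (!is_7_smooth(k)) ++k;             // make the multiplier 7-smooth too
	return k * 1048576;
}

int main() {
	printf("%lld\n", next_padded_size(50397139));  // prints 51380224 (= 49 * 2^20)
	return 0;
}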

And as already indicated, you will probably have better luck with your transform size on a GPU with 8GB or more of memory.

You can also file a bug, although I’m fairly confident the CUFFT library designers are aware of this phenomenon.

It’s OK if you don’t believe me. As far as I am concerned, your test case runs properly on a GPU with more than 6GB of memory, and furthermore we can observe that your test case requires ~6GB of memory on a GPU with 32GB. Those data points are quite convincing for me.