`CUFFT_INTERNAL_ERROR` from `cufftPlan1d` and `cufftPlan2d` at any size

I wanted to use cuFFT to accelerate a 2D convolution (conv2d) with the code below:

cufftResult planResult = cufftPlan2d(&data_plan[idx_n*c + idx_c], Nh, Nw, CUFFT_Z2Z);
if (planResult != CUFFT_SUCCESS) {
	printf("CUFFT plan creation failed: %d\n", planResult);
	// Handle the error appropriately
}
cufftSetStream(data_plan[idx_n*c + idx_c], stream_data[idx_n*c + idx_c]);

This returned CUFFT_INTERNAL_ERROR. I then tried the simpler test program below:

#include <cstdio>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cufft.h>

#ifdef _CUFFT_H_
static const char *cufftGetErrorString( cufftResult cufft_error_type ) {
	switch( cufft_error_type ) {
		case CUFFT_SUCCESS:
			return "CUFFT_SUCCESS: The CUFFT operation was performed";
		case CUFFT_INVALID_PLAN:
			return "CUFFT_INVALID_PLAN: The CUFFT plan to execute is invalid";
		case CUFFT_ALLOC_FAILED:
			return "CUFFT_ALLOC_FAILED: The allocation of data for CUFFT in memory failed";
		case CUFFT_INVALID_TYPE:
			return "CUFFT_INVALID_TYPE: The data type used by CUFFT is invalid";
		case CUFFT_INVALID_VALUE:
			return "CUFFT_INVALID_VALUE: The data value used by CUFFT is invalid";
		case CUFFT_INTERNAL_ERROR:
			return "CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT";
		case CUFFT_EXEC_FAILED:
			return "CUFFT_EXEC_FAILED: The execution of a plan by CUFFT failed";
		case CUFFT_SETUP_FAILED:
			return "CUFFT_SETUP_FAILED: The setup of CUFFT failed";
		case CUFFT_INVALID_SIZE:
			return "CUFFT_INVALID_SIZE: The size of the data to be used by CUFFT is invalid";
		case CUFFT_UNALIGNED_DATA:
			return "CUFFT_UNALIGNED_DATA: The data to be used by CUFFT is unaligned in memory";
	}
	return "Unknown CUFFT Error";
}
#endif

#define BATCH 1

int main() {
	// unsigned long int data_block_length = 50397139;
    unsigned long int data_block_length = 1024;
	cufftResult cufft_result;
	cufftHandle plan;
	// cufft_result = cufftPlan1d(&plan, data_block_length, CUFFT_Z2Z, BATCH );
	cufft_result = cufftPlan2d(&plan, data_block_length, data_block_length, CUFFT_Z2Z);
	if( cufft_result != CUFFT_SUCCESS ) {
		printf( "CUFFT Error (%s)\n", cufftGetErrorString( cufft_result ) );
	}

	return 0;
}

Whatever value data_block_length has (even 0 or 1), and whether I call cufftPlan1d or cufftPlan2d, this code always returns CUFFT_INTERNAL_ERROR.

My compile command:

$ nvcc -o test test.cu -lcufft

I also checked which libraries the binary links against:

$ ldd ./test

and the result is:

linux-vdso.so.1 (0x00007fff25fdb000)
libcufft.so.10 => /usr/lib/x86_64-linux-gnu/libcufft.so.10 (0x000079e10f200000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x000079e10ee00000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x000079e10ea00000)
/lib64/ld-linux-x86-64.so.2 (0x000079e117e3e000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x000079e117d91000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x000079e117ca8000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x000079e117ca3000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x000079e117c9e000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000079e10f1e0000)

The output of nvidia-smi:

Sat Jun 29 10:29:14 2024       
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P0             588W /  80W |      9MiB /  8188MiB |      0%      Default |
|                                         |                      |                  N/A |
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|    0   N/A  N/A      1071      G   /usr/lib/xorg/Xorg                            4MiB |

The output of DeviceQuery:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
  CUDA Driver Version / Runtime Version          12.3 / 11.5
  CUDA Capability Major/Minor version number:    8.9
  Total amount of global memory:                 7940 MBytes (8325824512 bytes)
  (036) Multiprocessors, (128) CUDA Cores/MP:    4608 CUDA Cores
  GPU Max Clock rate:                            2175 MHz (2.17 GHz)
  Memory Clock rate:                             8001 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 33554432 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 11.5, NumDevs = 1
Result = PASS

I tried reinstalling libcufft10 and libcufft, and reinstalling nvidia-cuda-toolkit, but neither helped.

So, what’s going wrong? How can I fix it?

In addition, my system is Ubuntu 22.04 LTS, the g++ version is 11.4.0, and the nvcc version is V11.5.119.

The minimum recommended CUDA version for use with Ada GPUs (your RTX4070 is Ada generation) is CUDA 11.8.

I don’t have any trouble compiling and running the code you provided on CUDA 12.2 on an Ada-generation GPU (L4) on Linux.

My CUDA version is 12.3. So should I update my nvcc to a higher version? How could I go a step further with this error? Could I get more useful information?

Your CUDA driver version is 12.3. Your CUDA runtime version is 11.5, as your deviceQuery output shows ("CUDA Driver Version / Runtime Version 12.3 / 11.5").

The minimum recommended CUDA runtime version for use with Ada GPUs (your RTX4070 is Ada generation) is CUDA 11.8. Likewise, the minimum recommended CUDA driver version for use with Ada GPUs is also 11.8. Your driver version is sufficient. Your runtime version is not.
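The comparison above can be sketched in a few lines of host-side C++. This is only an illustration, not part of the CUDA toolkit: the names CudaVersion, decodeCudaVersion, meetsMinimum and reportVersion are hypothetical. It assumes the integer encoding used by cudaRuntimeGetVersion and cudaDriverGetVersion (1000 * major + 10 * minor, so 11.5 is 11050 and 12.3 is 12030) and takes the 11.8 minimum for Ada from this thread.

```cpp
#include <cstdio>

// Hypothetical helper type: a CUDA version split into its two parts.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// CUDA reports versions as 1000 * major + 10 * minor (e.g. 12.3 -> 12030).
inline CudaVersion decodeCudaVersion(int encoded) {
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// True when the encoded version is at least min_major.min_minor.
inline bool meetsMinimum(int encoded, int min_major, int min_minor) {
    const CudaVersion v = decodeCudaVersion(encoded);
    return v.major_version > min_major ||
           (v.major_version == min_major && v.minor_version >= min_minor);
}

// Prints a one-line verdict for a runtime or driver version.
inline void reportVersion(const char *label, int encoded,
                          int min_major, int min_minor) {
    const CudaVersion v = decodeCudaVersion(encoded);
    std::printf("%s %d.%d: %s (need >= %d.%d)\n", label,
                v.major_version, v.minor_version,
                meetsMinimum(encoded, min_major, min_minor) ? "OK" : "too old",
                min_major, min_minor);
}
```

In a .cu file the inputs would come from cudaRuntimeGetVersion(&v) and cudaDriverGetVersion(&v). With the values from this thread, meetsMinimum(11050, 11, 8) is false for the runtime and meetsMinimum(12030, 11, 8) is true for the driver.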

I don’t know of a way to go a step further or get more useful information. My suggestion is to update CUDA to the necessary level. This is a common requirement for CUDA GPUs. Each compute capability will have a minimum recommended CUDA version for use with it (both runtime and driver).

I won’t be able to offer further advice here. Good luck!

Thank you for your help.
