I was trying to use cuFFT to accelerate a conv2d with the code below:
cufftResult planResult = cufftPlan2d(&data_plan[idx_n*c + idx_c], Nh, Nw, CUFFT_Z2Z);
if (planResult != CUFFT_SUCCESS) {
    printf("CUFFT plan creation failed: %d\n", planResult);
    // Handle the error appropriately
}
cufftSetStream(data_plan[idx_n*c + idx_c], stream_data[idx_n*c + idx_c]);
The plan creation failed with CUFFT_INTERNAL_ERROR. After that, I tried the simpler test program below:
#include <iostream>
#include <cstdio>   // printf
#include <cstdlib>  // exit
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cuda_device_runtime_api.h>
#include <cufft.h>
#ifdef _CUFFT_H_
static const char *cufftGetErrorString( cufftResult cufft_error_type ) {
    switch( cufft_error_type ) {
    case CUFFT_SUCCESS:
        return "CUFFT_SUCCESS: The CUFFT operation was performed";
    case CUFFT_INVALID_PLAN:
        return "CUFFT_INVALID_PLAN: The CUFFT plan to execute is invalid";
    case CUFFT_ALLOC_FAILED:
        return "CUFFT_ALLOC_FAILED: The allocation of data for CUFFT in memory failed";
    case CUFFT_INVALID_TYPE:
        return "CUFFT_INVALID_TYPE: The data type used by CUFFT is invalid";
    case CUFFT_INVALID_VALUE:
        return "CUFFT_INVALID_VALUE: The data value used by CUFFT is invalid";
    case CUFFT_INTERNAL_ERROR:
        return "CUFFT_INTERNAL_ERROR: An internal error occurred in CUFFT";
    case CUFFT_EXEC_FAILED:
        return "CUFFT_EXEC_FAILED: The execution of a plan by CUFFT failed";
    case CUFFT_SETUP_FAILED:
        return "CUFFT_SETUP_FAILED: The setup of CUFFT failed";
    case CUFFT_INVALID_SIZE:
        return "CUFFT_INVALID_SIZE: The size of the data to be used by CUFFT is invalid";
    case CUFFT_UNALIGNED_DATA:
        return "CUFFT_UNALIGNED_DATA: The data to be used by CUFFT is unaligned in memory";
    }
    return "Unknown CUFFT Error";
}
#endif
#define BATCH 1
int main() {
    // unsigned long int data_block_length = 50397139;
    unsigned long int data_block_length = 1024;
    cufftResult cufft_result;
    cufftHandle plan;
    // cufft_result = cufftPlan1d(&plan, data_block_length, CUFFT_Z2Z, BATCH );
    cufft_result = cufftPlan2d(&plan, data_block_length, data_block_length, CUFFT_Z2Z);
    if( cufft_result != CUFFT_SUCCESS ) {
        printf( "CUFFT Error (%s)\n", cufftGetErrorString( cufft_result ) );
        exit(-1);
    }
    return 0;
}
Whatever value data_block_length has (even 0 or 1), and whether I call cufftPlan1d or cufftPlan2d, this code always returns CUFFT_INTERNAL_ERROR.
My compile command:
$ nvcc -o test test.cu -lcufft
I also ran:
$ ldd ./test
which gave:
linux-vdso.so.1 (0x00007fff25fdb000)
libcufft.so.10 => /usr/lib/x86_64-linux-gnu/libcufft.so.10 (0x000079e10f200000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x000079e10ee00000)
libc.so.6 => /usr/lib/x86_64-linux-gnu/libc.so.6 (0x000079e10ea00000)
/lib64/ld-linux-x86-64.so.2 (0x000079e117e3e000)
libdl.so.2 => /usr/lib/x86_64-linux-gnu/libdl.so.2 (0x000079e117d91000)
libm.so.6 => /usr/lib/x86_64-linux-gnu/libm.so.6 (0x000079e117ca8000)
libpthread.so.0 => /usr/lib/x86_64-linux-gnu/libpthread.so.0 (0x000079e117ca3000)
librt.so.1 => /usr/lib/x86_64-linux-gnu/librt.so.1 (0x000079e117c9e000)
libgcc_s.so.1 => /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 (0x000079e10f1e0000)
The output of nvidia-smi:
Sat Jun 29 10:29:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 ... Off | 00000000:01:00.0 Off | N/A |
| N/A 37C P0 588W / 80W | 9MiB / 8188MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1071 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
The output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 4070 Laptop GPU"
CUDA Driver Version / Runtime Version 12.3 / 11.5
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 7940 MBytes (8325824512 bytes)
(036) Multiprocessors, (128) CUDA Cores/MP: 4608 CUDA Cores
GPU Max Clock rate: 2175 MHz (2.17 GHz)
Memory Clock rate: 8001 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 33554432 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.3, CUDA Runtime Version = 11.5, NumDevs = 1
Result = PASS
I tried reinstalling libcufft10 and libcufft, and reinstalling nvidia-cuda-toolkit, but none of these fixed it.
So, what’s going wrong? How can I fix it?