CUFFT_INTERNAL_ERROR from cufftPlan1d in MPI code causes unkillable MPI processes

I’m testing an MPICH application on a single P100 with 16 GB of memory and CUDA 11.8, and I’m running into an error I don’t understand. I’m testing with 16 ranks, where each rank calls cufftPlan1d(&plan, 512, CUFFT_Z2Z, 16384). When I run this, the majority of the ranks get CUFFT_INTERNAL_ERROR back from cufftPlan1d, and even though MPI_Abort is called, all the processes hang and cannot be killed.

Everything is fine with 16 ranks and cufftPlan1d(&plan, 256, CUFFT_Z2Z, 4096), and with 8 ranks and cufftPlan1d(&plan, 512, CUFFT_Z2Z, 32768), so I don’t think the problem is running out of memory (also, (16 ranks) × (512 elements × 8 bytes × 2 for complex × 2 for input and output) × (16384 transforms) is only around 4 GB, and there is no other GPU memory allocated).
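
Spelled out, that estimate for the failing configuration is:

512 elements × 16 bytes (double complex) × 2 (input + output) × 16384 transforms ≈ 268 MB per rank
268 MB × 16 ranks ≈ 4.3 GB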

First, I don’t understand why I’m getting CUFFT_INTERNAL_ERROR, since there should be enough memory and 512 is a power of 2. Second, I don’t understand why the processes hang afterward even though MPI_Abort is called (although I suppose this may be an MPICH issue rather than a CUDA issue).

A snippet from the code is below. Here GPUPlanManager is an object that stores plan handles along with information about the transform size, number of transforms, and the type. find_plan will find the corresponding plan, or create one if it doesn’t exist.
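
For context, the cache inside the manager is essentially a fixed-size array of entries like the following. This is a simplified sketch; apart from GPUPlanManager and find_plan, the names and the cache capacity are illustrative, not the real declarations.

// Simplified sketch of the plan cache (entry name and capacity are illustrative).
#include <cufft.h>

struct PlanCacheEntry {
    bool valid;        // slot in use?
    int ng;            // transform length
    int nFFTs;         // number of transforms in the batch
    cufftType t;       // e.g. CUFFT_Z2Z
    cufftHandle plan;  // handle created by cufftPlan1d
};

class GPUPlanManager {
public:
    cufftHandle find_plan(int ng, int nFFTs, cufftType t);
private:
    static const int N_FFT_CACHE = 16;   // cache capacity (value assumed)
    PlanCacheEntry plans[N_FFT_CACHE];
};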

cufftHandle GPUPlanManager::find_plan(int ng, int nFFTs, cufftType t){
    // Scan the fixed-size cache for a matching plan; if none is found,
    // create one in the first unused slot.
    for (int i = 0; i < N_FFT_CACHE; i++){
        if (plans[i].valid){
            if ((plans[i].ng == ng) && (plans[i].nFFTs == nFFTs) && (plans[i].t == t)){
                return plans[i].plan;
            }
        } else {
            // First unused slot: create the plan and cache it here
            plans[i].valid = true;
            plans[i].ng = ng;
            plans[i].nFFTs = nFFTs;
            plans[i].t = t;
            cufftResult_t err = cufftPlan1d(&plans[i].plan, ng, t, nFFTs);
            if (err != CUFFT_SUCCESS){
                printf("CUFFT error: Plan creation failed with %s (ng = %d, nFFTs = %d)\n", _cudaGetErrorEnum(err), ng, nFFTs);
                MPI_Abort(MPI_COMM_WORLD, err);
            }
            return plans[i].plan;
        }
    }
    printf("Out of space for plans!\n");
    MPI_Abort(MPI_COMM_WORLD,1)
}
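
Stripped of the plan cache, what each rank effectively does boils down to something like the following (a simplified reconstruction, not the actual application code):

// Stripped-down sketch of the per-rank behaviour (simplified reconstruction).
#include <mpi.h>
#include <cufft.h>
#include <cstdio>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // All 16 ranks share the single P100, so each rank gets its own CUDA
    // context on device 0.
    cufftHandle plan;
    cufftResult_t err = cufftPlan1d(&plan, 512, CUFFT_Z2Z, 16384);
    if (err != CUFFT_SUCCESS){
        printf("rank %d: cufftPlan1d failed with error %d\n", rank, (int)err);
        MPI_Abort(MPI_COMM_WORLD, err);
    }

    cufftDestroy(plan);
    MPI_Finalize();
    return 0;
}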

I would investigate memory first. Unless you are using MPS, each rank is going to create its own context on the GPU, and each context uses substantial memory (possibly several hundred megabytes per rank); with 16 ranks, that alone can easily be several GB before any plans or buffers are allocated. In addition, each plan generally consumes memory for temporary work space, independent of the input and output buffers (and you appear to have the potential for many plans per rank). Do a cudaMemGetInfo() right before the MPI_Abort and print the result, so you can see how much free memory is actually left.
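
For example, something along these lines in the failure branch (just a sketch; err is the cufftResult_t from the plan call, and this needs cuda_runtime.h):

// Sketch: report free/total device memory before aborting
size_t free_bytes = 0, total_bytes = 0;
if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess){
    printf("GPU memory free/total: %zu / %zu bytes\n", free_bytes, total_bytes);
}
MPI_Abort(MPI_COMM_WORLD, err);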

Any time I have had a problem like this without an immediate resolution, I would also try it on the latest available CUDA version.