CUFFT_INTERNAL_ERROR from cufftPlan1d in MPI code causes unkillable MPI processes

I’m testing an MPICH application on a single P100 with 16 GB of memory and CUDA 11.8, and I’m running into an error I don’t understand. I’m testing with 16 ranks, where each rank calls cufftPlan1d(&plan, 512, CUFFT_Z2Z, 16384). When I run this, the majority of the ranks get CUFFT_INTERNAL_ERROR back from cufftPlan1d, and even though MPI_Abort is called, all the processes hang and cannot be killed.

Everything is fine with 16 ranks and cufftPlan1d(&plan, 256, CUFFT_Z2Z, 4096), and with 8 ranks and cufftPlan1d(&plan, 512, CUFFT_Z2Z, 32768), so I don’t think the problem is running out of memory (also, (16 ranks) × (512 elements × 8 bytes × 2 for complex × 2 for input and output) × (16384 transforms) is only around 4 GB, and there is no other GPU memory allocated).
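
Spelled out, that estimate for the failing configuration is:

512 elements × 16 bytes (double complex) × 2 (input + output) × 16384 transforms ≈ 268 MB per rank
268 MB × 16 ranks ≈ 4.3 GB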

First, I don’t understand why I’m getting CUFFT_INTERNAL_ERROR, since there should be enough memory and 512 is a power of 2. Second, I don’t understand why the processes hang afterward even though MPI_Abort is called (although I suppose this may be an MPICH issue rather than a CUDA issue).

A snippet from the code is below. Here GPUPlanManager is an object that stores plan handles along with information about the transform size, number of transforms, and the type. find_plan will find the corresponding plan, or create one if it doesn’t exist.
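
For context, the cache inside the manager is essentially a fixed-size array of entries like the following. This is a simplified sketch; apart from GPUPlanManager and find_plan, the names and the cache capacity are illustrative, not the real declarations.

// Simplified sketch of the plan cache (entry name and capacity are illustrative).
#include <cufft.h>

struct PlanCacheEntry {
    bool valid;        // slot in use?
    int ng;            // transform length
    int nFFTs;         // number of transforms in the batch
    cufftType t;       // e.g. CUFFT_Z2Z
    cufftHandle plan;  // handle created by cufftPlan1d
};

class GPUPlanManager {
public:
    cufftHandle find_plan(int ng, int nFFTs, cufftType t);
private:
    static const int N_FFT_CACHE = 16;   // cache capacity (value assumed)
    PlanCacheEntry plans[N_FFT_CACHE];
};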

cufftHandle GPUPlanManager::find_plan(int ng, int nFFTs, cufftType t){
    // Scan the fixed-size cache for a matching plan; if none is found,
    // create one in the first unused slot.
    for (int i = 0; i < N_FFT_CACHE; i++){
        if (plans[i].valid){
            if ((plans[i].ng == ng) && (plans[i].nFFTs == nFFTs) && (plans[i].t == t)){
                return plans[i].plan;
            }
        } else {
            // First unused slot: create the plan and cache it here
            plans[i].valid = true;
            plans[i].ng = ng;
            plans[i].nFFTs = nFFTs;
            plans[i].t = t;
            cufftResult_t err = cufftPlan1d(&plans[i].plan, ng, t, nFFTs);
            if (err != CUFFT_SUCCESS){
                printf("CUFFT error: Plan creation failed with %s (ng = %d, nFFTs = %d)\n", _cudaGetErrorEnum(err), ng, nFFTs);
                MPI_Abort(MPI_COMM_WORLD, err);
            }
            return plans[i].plan;
        }
    }
    printf("Out of space for plans!\n");
    MPI_Abort(MPI_COMM_WORLD,1)
}
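
Stripped of the plan cache, what each rank effectively does boils down to something like the following (a simplified reconstruction, not the actual application code):

// Stripped-down sketch of the per-rank behaviour (simplified reconstruction).
#include <mpi.h>
#include <cufft.h>
#include <cstdio>

int main(int argc, char **argv){
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // All 16 ranks share the single P100, so each rank gets its own CUDA
    // context on device 0.
    cufftHandle plan;
    cufftResult_t err = cufftPlan1d(&plan, 512, CUFFT_Z2Z, 16384);
    if (err != CUFFT_SUCCESS){
        printf("rank %d: cufftPlan1d failed with error %d\n", rank, (int)err);
        MPI_Abort(MPI_COMM_WORLD, err);
    }

    cufftDestroy(plan);
    MPI_Finalize();
    return 0;
}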

I would investigate memory first. Unless you are using MPS, each rank is going to create its own context on the GPU, and each context uses substantial memory (possibly several hundred megabytes per rank); with 16 ranks, that alone can easily be several GB before any plans or buffers are allocated. In addition, each plan generally consumes memory for temporary work space, independent of the input and output buffers (and you appear to have the potential for many plans per rank). Do a cudaMemGetInfo() right before the MPI_Abort and print the result, so you can see how much free memory is actually left.
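
For example, something along these lines in the failure branch (just a sketch; err is the cufftResult_t from the plan call, and this needs cuda_runtime.h):

// Sketch: report free/total device memory before aborting
size_t free_bytes = 0, total_bytes = 0;
if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess){
    printf("GPU memory free/total: %zu / %zu bytes\n", free_bytes, total_bytes);
}
MPI_Abort(MPI_COMM_WORLD, err);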

Any time I have had a problem like this without an immediate resolution, I would also try it on the latest available CUDA version.