I’m testing an MPICH application on a single P100 with 16 GB of memory and CUDA 11.8, and I’m running into an error I don’t understand. I’m testing with 16 ranks, where each rank calls cufftPlan1d(&plan, 512, CUFFT_Z2Z, 16384). When this happens, the majority of the ranks return CUFFT_INTERNAL_ERROR, and even though MPI_Abort is called, all the processes hang and cannot be killed.
Everything is fine with 16 ranks and cufftPlan1d(&plan, 256, CUFFT_Z2Z, 4096), and with 8 ranks and cufftPlan1d(&plan, 512, CUFFT_Z2Z, 32678), so I don’t think I’m running out of memory. Also, (16 ranks) * (512 * 8 * 2 * 2 bytes) * (16384 transforms) is only around 4 GB, and no other GPU memory is allocated.
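The memory estimate above can be checked with a few lines of plain C++. This is only a sketch of the arithmetic: reading the factors as 8 bytes per double, 2 doubles per complex value, and 2 buffers per transform is an assumption, and estimate_bytes is a hypothetical helper, not part of cuFFT. It counts data buffers only and says nothing about what cuFFT or the CUDA runtime allocate internally.

```cpp
#include <cstdint>

// Hypothetical helper (not part of cuFFT): bytes needed for the data of
// `ranks` processes, each holding `batch` double-complex transforms of
// length `n`, with `buffers` copies per transform (assumed to be 2).
constexpr std::uint64_t estimate_bytes(std::uint64_t ranks,
                                       std::uint64_t n,
                                       std::uint64_t batch,
                                       std::uint64_t buffers = 2) {
    constexpr std::uint64_t bytes_per_double = 8;
    constexpr std::uint64_t doubles_per_complex = 2;  // one CUFFT_Z2Z element
    return ranks * n * bytes_per_double * doubles_per_complex * buffers * batch;
}

// Failing case: 16 ranks, n = 512, 16384 transforms per rank.
constexpr std::uint64_t failing_case = estimate_bytes(16, 512, 16384);

// Working case: 16 ranks, n = 256, 4096 transforms per rank.
constexpr std::uint64_t working_case = estimate_bytes(16, 256, 4096);
```

Under these assumptions, failing_case is 4294967296 bytes (4 GiB) and working_case is 536870912 bytes (512 MiB).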
First, I don’t understand why I’m getting a CUFFT_INTERNAL_ERROR, as there should be enough memory and 512 is a power of 2. Second, I don’t understand why the processes hang afterward even though MPI_Abort is called (although I suppose this may be an MPICH issue rather than a CUDA issue).
A snippet from the code is below. Here GPUPlanManager is an object that stores plan handles along with information about the transform size, number of transforms, and the type. find_plan will find the corresponding plan, or create one if it doesn’t exist.
cufftHandle GPUPlanManager::find_plan(int ng, int nFFTs, cufftType t){
    for (int i = 0; i < N_FFT_CACHE; i++){
        if (plans[i].valid){
            // Reuse a cached plan if size, batch count, and type all match.
            if ((plans[i].ng == ng) && (plans[i].nFFTs == nFFTs) && (plans[i].t == t)){
                return plans[i].plan;
            }
        } else {
            // First free slot: record the parameters and create the plan.
            plans[i].valid = true;
            plans[i].ng = ng;
            plans[i].nFFTs = nFFTs;
            plans[i].t = t;
            cufftResult_t err = cufftPlan1d(&plans[i].plan, ng, t, nFFTs);
            if (err != CUFFT_SUCCESS){
                printf("CUFFT error: Plan creation failed with %s (ng = %d, nFFTs = %d)\n", _cudaGetErrorEnum(err), ng, nFFTs);
                MPI_Abort(MPI_COMM_WORLD, err);
            }
            return plans[i].plan;
        }
    }
    printf("Out of space for plans!\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
    return 0; // not expected to be reached; avoids falling off the end of a non-void function
}
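For anyone who wants to exercise the lookup logic without a GPU or MPI, here is a minimal stand-alone sketch of the same find-or-create pattern. Everything in it is a stand-in: PlanCache, Entry, fake_create, and max_batch are hypothetical names, plain ints replace cufftHandle and cufftType, and fake_create only mimics a plan-creation call that returns 0 on success. Unlike the snippet above, it marks a slot valid only after creation succeeds and returns a status to the caller instead of aborting.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One cache slot; plain ints stand in for cufftHandle and cufftType.
struct Entry {
    bool valid = false;
    int ng = 0;
    int nFFTs = 0;
    int type = 0;
    int handle = -1;
};

// Hypothetical find-or-create cache with N slots.
template <std::size_t N>
class PlanCache {
public:
    // `max_batch` is an invented knob: fake_create fails above it.
    explicit PlanCache(int max_batch) : max_batch_(max_batch) {}

    // Returns 0 and writes the handle to *out on success,
    // 1 if creation failed, -1 if every slot is already in use.
    int find_plan(int ng, int nFFTs, int type, int* out) {
        for (Entry& e : entries_) {
            if (e.valid) {
                if (e.ng == ng && e.nFFTs == nFFTs && e.type == type) {
                    *out = e.handle;  // cache hit
                    return 0;
                }
                continue;  // occupied by a different plan
            }
            // First free slot: create the plan here.
            int handle = -1;
            if (fake_create(&handle, nFFTs) != 0) {
                return 1;  // the slot is left invalid
            }
            e.valid = true;
            e.ng = ng;
            e.nFFTs = nFFTs;
            e.type = type;
            e.handle = handle;
            *out = handle;
            return 0;
        }
        return -1;
    }

    // Number of plans created so far.
    int created() const { return next_handle_; }

private:
    // Stand-in for the real plan-creation call; 0 means success.
    int fake_create(int* handle, int nFFTs) {
        if (nFFTs > max_batch_) return 1;
        *handle = next_handle_++;
        return 0;
    }

    std::array<Entry, N> entries_{};
    int max_batch_;
    int next_handle_ = 0;
};
```

A caller would construct something like PlanCache<8> cache(8192), call cache.find_plan(512, 4096, 0, &h), and decide what to do with a nonzero return value, which keeps the abort decision out of the lookup routine.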