cuFFTMp: slow plan creation and execution on multiple nodes

Good morning!

I am writing because I am seeing a performance problem with cuFFTMp when scaling to 32, 64, and 128 Leonardo nodes (each with 1 Intel Xeon with 32 cores and 4 NVIDIA A100 64 GB GPUs with NVLink; Leonardo is currently 6th in the Top500), compared to standard MPI+OpenMP FFTW. My code essentially does 128 FFTs.
I found that the total runtimes are 100.578, 116.258, and 138.958 on 32, 64, and 128 Leonardo nodes, respectively.
Of these, plan creation takes 23.3072, 28.5442, and 39.8685, and the actual computation takes 76.5412, 86.8475, and 97.9725, respectively.
With standard FFTW, the total runtimes are 16.7602, 14.6465, and 14.132.
The HPC-SDK version I’m currently using is 23.11.
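
To make the comparison easier to read, here is a small Python sketch that derives the slowdown relative to FFTW and the share of time spent in plan creation from the figures quoted above. The variable and function names are only illustrative, and the unit of the timings is not stated in the post, so only ratios and differences in that same unknown unit are computed.

```python
# Timings copied from the post above; their unit is not stated there.
nodes = [32, 64, 128]
cufftmp_total = [100.578, 116.258, 138.958]
cufftmp_plan = [23.3072, 28.5442, 39.8685]
cufftmp_exec = [76.5412, 86.8475, 97.9725]
fftw_total = [16.7602, 14.6465, 14.132]


def summarize():
    """Return one dict per node count with the derived ratios."""
    rows = []
    for i, n in enumerate(nodes):
        rows.append({
            "nodes": n,
            # How many times slower the cuFFTMp run is than the FFTW run.
            "slowdown": cufftmp_total[i] / fftw_total[i],
            # Fraction of the cuFFTMp total spent creating plans.
            "plan_share": cufftmp_plan[i] / cufftmp_total[i],
            # Time not covered by plan creation plus computation.
            "unaccounted": cufftmp_total[i] - cufftmp_plan[i] - cufftmp_exec[i],
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(f"{row['nodes']:>4} nodes: "
              f"slowdown x{row['slowdown']:.2f}, "
              f"plan share {row['plan_share']:.1%}, "
              f"unaccounted {row['unaccounted']:.4f}")
```

On these numbers the cuFFTMp runs are roughly 6, 8, and 10 times slower than FFTW on 32, 64, and 128 nodes, and the gap grows with node count because the cuFFTMp total increases while the FFTW total decreases slightly.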

I suspect that something in the communication is causing this dramatic slowdown, and I would like to report the issue.

The FFTs are 128 simple complex-to-complex 2D Fourier transforms.

Thank you in advance,
Giovanni.

Hi,

Thanks for reaching out. We may need more details to look into this in house. Could you please follow How to report a bug and file a bug ticket? That will set up a communication channel between you and our engineering team.

In the bug report, please include as many of the following details as you can:

  1. What is the unit of the reported timings? Milliseconds?
  2. What do you mean by ‘128 FFTs’? Are they 2D or 3D?
  3. What data decomposition is used?
  4. Is the “standard FFTW” comparison run on a single node?
  5. Most importantly, we would like a self-contained reproducer so that we can check it in house.

Attachments can be sent to NVSDKIssues@nvidia.com

Best,
Yuki