I am currently implementing an interface between Kokkos and FFT.
While trying cuFFTMp in a small example on an A100 supercomputer, I encountered a scalability issue. The performance is measured using Kokkos-tools. This mini-app calls 12 inverse FFTs (C2R) and 3 forward FFTs (R2C), repeated over 4 Runge-Kutta steps, in a single time step. The following tables show the average elapsed time per time step for the FFT-related operations.
Performance with our implementation using MPI

| # MPI | Architecture | All2all [s] | Unpack [s] | Pack [s] | FFT [s] | Transpose [s] | Normalize [s] |
|---|---|---|---|---|---|---|---|
| 16 | A100 (MPI) | 1.4119186 | 0.1384448 | 0.1340588 | 0.1574669 | 0.0641185 | 0.0378279 |
| 16 | H200 (MPI) | 1.3911636 | 0.0313026 | 0.0352117 | 0.0642570 | 0.0240311 | 0.0158418 |
| 32 | A100 (MPI) | 1.6854427 | 0.2153435 | 0.2045072 | 0.1917983 | 0.0685456 | 0.0208990 |
| 32 | H200 (MPI) | 0.7442848 | 0.0137397 | 0.0155462 | 0.0334599 | 0.0119456 | 0.0081651 |
| 64 | A100 (MPI) | 0.9532557 | 0.0728108 | 0.0686664 | 0.0985339 | 0.0322849 | 0.0107661 |
| 64 | H200 (MPI) | 0.8290491 | 0.0067909 | 0.0076418 | 0.0178711 | 0.0062001 | 0.0043301 |

Performance with cuFFTMp backend

| # MPI | Architecture | cufftXtExecDescriptor [s] | DeepCopy [s] | Normalize [s] |
|---|---|---|---|---|
| 16 | A100 (cuFFTMp) | 5.5505526 | 0.0619220 | 0.0377528 |
| 16 | H200 (cuFFTMp) | 1.2969191 | 0.0384167 | 0.0157858 |
| 32 | A100 (cuFFTMp) | 6.1734890 | 0.0278820 | 0.0189375 |
| 32 | H200 (cuFFTMp) | 0.6746436 | 0.0195230 | 0.0081156 |
| 64 | A100 (cuFFTMp) | 6.7523435 | 0.0143644 | 0.0093793 |
| 64 | H200 (cuFFTMp) | 0.5940724 | 0.0100590 | 0.0042695 |

cuFFTMp scales nicely on the H200 platform with 1 GPU per node, but does not scale on the A100 platform with 8 GPUs per node.
I would like to know how to resolve the performance issue on the A100 platform.
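To make the scaling claim concrete, here is a minimal sketch that computes the strong-scaling speedup of cufftXtExecDescriptor relative to the 16-process run, using only the numbers from the cuFFTMp table above. The helper name `speedup` is just for illustration.

```python
# Elapsed time [s] of cufftXtExecDescriptor per time step,
# copied from the cuFFTMp table (keys are the number of MPI processes).
cufftmp_time = {
    "A100": {16: 5.5505526, 32: 6.1734890, 64: 6.7523435},
    "H200": {16: 1.2969191, 32: 0.6746436, 64: 0.5940724},
}


def speedup(times, base=16):
    """Strong-scaling speedup relative to the run with `base` processes."""
    return {n: times[base] / t for n, t in sorted(times.items())}


results = {arch: speedup(times) for arch, times in cufftmp_time.items()}

for arch, s in results.items():
    print(arch, {n: round(v, 2) for n, v in s.items()})
```

A speedup below 1 means the run got slower with more processes, which is what the A100 numbers show, while the H200 run is roughly twice as fast at 64 processes as at 16.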
Comments or suggestions are highly appreciated.
We measured the timings in the following way.
Environment: nvhpc 25.9, nvcc 12.9.86
Get the source code:
git clone --recursive git@github.com:yasahi-hpc/distributed-FFT-for-kokkos.git
Hi, please file a bug by following "How to report a bug", since this may need more internal engineering resources and a longer cycle to discuss. We will interact with you in the bug ticket. Quoting this forum link in the bug description is fine, to avoid duplicating the write-up. We just need the bug ticket as a channel for direct communication with you. Thanks.
We are sharing the conclusion of ticket 5926003 here.
Initially, the intra-node scaling with Kokkos checked out fine internally, but we do not have an inter-node A100 cluster to check on. We therefore analysed the nsys reports.
There are two ways to collect Nsight Systems reports on clusters: nsys profile mpiexec vs. mpiexec nsys profile. The former appears to work only for intra-node profiling; the latter is for inter-node profiling.
To avoid nsys hooking into mpiexec, which does not directly launch any CUDA kernel, one can generate a report for each rank with -o /path/to/output/report_%q{OMPI_COMM_WORLD_RANK}.
After investigating the reports, we saw that NVSHMEM barrier/sync kernels dominate GPU time in multi-node runs. We suspected that core binding was the culprit.
Since cuFFTMp uses NVSHMEM, and NVSHMEM spawns a proxy thread, the user should ensure that every process has exclusive access to at least two CPU cores.
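The "at least two cores per process" requirement can be checked from inside each process before the run starts. The following is a minimal sketch, not part of cuFFTMp or NVSHMEM: it assumes Linux (where `os.sched_getaffinity` is available) and an Open MPI launcher that sets `OMPI_COMM_WORLD_RANK`. The function names are hypothetical.

```python
import os


def available_cores():
    """Return the CPU cores this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return sorted(os.sched_getaffinity(0))
    # Fallback for other platforms: no affinity information available.
    return list(range(os.cpu_count() or 1))


def check_core_budget(min_cores=2):
    """Warn if the process is bound to fewer than `min_cores` cores."""
    cores = available_cores()
    ok = len(cores) >= min_cores
    if not ok:
        # Rank variable as set by Open MPI; "?" if launched differently.
        rank = os.environ.get("OMPI_COMM_WORLD_RANK", "?")
        print(f"warning: rank {rank} can use only {len(cores)} core(s); "
              f"the application thread and a proxy thread may share a core")
    return ok, cores


if __name__ == "__main__":
    check_core_budget()
```

Running this under the same launcher options as the application shows whether the binding policy leaves each rank with a single core.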
The core binding/affinity should be visible from the NVSHMEM traces.
We asked the customer to try adding the option --bind-to none to the MPI launcher, e.g. mpirun --bind-to none -np 16 build_gpu/examples/navier-stokes-MPI/navier-stokes-MPI -px 16 -Re 1600 -dt 0.001 -nx 1024 -nbiter 10 -suppress_diag true, and to provide NVSHMEM traces as well.
Based on the --bind-to none reports, our cuFFT engineer concluded the following.
=======
cuFFTMp uses non-blocking RMA (nvshmem_TYPE_put_nbi) for inter-node comms, so the long barrier you observe is mostly just waiting on IB messages to complete. A fair comparison would be comparing intra-node (reshape_inplace_rs kernel + barrier) runtime vs internode (reshape_inplace_rs kernel + barrier) runtime. The ratio should be roughly inversely proportional to NVLink vs IB BW ratio for the same message size.
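The bandwidth argument above can be sketched as a simple estimate. The bandwidth values below are placeholders chosen only for illustration, not measured or specified values for either cluster, and the model ignores latency and overlap.

```python
def transfer_time(message_bytes, bandwidth_bytes_per_s):
    """Idealised transfer time: message size divided by bandwidth."""
    return message_bytes / bandwidth_bytes_per_s


def expected_slowdown(intra_node_bw, inter_node_bw):
    """Expected inter-node / intra-node time ratio for the same message size."""
    return intra_node_bw / inter_node_bw


# Placeholder bandwidths in bytes/s (assumptions, replace with real values).
NVLINK_BW = 300e9
IB_BW = 25e9

message_bytes = 64 * 1024**2  # hypothetical message size
ratio = transfer_time(message_bytes, IB_BW) / transfer_time(message_bytes, NVLINK_BW)

print(f"estimated inter-node / intra-node time ratio: {ratio:.1f}")
```

The message size cancels out in the ratio, so under this model only the two bandwidths matter when comparing the intra-node and inter-node (reshape_inplace_rs kernel + barrier) runtimes.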
To answer the customer’s question, we took a look at the logs and nsys reports shared. We can confirm the behavior is expected and the issue is resolved.
Evidence 1:
Before the binding change: we noticed CPU utilization is staggered between the main thread and the NVSHMEM proxy thread, which indicates the two threads are likely using the same CPU core. So when the physical core is busy with the application thread, the proxy thread that is in charge of comms is not available and sync has to wait longer. There is also only a single core available under each CPU (in the CPUs category) even though it is labeled CPU(72).
After the binding change: I now see full CPU utilization for both the application thread and the proxy thread. It also looks like all cores are available under each CPU. The two threads are on separate cores. On the GPU side, sync time is almost negligible, with 100% CPU utilization for both threads.
From the NVSHMEM logs, we can see all cores available for each PE. E.g., wa13:332:332 [0] NVSHMEM INFO PE 0 (process) affinity to 72 CPUs: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
We actually wanted to ask the customer to share two NVSHMEM logs for before vs after. I wasn’t clear in the instruction and the customer only shared the one after changing bindings. But I think the customer would see something like the following with the old commands where only one core is available. wa13:332:332 [0] NVSHMEM INFO PE 0 (process) affinity to 72 CPUs: 0
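To compare the before and after logs automatically, the affinity lines can be parsed and any PE with too few cores flagged. This is a minimal sketch that assumes the log format matches the example lines quoted above; the function names are hypothetical. It counts the listed core IDs rather than trusting the "72 CPUs" figure, since that figure appears unchanged in the single-core example.

```python
import re

# Pattern based on the NVSHMEM INFO lines quoted in this thread.
AFFINITY_RE = re.compile(
    r"NVSHMEM INFO PE (\d+) \(process\) affinity to (\d+) CPUs: ([\d ]+)"
)


def parse_affinity(line):
    """Return (pe, list_of_core_ids) for an affinity line, else None."""
    match = AFFINITY_RE.search(line)
    if match is None:
        return None
    pe = int(match.group(1))
    cores = [int(c) for c in match.group(3).split()]
    return pe, cores


def pes_with_too_few_cores(lines, min_cores=2):
    """Return the PEs whose listed core set is smaller than `min_cores`."""
    flagged = []
    for line in lines:
        parsed = parse_affinity(line)
        if parsed is not None and len(parsed[1]) < min_cores:
            flagged.append(parsed[0])
    return flagged


log = [
    "wa13:332:332 [0] NVSHMEM INFO PE 0 (process) affinity to 72 CPUs: 0",
    "wa13:333:333 [1] NVSHMEM INFO PE 1 (process) affinity to 72 CPUs: 0 1 2 3",
]
print(pes_with_too_few_cores(log))
```

The second log line is made up for the example; only the first one follows a line quoted in this thread.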
===========
As we can see, the root cause is improper CPU core binding. The default MPI launcher was binding each process to a single CPU core, causing the NVSHMEM proxy thread (responsible for inter-node communication) to compete with the application thread for the same core. This resulted in serialized execution and abnormally long NVSHMEM barrier/sync times (~32 ms inter-node vs. 10 μs intra-node). The fix was simply adding --bind-to none to the mpirun command, ensuring each process had access to multiple CPU cores. This is not a cuFFTMp bug but a user configuration issue. NVIDIA committed to improving documentation around NVSHMEM CPU affinity requirements for cuFFTMp. On our end, we will work on how to best produce a warning for users to avoid this + add --bind-to none explicitly to our doc on NVSHMEM bindings.