I am developing a library that performs 2D and 3D FFTs using MPI. Recently I started testing this library with cuFFT as the device-side 1D FFT executor. The library uses MPI derived datatypes to send and receive aligned data.
I noticed that 99.99% of the run time on the GPU is spent in a single MPI_Alltoall call. When I profiled the execution, it showed more than 2 million calls to MemCpy (HtoD).
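To give a sense of scale, here is a rough sketch in C of how many contiguous blocks a strided derived datatype can describe in one all-to-all exchange. This is only a model, not code from my library: the function names and all sizes are hypothetical, and it does not call MPI at all. Whether the MPI implementation really issues one copy per contiguous block is an assumption on my part that the profile does not confirm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not an MPI API: number of contiguous blocks in a
 * layout equivalent to MPI_Type_vector(count, blocklength, stride, ...).
 * When stride == blocklength the blocks are adjacent and merge into one. */
size_t contiguous_blocks(size_t count, size_t blocklength, size_t stride)
{
    assert(stride >= blocklength); /* overlapping blocks are not modeled */
    if (count == 0 || blocklength == 0) {
        return 0;
    }
    return (stride == blocklength) ? 1 : count;
}

/* Blocks one rank touches in one all-to-all exchange, assuming it sends one
 * datatype instance to, and receives one from, each of nranks peers. */
size_t blocks_per_alltoall(size_t nranks, size_t count,
                           size_t blocklength, size_t stride)
{
    return 2 * nranks * contiguous_blocks(count, blocklength, stride);
}
```

With made-up numbers, blocks_per_alltoall(4, 65536, 64, 256) gives 524288 blocks for a single call, so if each block were copied separately the number of small copies would grow very quickly.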
The profile of a single rank can be found here.
Can somebody explain how this works and why this is happening?
I am using PGI 20.4.