MPI_Waitall profiled multiple times

Hi,

I have noticed that MPI_Waitall is listed several times with the same timestamp in the MPI event trace. To reproduce this, consider the following code:

#include <mpi.h>

int main (int argc, char *argv[]) {
        int rank, n_ranks;
        int n_per_rank = 100;
        MPI_Request requests[4];
        MPI_Status status[4];
        int buffer[400];        /* sized for up to 4 ranks */

        MPI_Init (NULL, NULL);
        MPI_Comm_size (MPI_COMM_WORLD, &n_ranks);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);

        if (rank == 0) {
                for (int i = 0; i < n_ranks * n_per_rank; i++) {
                        buffer[i] = i / n_per_rank;
                }
                /* One non-blocking send per receiving rank ... */
                for (int i = 0; i < n_ranks - 1; i++) {
                        MPI_Isend(&buffer[i * n_per_rank], n_per_rank, MPI_INT, i + 1, 123, MPI_COMM_WORLD, &requests[i]);
                }
                /* ... followed by a single MPI_Waitall on the n_ranks-1 open requests. */
                MPI_Waitall (n_ranks - 1, requests, status);
        } else {
                MPI_Recv(buffer, n_per_rank, MPI_INT, 0, 123, MPI_COMM_WORLD, &status[0]);
        }
        MPI_Finalize();
        return 0;
}

I compile it with nvc 24.0.1, and the MPI library is Open MPI 4.1.7. I run it with mpirun -np 4 nsys profile -t mpi -o report_%q{OMPI_COMM_WORLD_RANK} ./test.x and then generate the report for rank 0 with nsys export -t sqlite report_0.nsys-rep; nsys stats -r mpi_event_trace report_0.sqlite. This gives the following three entries in the event table:

 5,458,412,790  5,458,414,107          1,317  MPI_Waitall   214,012  214,012          0                                                                     
 5,458,412,790  5,458,414,107          1,317  MPI_Waitall   214,012  214,012          0                                                                     
 5,458,412,790  5,458,414,107          1,317  MPI_Waitall   214,012  214,012          0      

I assume that this is due to the three open requests generated by MPI_Isend beforehand. Is my understanding correct?
The problem arises when I want to sum up all the MPI runtimes in order to get the total runtime for each MPI function. Is there a way to avoid the repeated listing of such functions?

@rdietrich can you give a hand with this one?

Hi Christian!

The exported SQLite database table MPI_START_WAIT_EVENTS will contain one MPI_Waitall row for each MPI request the call was waiting on. I see that this doesn’t make a lot of sense for the nsys stats output, since that output does not even show the requests.
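
You can see this layout directly in the database: counting the rows that share a thread and a start/end timestamp gives one row per request (three for the MPI_Waitall in your example). A query along these lines should show it (this is only an illustrative sketch against the columns used in the query below, not an official report):

SELECT StringIds.value AS name, start, "end", COUNT(*) AS requests
  FROM MPI_START_WAIT_EVENTS
  INNER JOIN StringIds ON StringIds.id = MPI_START_WAIT_EVENTS.textId
  GROUP BY MPI_START_WAIT_EVENTS.globalTid, start, "end", StringIds.value
  ORDER BY start;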

For now you can directly query the MPI_START_WAIT_EVENTS database table, e.g. with

SELECT DISTINCT MPI_RANKS.rank, start, end, StringIds.value
  FROM MPI_START_WAIT_EVENTS
  INNER JOIN StringIds ON StringIds.id = MPI_START_WAIT_EVENTS.textId
  INNER JOIN MPI_RANKS ON MPI_RANKS.globalTid = MPI_START_WAIT_EVENTS.globalTid
  ORDER BY MPI_START_WAIT_EVENTS.start;

or add “DISTINCT” after SELECT in the stats script host-linux-x64/reports/mpi_event_trace.py (in your Nsight Systems installation path) on line 51.
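
If your end goal is the total runtime per MPI function, the same DISTINCT trick can be folded into an aggregate query. The following is only a sketch for the MPI_START_WAIT_EVENTS table discussed here, reusing the column names from the query above ("end" is quoted because END is an SQL keyword); the other MPI event tables in the export would have to be added for a complete per-function summary:

SELECT StringIds.value AS name,
       COUNT(*) AS calls,
       SUM(e."end" - e.start) AS total_ns
  FROM (SELECT DISTINCT globalTid, textId, start, "end"
          FROM MPI_START_WAIT_EVENTS) AS e
  INNER JOIN StringIds ON StringIds.id = e.textId
  GROUP BY StringIds.value
  ORDER BY total_ns DESC;

The DISTINCT in the subquery collapses the per-request duplicates, so each MPI_Waitall call is counted and summed only once.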

I also opened a ticket to improve/fix the nsys stats output.


Thanks for helping out here. Using the SQL database is a bit too much for my purpose, but I can get proper statistics by adding up only the first MPI_Waitall entry in the MPI event trace (identified by its timestamp).
Speaking of which: there is, for example, a CUDA API trace and a CUDA API summary, but for MPI there is only the trace. Is there a reason why a summary is missing, or could this be included as well?

The MPI API summary has been added in Nsight Systems 2024.1.1. However, it looks like we need to fix the mpi_event_sum script as well.