Nsys profile with MPMD (multiple program, multiple data) simulation

Hello,

I am trying to profile an MPI+OpenACC program with nsys.
I am using OpenMPI (3.1.6) from the NVIDIA HPC SDK (20.7) with UCX enabled.
There are three executables: exec1, exec2, and exec3. I want to profile exec3, but it is failing.
The run script follows:

#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=40
#SBATCH --output=app.out
#SBATCH --error=app.err
#SBATCH -p Intel_6248_2s_20c_2t_GPU_hdr100_192GB_2933
#SBATCH --exclusive
#SBATCH --gres=gpu:4

WRAPPER=/run/acc_round_robin.sh

exec1=$workdir/exec/prog1
exec2=$workdir/exec/prog2
exec3=$workdir/exec/prog3

echo "0 $WRAPPER $exec1" > $workdir/file.conf
echo "2-9,11-19,21-29,32-39 $WRAPPER $exec2" >> $workdir/file.conf
echo "1,10,20,30,31 nsys profile $WRAPPER $exec3" >> $workdir/file.conf

echo '#!/bin/bash' > $workdir/file1_cmd
echo "srun --multi-prog $workdir/file.conf" >> $workdir/file1_cmd

echo "exit 1" >> $workdir/file1_cmd
chmod +x $workdir/file1_cmd

/usr/bin/time ./FORECAST forecast ./config
date
TEND=`echo "print time();" | perl`

echo "++++ Total elapsed time `expr $TEND - $TBEGIN` seconds"

Run: sbatch run.sh

I just noticed this in your original post. We only support UCX with our OpenMPI 4.x build, and the first release to include OpenMPI 4.0.5 was 20.11, as a beta. In other words, if you’re using NVHPC 20.7, then you’re not using UCX.

Maybe try with NVHPC 21.3 and OpenMPI 4 if UCX is needed? Or don’t use the “acc_round_robin.sh” script, in case the setting of the environment variables is causing issues?

-Mat

Hi Mat, thanks for the reply. My application runs well with OpenMPI (3.1.6) from the NVIDIA HPC SDK (20.7) with UCX enabled.
I tried the combination you suggested, and my application crashes. I also tried other combinations of OpenMPI and HPC SDK versions, but it still crashes.

If you have the chance, please try the most recent version of Nsight Systems (2021.2.1). You can also try different options to the trace switch, e.g. -t mpi,cuda,nvtx (disables osrt, which caused problems with UCX in the past).

Could you provide some more details on the crash? A backtrace would be best. Could you also isolate the final nsys profile ... command?
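As a starting point for isolating the command, here is a minimal sketch of how the multi-prog configuration could be written so that only the prog3 ranks run under nsys. It reuses the paths and rank lists from the run script in the original post and the -t mpi,cuda,nvtx selection suggested above. The report name prog3_rank%t is an assumption (srun replaces %t in arguments with the task number), as is the fallback to a temporary directory when workdir is unset.

```shell
#!/bin/bash
# Sketch only: paths and rank lists come from the run script in the original post.
# Assumption: fall back to a temporary directory if workdir is not set.
workdir=${workdir:-$(mktemp -d)}
WRAPPER=/run/acc_round_robin.sh
exec1=$workdir/exec/prog1
exec2=$workdir/exec/prog2
exec3=$workdir/exec/prog3

# Trace MPI, CUDA and NVTX only (osrt is left out).
# Assumption: one report per task, named via srun's %t (task number).
NSYS="nsys profile -t mpi,cuda,nvtx -o $workdir/prog3_rank%t"

# srun --multi-prog format: <task ranks> <executable> <arguments>
cat > "$workdir/file.conf" <<EOF
0 $WRAPPER $exec1
2-9,11-19,21-29,32-39 $WRAPPER $exec2
1,10,20,30,31 $NSYS $WRAPPER $exec3
EOF

# Print the line that srun will run for the profiled ranks.
grep 'nsys profile' "$workdir/file.conf"
```

The printed line is the fully expanded command for ranks 1, 10, 20, 30 and 31, which is the piece worth posting along with a backtrace.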

Hi Rdietrich, I have reduced the execution time of the application. Previously it was around 20 minutes (#SBATCH --time=0:20:00); now I have reduced it to 10 minutes. After that, it started collecting all 5 reports.

But now I have another problem: it is not showing any CUDA/OpenACC traces in the profile. I checked the diagnostic summary of the profile, and it shows 6 warnings:

1) Installed CUDA driver version (11.2) is not supported by this build of Nsight Systems. CUDA trace will be collected using libraries for driver version 11.0.

2) Installed CUDA driver version (11.2) is not supported by this build of Nsight Systems. CUDA trace will be collected using libraries for driver version 11.0.

3) No CUDA events collected. Does the process use CUDA?

4) Not all OpenACC events might have been collected.

5) No OpenACC events collected. Does the process use OpenACC?

6) Failed to connect to the application. Has it been run with Injection library?

Nsight Systems version: 2021.2.1
CUDA version: 11.2
Driver version: 460.32.03
NVIDIA-SMI: 460.32.03

I tried changing the CUDA version to 11.0, but the issue remains.

I cannot figure out what else could be causing this.

Warnings 1) and 2) should not occur with Nsight Systems 2021.2.1, which supports CUDA 11.3. Can you check that the paths to Nsight Systems are set correctly? There should be a library libcupti.so.11.2 in the target-linux-64 folder of your nsys installation. (I assume Linux on x86.) Maybe you are still using the Nsight Systems that is shipped with the HPC SDK.

Otherwise, does it work for a simple OpenACC app, a simple MPI+OpenACC app?
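One way to run the path check is sketched below. It prints which nsys binary is first on PATH, its version, and any libcupti.so* files found under that installation. The helper name list_cupti is made up for this sketch, and the script assumes the installation root is two directory levels above the resolved nsys binary, which may not hold for every install layout.

```shell
#!/bin/bash
# list_cupti ROOT: print every libcupti.so* file found below ROOT (hypothetical helper).
list_cupti() {
    find "$1" -name 'libcupti.so*' 2>/dev/null | sort
}

# Which nsys does the shell pick up first?
nsys_bin=$(command -v nsys || true)

if [ -z "$nsys_bin" ]; then
    echo "nsys is not on PATH"
else
    # Resolve symlinks so an HPC SDK copy and a standalone install can be told apart.
    nsys_real=$(readlink -f "$nsys_bin")
    echo "nsys on PATH: $nsys_real"
    "$nsys_bin" --version
    # Assumption: the install root is two levels above the resolved binary.
    list_cupti "$(dirname "$(dirname "$nsys_real")")"
fi
```

If the printed path points into the HPC SDK tree, or no libcupti.so.11.2 shows up in the listing, the job is probably picking up a different Nsight Systems than the one intended.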

Hi Rdietrich, my problem is resolved now. You were correct: I was using the Nsight Systems that is shipped with the HPC SDK. I changed the path to the system-installed Nsight Systems.

Thanks and Regards,
Hemant Giri