MPI Multi-GPU process list in nvidia-smi

Hi,

I noticed a strange change in behavior when running my multi-GPU MPI+OpenACC code. I think the change occurred in the last few driver updates (or maybe a CUDA update?). I am using Ubuntu 20.04.

Basically, when I run my code on 4 GPUs with “mpiexec -np 4”, I would usually only see 4 processes in the list shown with nvidia-smi.
Now when I do it, I see 16 processes listed, and on each GPU three of the processes show 0 memory/activity and correspond to the process that is active on another GPU.
Are these communication processes?
Why was this not shown before?

When I run htop, I only see 4 CPU processes as before.

nvidia-smi output:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3731      C   ./mas                             649MiB |
|    0   N/A  N/A      3732      C   ./mas                               0MiB |
|    0   N/A  N/A      3733      C   ./mas                               0MiB |
|    0   N/A  N/A      3734      C   ./mas                               0MiB |
|    1   N/A  N/A      3731      C   ./mas                               0MiB |
|    1   N/A  N/A      3732      C   ./mas                             645MiB |
|    1   N/A  N/A      3733      C   ./mas                               0MiB |
|    1   N/A  N/A      3734      C   ./mas                               0MiB |
|    2   N/A  N/A      3731      C   ./mas                               0MiB |
|    2   N/A  N/A      3732      C   ./mas                               0MiB |
|    2   N/A  N/A      3733      C   ./mas                             645MiB |
|    2   N/A  N/A      3734      C   ./mas                               0MiB |
|    3   N/A  N/A      3731      C   ./mas                               0MiB |
|    3   N/A  N/A      3732      C   ./mas                               0MiB |
|    3   N/A  N/A      3733      C   ./mas                               0MiB |
|    3   N/A  N/A      3734      C   ./mas                             645MiB |
+-----------------------------------------------------------------------------+
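(As an aside, if the zero-memory rows are just clutter, one way to hide them is to filter nvidia-smi's per-process query output. A minimal sketch, assuming the standard `nvidia-smi --query-compute-apps=pid,used_memory --format=csv` output format; the sample text below is illustrative, not captured from my system:)

```python
# Sketch: keep only the nvidia-smi process rows that actually hold GPU
# memory, dropping the phantom 0 MiB entries. SAMPLE stands in for the
# CSV output of: nvidia-smi --query-compute-apps=pid,used_memory --format=csv
SAMPLE = """\
pid, used_gpu_memory [MiB]
3731, 649 MiB
3732, 0 MiB
3733, 0 MiB
3734, 0 MiB
"""

def active_pids(csv_text):
    """Return PIDs whose reported GPU memory usage is nonzero."""
    pids = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        pid, mem = [field.strip() for field in line.split(",")]
        if int(mem.split()[0]) > 0:
            pids.append(int(pid))
    return pids

print(active_pids(SAMPLE))  # -> [3731]
```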

  • Ron

Hi Ron,

What driver version are you using? Unfortunately I haven't seen this before, so I don't know what's going on. The fact that the extra processes have no memory is odd, but it does rule out that they are extra contexts being created.

Maybe running the code through Nsight Systems would show where these are coming from? Does performance seem to be affected?

-Mat

Hi,

The performance does not seem to be affected (whew!).

I am using:
Driver Version: 455.32.00 CUDA Version: 11.1

On:
Ubuntu 20.04 kernel 5.4.0-52-generic
using
NVHPC 20.9 with its bundled OpenMPI 3, plus OpenACC
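(For reference, each rank picks its device from its node-local rank in the usual round-robin way. A hedged sketch of that mapping; my actual code does this with OpenACC's `acc_set_device_num` rather than Python:)

```python
# Sketch of the usual MPI local-rank -> GPU assignment (illustrative only;
# the real code calls OpenACC's acc_set_device_num with this device index).
def device_for_rank(local_rank, num_gpus):
    """Round-robin a node-local MPI rank onto one of the node's GPUs."""
    return local_rank % num_gpus

# With 4 ranks and 4 GPUs, each rank gets its own device:
print([device_for_rank(r, 4) for r in range(4)])  # -> [0, 1, 2, 3]
```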

My topology is:
      GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity
GPU0   X    NV2   SYS   SYS   0-127         N/A
GPU1  NV2    X    PHB   SYS   0-127         N/A
GPU2  SYS   PHB    X    NV2   0-127         N/A
GPU3  SYS   SYS   NV2    X    0-127         N/A

  • Ron

Hi Ron,

I was able to reproduce this on a system with a CUDA 11.1 driver. As far as we can tell, this appears to be extra reporting of the processes running on the other GPUs, and is most likely benign.

-Mat

Hi,

OK good to know.

So this is an issue with nvidia-smi itself?

  • Ron

> So this is an issue with nvidia-smi itself?

Sorry, I'm not sure if it's from nvidia-smi or the driver.

-Mat