Unwanted duplicate threads/processes on dual P6000

On a workstation with
CentOS 8, kernel 4.18.0-193.19.1.el8_2.x86_64, driver 455.23.05,
we are experiencing memory-related exceptions on compute jobs that completed successfully on the same hardware before we migrated from Ubuntu. We noticed that jobs spawn twice the number of threads requested, and wonder whether the display-related duplication of processes across the two GPUs (see nvidia-smi snippet below) is a symptom of the same problem.
Creating a new xorg.conf (after deleting the old one) with the --busid=PCI:... nvidia-xconfig option fails to prevent the duplicate Xorg threads. Could this duplication be due to some hardware or BIOS configuration?

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17689      G   /usr/libexec/Xorg                  58MiB |
|    0   N/A  N/A     17834      G   /usr/bin/gnome-shell               16MiB |
|    1   N/A  N/A     17689      G   /usr/libexec/Xorg                  58MiB |
|    1   N/A  N/A     17834      G   /usr/bin/gnome-shell               16MiB |
+-----------------------------------------------------------------------------+
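
For anyone who wants to check for this pattern without reading the table by eye, here is a minimal Python sketch that flags PIDs listed on more than one GPU. It assumes the process table layout shown above (other driver versions may print different columns) and process names without spaces; the function name pids_on_multiple_gpus is made up for illustration.

```python
import re
from collections import defaultdict

# Matches one row of the "Processes" table in the layout shown above:
# GPU index, GI ID, CI ID, PID, type, process name, memory in MiB.
# Assumption: the process name contains no spaces.
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>\S+)\s+(?P<mem>\d+)MiB\s+\|$"
)

def pids_on_multiple_gpus(smi_text):
    """Return {pid: (name, sorted GPU indices)} for PIDs listed on more than one GPU."""
    gpus = defaultdict(set)
    names = {}
    for line in smi_text.splitlines():
        m = ROW.match(line.strip())
        if m:
            pid = int(m.group("pid"))
            gpus[pid].add(int(m.group("gpu")))
            names[pid] = m.group("name")
    return {pid: (names[pid], sorted(g)) for pid, g in gpus.items() if len(g) > 1}

# The four process rows from the snippet above
sample = """\
|    0   N/A  N/A     17689      G   /usr/libexec/Xorg                  58MiB |
|    0   N/A  N/A     17834      G   /usr/bin/gnome-shell               16MiB |
|    1   N/A  N/A     17689      G   /usr/libexec/Xorg                  58MiB |
|    1   N/A  N/A     17834      G   /usr/bin/gnome-shell               16MiB |
"""
print(pids_on_multiple_gpus(sample))
```

On the rows above it reports both Xorg (17689) and gnome-shell (17834) as present on GPUs 0 and 1.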

I had the same problem with the same CentOS version and nvidia driver version as you describe, on two quad-GPU workstations at work (one quad-GTX 1070 Ti and one quad-RTX 2080 Ti). Downgrading to driver version 440.33.01 (the default one with CUDA 10.2) fixed it on both workstations. I don’t know why the latest driver did that, but I will keep this slightly older driver version because I need a readable output of nvidia-smi (I use it a lot to monitor the initial iterations of long-running GPU compute jobs; not convenient if it is cluttered like that).

I have the same problem with AlmaLinux 8.5 (RHEL-based) and two A6000 GPUs. I have installed several different NVIDIA drivers (460+) and I always get duplicate processes on both GPUs.
I don’t know if this is expected behaviour, but I suspect that I can only use half of the total GPU memory (i.e. the memory of one card). Both GPUs always show the same amount of memory allocated.

In the Python code only the second GPU is being utilized, but as you can see memory is also being allocated on the first one. As a consequence, if I want to run two processes, each on a different GPU, memory is consumed quickly because the duplicate process allocates memory on the other GPU.
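
One way to confirm the mirrored allocation is to read per-GPU memory usage in machine-readable form while a job runs. The following is a rough Python sketch, assuming nvidia-smi is on the PATH; the helper names and the example values are hypothetical placeholders, not measurements from this machine.

```python
import subprocess

# nvidia-smi query for per-GPU memory as bare CSV (values in MiB)
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Map GPU index -> (used MiB, total MiB) from the CSV output of QUERY."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage

def query_gpu_memory():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)

# Placeholder input in the same CSV shape; the numbers are made up
example = "0, 100, 1000\n1, 100, 1000\n"
print(parse_gpu_memory(example))
```

If a job pinned to one GPU makes the used figure rise by the same amount on both indices, that matches the behaviour described here.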

Please check if this helps:
https://forums.developer.nvidia.com/t/memory-is-allocated-on-all-gpus/183110/2?u=generix


You saved me one more day of googling and experimenting with different drivers, before proceeding to install Ubuntu.
I disabled BaseMosaic in “/etc/X11/xorg.conf.d/10-nvidia.conf” and everything works as expected.
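
For others checking their own setup, here is a small Python sketch that tests whether a config file still enables BaseMosaic. The sample file content is hypothetical (the actual contents of 10-nvidia.conf are not shown in this thread), and the function name is made up. It assumes the usual Xorg convention that a boolean option with no value, or with "1", "on", "true" or "yes", counts as enabled.

```python
import re

# Matches an uncommented line such as:  Option "BaseMosaic" "on"
# The second quoted value is optional.
OPTION = re.compile(
    r'^\s*Option\s+"BaseMosaic"(?:\s+"(?P<value>[^"]*)")?', re.IGNORECASE
)
TRUE_VALUES = {"", "1", "on", "true", "yes"}

def base_mosaic_enabled(conf_text):
    """Return True if any uncommented line enables the BaseMosaic option."""
    for line in conf_text.splitlines():
        active = line.split("#", 1)[0]  # drop trailing comments
        m = OPTION.match(active)
        if m and (m.group("value") or "").lower() in TRUE_VALUES:
            return True
    return False

# Hypothetical file content, for illustration only
sample = """\
Section "OutputClass"
    Identifier "nvidia"
    Driver "nvidia"
    Option "BaseMosaic" "on"
EndSection
"""
print(base_mosaic_enabled(sample))
print(base_mosaic_enabled(sample.replace('"on"', '"off"')))
```

To check a real file, pass it the text of /etc/X11/xorg.conf.d/10-nvidia.conf; after editing the option, X has to be restarted before nvidia-smi reflects the change.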

Thank you so much!! :)