acc_get_num_devices only finds one of the two GPUs (nvhpc/24.3)

Hi!
I do not know if this is an OpenACC question or a bad configuration of my host. I had some trouble with nvhpc/24.1 running an MPI code on 2 GPUs, so I updated to nvhpc/24.3 and, at the same time, updated my drivers to nvidia-driver-550.54.15-1.el8.x86_64.
But now calling numDevice = acc_get_num_devices(acc_get_device_type()) in Fortran reports only one available GPU.
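
For reference, the call in my test case boils down to something like this minimal sketch (the program and variable names here are just illustrative):

program count_devices
  use openacc
  implicit none
  integer :: numDevice
  ! Ask the OpenACC runtime how many devices of the current
  ! default device type are available
  numDevice = acc_get_num_devices(acc_get_device_type())
  print *, 'acc_get_num_devices returned: ', numDevice
end program count_devices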

The node has two A100 GPUs, one with 40 GB and the other with 80 GB of memory.

lspci does not identify the second one properly (I suppose):

# lspci |grep NVIDIA
25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
e2:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)

The nvidia kernel modules are loaded, however:

# lspci -k -s 25:00.0
25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
	Subsystem: NVIDIA Corporation Device 145f
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia
# lspci -k -s e2:00.0
e2:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
	Subsystem: NVIDIA Corporation Device 1533
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

And nvidia-smi shows the two devices properly:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:25:00.0 Off |                    0 |
| N/A   30C    P0             35W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off |   00000000:E2:00.0 Off |                    0 |
| N/A   32C    P0             65W /  300W |       0MiB /  81920MiB |     20%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

So I do not understand whether there is a setup problem, or whether the acc_get_device_type() call is filtering the result of my acc_get_num_devices(...) call with this new nvhpc version. I also do not understand how to query all the available devices.

I’ve tried adding the kernel parameters pci=realloc and pci=realloc=off, as suggested in A100 V100 Driver not working - #3 by sladewang1995, but it doesn’t help.

I’m loading nvhpc-hpcx-cuda12/24.3 to build a small Fortran + MPI + OpenACC test case.

Here are the NVIDIA packages installed on the node (in case there is a conflict among them):

  • nvidia-driver-cuda-libs-550.54.15-1.el8.x86_64

  • nvidia-driver-cuda-550.54.15-1.el8.x86_64

  • nvidia-libXNVCtrl-550.54.15-1.el8.x86_64

  • nvidia-modprobe-550.54.15-1.el8.x86_64

  • nvidia-fs-2.19.7-1.x86_64

  • nvidia-driver-550.54.15-1.el8.x86_64

  • nvidia-driver-NVML-550.54.15-1.el8.x86_64

  • nvidia-driver-devel-550.54.15-1.el8.x86_64

  • nvidia-xconfig-550.54.15-1.el8.x86_64

  • nvidia-gds-12-4-12.4.1-1.x86_64

  • dnf-plugin-nvidia-2.0-1.el8.noarch

  • nvidia-kmod-common-550.54.15-1.el8.noarch

  • nvidia-persistenced-550.54.15-1.el8.x86_64

  • nvidia-libXNVCtrl-devel-550.54.15-1.el8.x86_64

  • nvidia-gds-12.4.1-1.x86_64

  • kmod-nvidia-latest-dkms-550.54.15-1.el8.x86_64

  • nvidia-settings-550.54.15-1.el8.x86_64

  • nvidia-fs-dkms-2.19.7-1.x86_64

  • nvidia-driver-libs-550.54.15-1.el8.x86_64

  • nvidia-driver-NvFBCOpenGL-550.54.15-1.el8.x86_64

  • nvhpc-24-3-24.3-1.x86_64

  • nvhpc-24.3-1.x86_64

Thanks for your help.

Patrick

Hi Patrick,

Is the environment variable “CUDA_VISIBLE_DEVICES” set in your environment, either directly or by a script or batch scheduler?

If so, this masks out the other device, so the runtime would only see one device.

If not, can you run the “nvaccelinfo” utility? It uses the same query as our OpenACC runtime.
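
If you prefer to check from inside the Fortran program itself, a quick sketch along these lines should do it (the buffer length and names are arbitrary):

program check_visible
  implicit none
  character(len=128) :: mask
  integer :: stat
  ! status 0 means the variable exists and was copied into mask,
  ! status 1 means it is not set
  call get_environment_variable('CUDA_VISIBLE_DEVICES', mask, status=stat)
  if (stat == 0) then
     print *, 'CUDA_VISIBLE_DEVICES = ', trim(mask)
  else
     print *, 'CUDA_VISIBLE_DEVICES is not set'
  end if
end program check_visible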

-Mat

Hi Mat,

I’ve found the error! After a long coding day and the upgrade of the GPU node, I missed the -acc flag at compile time with the new SDK version for my test case. Without it:

  • nvfortran compiles the code (the openacc module is found, so acc_get_num_devices is known),
  • linking the executable works, since no symbols remain undefined for the acc_get_num_devices call,
  • but the binary returns wrong results.

With the -acc flag, the code runs correctly and both GPUs are found.

Another mistake on my part.

Patrick

Glad you were able to find the problem!

Any idea why building the executable works when calling acc_get_num_devices without the -acc flag, but the result of this function is wrong?

Without the “-acc” flag, the device type is the host, so the number of devices would be 1.
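
As a rough illustration (a sketch using the device-type constants from the openacc module; the program name is just for the example), the same query would take different branches depending on whether the code was built with -acc:

program report_device_type
  use openacc
  implicit none
  ! Built with -acc targeting NVIDIA GPUs, the default device type is
  ! acc_device_nvidia; built without -acc, the runtime falls back to the
  ! host device type and reports a single device.
  if (acc_get_device_type() == acc_device_nvidia) then
     print *, 'NVIDIA devices visible: ', acc_get_num_devices(acc_device_nvidia)
  else
     print *, 'Host device type, count: ', acc_get_num_devices(acc_get_device_type())
  end if
end program report_device_type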
