Hi,
I am trying to run the HPL benchmark as follows -
[user@node1 ~]$ docker container run --interactive --privileged --tty nvcr.io/nvidia/hpc-benchmarks:21.4-hpl /bin/bash
Detected MOFED 5.4-1.0.3.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
root@d112f848d074:/workspace# mpirun --bind-to none -np 4 hpl.sh --config dgx-a100 --dat HPL.dat
INFO: host=d112f848d074 rank=0 lrank=0 cores=16 gpu=0 cpu=32-47 mem=2 net=mlx5_0:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=1 lrank=1 cores=16 gpu=1 cpu=48-63 mem=3 net=mlx5_1:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=2 lrank=2 cores=16 gpu=2 cpu=0-15 mem=0 net=mlx5_2:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=3 lrank=3 cores=16 gpu=3 cpu=16-31 mem=1 net=mlx5_3:1 bin=/workspace/hpl-linux-x86_64/xhpl
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21697,1],1]
Exit code: 127
--------------------------------------------------------------------------
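Exit code 127 corresponds to the shared-library load failure above. To confirm that none of the driver-provided libraries are visible inside the container at all, I assume a check like this (run in the same container shell) should come back empty -
root@d112f848d074:/workspace# ldconfig -p | grep -i libnvidia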
I can see that the libnvidia-ml.so.1 library is not being found inside the container -
root@d112f848d074:/workspace# ldd /workspace/hpl-linux-x86_64/xhpl
linux-vdso.so.1 (0x00008000000d2000)
libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00007ffff70b3000)
libmkl_intel_thread.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00007ffff37c8000)
libmkl_core.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007fffef187000)
libcudart.so.11.0 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007fffeeef8000)
libcublas.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.11 (0x00007fffe7f35000)
libnvidia-ml.so.1 => not found
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fffe7b97000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fffe7993000)
libmpi.so.40 => /usr/local/openmpi/lib/libmpi.so.40 (0x00007fffe765f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fffe7440000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fffe704f000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffff7dd3000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fffe6e47000)
libcublasLt.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007fffdb266000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fffdaedd000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fffdacc5000)
libopen-rte.so.40 => /usr/local/openmpi/lib/libopen-rte.so.40 (0x00007fffdaa0f000)
libopen-pal.so.40 => /usr/local/openmpi/lib/libopen-pal.so.40 (0x00007fffda6f5000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fffda4f2000)
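As far as I understand, libnvidia-ml.so.1 (NVML) ships with the NVIDIA driver rather than with the CUDA toolkit, so it would not be baked into the image; it should be provided to the container from the host at runtime. On the host, where the driver is installed, I would expect it to resolve, which can be checked with -
[user@node1 ~]$ ldconfig -p | grep libnvidia-ml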
In addition, the nvidia-smi utility is also unavailable in the container -
root@d112f848d074:/workspace# nvidia-smi
bash: nvidia-smi: command not found
What additional steps do I need to take to get the libnvidia-ml.so.1 library and the nvidia-smi command working inside the container?
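My current guess is that I launched the container without GPU access, so the NVIDIA container runtime never injected the driver libraries. Assuming the NVIDIA Container Toolkit is installed on the host, would something like this be the correct way to start it -
[user@node1 ~]$ docker container run --gpus all --interactive --privileged --tty nvcr.io/nvidia/hpc-benchmarks:21.4-hpl /bin/bash
(or, on older nvidia-docker2 setups, --runtime=nvidia in place of --gpus all)?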