Hi,
I am trying to run the HPL benchmark as follows -
[user@node1 ~]$ docker container run --interactive --privileged --tty nvcr.io/nvidia/hpc-benchmarks:21.4-hpl /bin/bash
Detected MOFED 5.4-1.0.3.
NOTE: MOFED driver was detected, but nv_peer_mem driver was not detected.
Multi-node communication performance may be reduced.
root@d112f848d074:/workspace# mpirun --bind-to none -np 4 hpl.sh --config dgx-a100 --dat HPL.dat
INFO: host=d112f848d074 rank=0 lrank=0 cores=16 gpu=0 cpu=32-47 mem=2 net=mlx5_0:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=1 lrank=1 cores=16 gpu=1 cpu=48-63 mem=3 net=mlx5_1:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=2 lrank=2 cores=16 gpu=2 cpu=0-15 mem=0 net=mlx5_2:1 bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=d112f848d074 rank=3 lrank=3 cores=16 gpu=3 cpu=16-31 mem=1 net=mlx5_3:1 bin=/workspace/hpl-linux-x86_64/xhpl
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
/workspace/hpl-linux-x86_64/xhpl: error while loading shared libraries: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[21697,1],1]
Exit code: 127
--------------------------------------------------------------------------
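Exit code 127 corresponds to the shared-library load failure above. To confirm that none of the driver-provided libraries are visible inside the container at all, I assume a check like this (run in the same container shell) should come back empty -
root@d112f848d074:/workspace# ldconfig -p | grep -i libnvidia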
I can see that the libnvidia-ml.so.1 library is not being found inside the container -
root@d112f848d074:/workspace# ldd /workspace/hpl-linux-x86_64/xhpl
linux-vdso.so.1 (0x00008000000d2000)
libmkl_intel_lp64.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_intel_lp64.so (0x00007ffff70b3000)
libmkl_intel_thread.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_intel_thread.so (0x00007ffff37c8000)
libmkl_core.so => /opt/intel/compilers_and_libraries_2020.4.304/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007fffef187000)
libcudart.so.11.0 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0 (0x00007fffeeef8000)
libcublas.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.11 (0x00007fffe7f35000)
libnvidia-ml.so.1 => not found
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fffe7b97000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fffe7993000)
libmpi.so.40 => /usr/local/openmpi/lib/libmpi.so.40 (0x00007fffe765f000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fffe7440000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fffe704f000)
/lib64/ld-linux-x86-64.so.2 (0x00007ffff7dd3000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fffe6e47000)
libcublasLt.so.11 => /usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.11 (0x00007fffdb266000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fffdaedd000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fffdacc5000)
libopen-rte.so.40 => /usr/local/openmpi/lib/libopen-rte.so.40 (0x00007fffdaa0f000)
libopen-pal.so.40 => /usr/local/openmpi/lib/libopen-pal.so.40 (0x00007fffda6f5000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fffda4f2000)
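As far as I understand, libnvidia-ml.so.1 (NVML) ships with the NVIDIA driver rather than with the CUDA toolkit, so it would not be baked into the image; it should be provided to the container from the host at runtime. On the host, where the driver is installed, I would expect it to resolve, which can be checked with -
[user@node1 ~]$ ldconfig -p | grep libnvidia-ml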
In addition, the nvidia-smi utility is also unavailable in the container -
root@d112f848d074:/workspace# nvidia-smi
bash: nvidia-smi: command not found
What additional steps do I need to take to get the libnvidia-ml.so.1 library and the nvidia-smi command working inside the container?
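My current guess is that I launched the container without GPU access, so the NVIDIA container runtime never injected the driver libraries. Assuming the NVIDIA Container Toolkit is installed on the host, would something like this be the correct way to start it -
[user@node1 ~]$ docker container run --gpus all --interactive --privileged --tty nvcr.io/nvidia/hpc-benchmarks:21.4-hpl /bin/bash
(or, on older nvidia-docker2 setups, --runtime=nvidia in place of --gpus all)?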