Hi,
I’m trying to run the NVIDIA HPL benchmarks as described on the NVIDIA HPC-Benchmarks page on NVIDIA NGC.
I’m running this inside a VM with a vGPU attached (MIG mode).
If I try to run hpl.sh, I get the following errors:
HPL-NVIDIA settings from environment variables:
--- DEVICE INFO ---
Peak clock frequency: 1410 MHz
SM version : 80
Number of SMs : 42
-------------------
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/mem/mem.cpp:298: non-zero status: 801 cuMemGetAllocationGranularity failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:966: non-zero status: 7 nvshmem setup local heap failed
[HPL TRACE] cuda_nvshmem_init: max=0.0414 (0) min=0.0414 (0)
[WARNING] Change Input N 92800 to 92160
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/mem/mem.cpp:298: non-zero status: 801 cuMemGetAllocationGranularity failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:966: non-zero status: 7 nvshmem setup local heap failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Resource temporarily unavailable, exiting... mutex destroy failed
This is how I launch the container:
sudo docker run --rm --runtime=nvidia --gpus all --shm-size=1g --privileged -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash
This is how I launch the HPL benchmark:
./hpl.sh --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat
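One variation I could test, in case it helps narrow things down: as far as I can tell, status 801 is CUDA_ERROR_NOT_SUPPORTED, and cuMemGetAllocationGranularity belongs to the CUDA virtual memory management (VMM) API. My assumption, not confirmed anywhere above, is that this API is not available on my vGPU, and that the NVSHMEM environment variable NVSHMEM_DISABLE_CUDA_VMM would make NVSHMEM skip it. A minimal sketch of what I would set inside the container before launching:

```shell
# Assumption: the vGPU does not support the CUDA VMM API, so ask NVSHMEM
# not to use it for its heap. Untested in this setup.
export NVSHMEM_DISABLE_CUDA_VMM=1

# Confirm the variable is visible to child processes before launching.
echo "NVSHMEM_DISABLE_CUDA_VMM=${NVSHMEM_DISABLE_CUDA_VMM}"

# Then launch exactly as before:
#   ./hpl.sh --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat
```

I have not verified whether hpl.sh passes this variable through to the benchmark processes.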
Can you please help me? What is going wrong with nvshmem?
Kind regards,