Nvshmem error in docker HPL benchmark

marco.faltelli1 · December 3, 2024, 3:16pm

Hi,

I’m trying to run the NVIDIA HPL benchmark as described on the NVIDIA HPC-Benchmarks page on NVIDIA NGC.
I’m running it inside a VM with a vGPU attached (MIG mode).
When I run hpl.sh, I get the following errors:

HPL-NVIDIA settings from environment variables:
--- DEVICE INFO ---
  Peak clock frequency: 1410 MHz
  SM version          : 80
  Number of SMs       : 42
-------------------
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/mem/mem.cpp:298: non-zero status: 801 cuMemGetAllocationGranularity failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:966: non-zero status: 7 nvshmem setup local heap failed 

[HPL TRACE] cuda_nvshmem_init: max=0.0414 (0) min=0.0414 (0)
[WARNING] Change Input N 92800 to 92160
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/mem/mem.cpp:298: non-zero status: 801 cuMemGetAllocationGranularity failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:966: non-zero status: 7 nvshmem setup local heap failed 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Resource temporarily unavailable, exiting... mutex destroy failed
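
For readers triaging a log like the one above, it can help to translate the numeric status on the failing CUDA driver call into its symbolic name. The sketch below is only an illustration: the function names `cuda_status_name` and `decode_log` are made up for this example, and the table covers just a few values from `cuda.h`. It deliberately skips lines such as `non-zero status: 7 nvshmem setup local heap failed`, because that number is an NVSHMEM-internal status, not a CUDA error code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a few CUDA driver API status codes (values from cuda.h)
# to their symbolic names. The table is intentionally incomplete.
cuda_status_name() {
  case "$1" in
    2)   echo "CUDA_ERROR_OUT_OF_MEMORY" ;;
    100) echo "CUDA_ERROR_NO_DEVICE" ;;
    500) echo "CUDA_ERROR_NOT_FOUND" ;;
    801) echo "CUDA_ERROR_NOT_SUPPORTED" ;;
    *)   echo "unknown ($1)" ;;
  esac
}

# Hypothetical helper: read a log on stdin and, for every line of the form
# "non-zero status: <N> cu<Call> failed", print the call and the decoded status.
decode_log() {
  sed -n 's/.*non-zero status: \([0-9][0-9]*\) \(cu[A-Za-z]*\) failed.*/\1 \2/p' |
  while read -r code call; do
    echo "$call -> $(cuda_status_name "$code")"
  done
}
```

If the output above were saved to a file (say `hpl.log`, a name assumed here), `decode_log < hpl.log` would report `cuMemGetAllocationGranularity -> CUDA_ERROR_NOT_SUPPORTED` once per occurrence. `cuMemGetAllocationGranularity` belongs to the CUDA virtual memory management API, so this status would be consistent with that API not being available on this vGPU configuration, although nothing in this thread confirms it.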

This is how I launch the container:
sudo docker run --rm --runtime=nvidia --gpus all --shm-size=1g --privileged -i -t nvcr.io/nvidia/hpc-benchmarks:24.09 /bin/bash

This is how I launch the HPL benchmark:
./hpl.sh --dat hpl-linux-x86_64/sample-dat/HPL-1GPU.dat

Can you please help me? What is going wrong with nvshmem?

Kind regards,

fik · December 25, 2024, 12:04pm

I’m getting a similar error and I’m not sure what is wrong:

HPL-NVIDIA settings from environment variables:
--- DEVICE INFO ---
  Peak clock frequency: 1733 MHz
  SM version          : 61
  Number of SMs       : 15
-------------------
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:nvshmemi_get_mem_handle:79: Unable to access device state. 500

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:nvshmemi_get_mem_handle:85: Unable to access ibgda device state. 500

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:952: NULL value Unable to query pointer information.

[HPL TRACE] cuda_nvshmem_init: max=0.0018 (0) min=0.0018 (0)
[WARNING] Change Input N 92800 to 92160
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:nvshmemi_get_mem_handle:79: Unable to access device state. 500

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/device/init/init_device.cu:nvshmemi_get_mem_handle:85: Unable to access ibgda device state. 500

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:952: NULL value Unable to query pointer information.

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/host/init/init.cu:nvshmemi_check_state_and_init:1062: nvshmem initialization failed, exiting 

/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/main_nvshmem/src/util/cs.cpp:23: non-zero status: 16: Cannot allocate memory, exiting... mutex destroy failed
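
One thing that can be checked directly from this output is the `SM version` line in the device info. The sketch below assumes that NVSHMEM needs a Volta-class GPU (SM 70) or newer. That threshold is an assumption to verify against the NVSHMEM and HPC-Benchmarks release notes for the container version in use, and the function name `check_sm_version` is made up for this example.

```shell
#!/usr/bin/env bash
# Assumption: NVSHMEM needs Volta (SM 70) or newer. Verify this against the
# release notes for the NVSHMEM / HPC-Benchmarks version actually in use.
MIN_SM=70

# Hypothetical helper: read HPL output on stdin, extract the first
# "SM version : <N>" line, and compare N against MIN_SM.
check_sm_version() {
  local sm
  sm=$(sed -n 's/^ *SM version *: *\([0-9][0-9]*\).*/\1/p' | head -n 1)
  if [ -z "$sm" ]; then
    echo "no SM version found"
    return 2
  fi
  if [ "$sm" -ge "$MIN_SM" ]; then
    echo "SM $sm: meets assumed minimum $MIN_SM"
  else
    echo "SM $sm: below assumed minimum $MIN_SM"
    return 1
  fi
}
```

Fed the device info above, this prints `SM 61: below assumed minimum 70`, which suggests the GPU generation is a possible cause of the failure. For the first log in this thread (SM 80) the check passes, so the two failures may well have different causes.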
