How to enable NVLink SHARP on an 8xH100 system when one GPU is lost

We recently got an 8xH100 + 2x Xeon 8468 CPU system. Unfortunately, one GPU cannot be detected by the driver, so the topology is:


We are running a bus bandwidth test with NVLink SHARP on this system, but we only get a busBW of around 375 GB/s even with NCCL_ALGO=NVLS.
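For reference, the number above comes from an nccl-tests style run; a minimal sketch of the kind of command we use (the build path, message sizes, and GPU count are placeholders for our setup):

```
# Hypothetical nccl-tests run: force the NVLS (NVLink SHARP) algorithm and
# use the 7 GPUs that the driver currently detects.
NCCL_ALGO=NVLS NCCL_DEBUG=INFO \
  ./nccl-tests/build/all_reduce_perf -b 1G -e 8G -f 2 -g 7
```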
We also tested the multi_node_p2p sample from NVIDIA/multi-gpu-programming-models on GitHub (multi-gpu-programming-models/multi_node_p2p at master), which leverages NVLink SHARP to compute the L2 norm, but the code gives an error:

According to the CUDA Driver API documentation (CUDA Driver API :: CUDA Toolkit Documentation), this simply means the IMEX channel is not configured correctly.

My questions are:
If one GPU is lost on an 8xH100 system, is it still possible to enable NVLink SHARP?
If so, how do we enable it? Should we manually configure the driver, e.g. using nvidia-modprobe (see the sketch below)?
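To make the second question concrete, here is a sketch of what we imagine the manual IMEX setup might look like. The /dev/nvidia-caps-imex-channels path and the mknod approach are our assumption from reading the driver documentation, so please correct us if this is the wrong way to do it:

```
# Assumed manual IMEX setup: expose IMEX channel 0 as a character device so
# that fabric-handle (CU_MEM_HANDLE_TYPE_FABRIC) allocations can be imported.
major=$(awk '/nvidia-caps-imex-channels/ {print $1}' /proc/devices)
sudo mkdir -p /dev/nvidia-caps-imex-channels
sudo mknod /dev/nvidia-caps-imex-channels/channel0 c "$major" 0
ls -l /dev/nvidia-caps-imex-channels   # verify the channel device exists
```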

Hi @buaastv_yzl ,

I don’t know that those samples necessarily rely on SHARP, but nonetheless the error you’re seeing with the missing GPU should be avoidable. :-)

If you set CUDA_VISIBLE_DEVICES to 0,1,2,3, does it work (albeit with 4 GPUs)?
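Something along these lines (the launcher and binary name are placeholders for however you build and run the multi_node_p2p sample):

```
# Restrict the run to the four working GPUs; the sample should then only
# see (and build its multicast group over) devices 0-3.
CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 ./multi_node_p2p
```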

Why not make a support case to get that other GPU back online? (See DGX User Support for how to make a case with NVIDIA Enterprise Support).